Expectation Maximisation Algorithm for Bivariate Normal data
Aman Das
September 4, 2023
In this article we compute the Maximum Likelihood Estimate (MLE) for the parameters of a Bivariate Normal distribution.
We also learn to handle missing data points in our data set using the Expectation Maximisation (EM) Algorithm.
Say \(\vec{X} \sim f(\vec{x} \mid \theta)\).
Say we have a sample of \(\vec{X}\): \(\vec{x}\).
Then the likelihood function \(L(\theta \mid \vec{x})\) is a measure of how likely the occurrence of the sample is for a given \(\theta\):
\[ L(\theta \mid \vec{x}) = f(\vec{x} \mid \theta) \]
Often the log-likelihood \(\log L(\theta \mid \vec{x})\), denoted by \(l(\theta \mid \vec{x})\), is used for convenience.
Say \(\vec{X_1}, \vec{X_2}, \ldots, \vec{X_n} \sim f(\vec{x} \mid \theta)\).
But \(\theta\) is not known. We need to find an estimate of \(\theta\) based on the known sample data.
Thus the estimate is some function \(T(\vec{X_1}, \vec{X_2}, \ldots, \vec{X_n}) = \hat{\theta}\).
There are various ways to judge the usefulness of our estimated \(\hat{\theta}\).
The estimate \(\hat{\theta}\) that maximises the likelihood \(L(\hat{\theta} \mid \vec{x})\) is useful in many scenarios. We call such an estimate the Maximum Likelihood Estimate (MLE), denoted \(\hat{\theta}_{MLE}\).
The MLE may often be computed analytically from the log-likelihood using algebra, calculus and other techniques.
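For example, for a sample \(x_1, \ldots, x_n\) from \(N(\mu, \sigma^2)\) with \(\sigma^2\) known, maximising the log-likelihood yields the sample mean:
\[ l(\mu \mid \vec{x}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad \frac{\partial l}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu}_{MLE} = \bar{x} \]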
Sometimes there are missing data points in \(\vec{x_i}\). In such cases we may not be able to compute the MLE analytically, so a numerical approach called the Expectation Maximisation (EM) Algorithm is used.
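In outline, with \(Z\) denoting the missing data, each EM iteration alternates two steps:
\[ \text{E-step:} \quad Q(\theta \mid \theta^{(t)}) = E_{Z \mid \vec{x},\, \theta^{(t)}}\!\left[ l(\theta \mid \vec{x}, Z) \right] \]
\[ \text{M-step:} \quad \theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}) \]
In this article the E-step is carried out by imputing each missing value by its conditional expectation given the observed coordinate and the current estimates, and the M-step re-estimates the parameters from the completed data.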
The data is on “fasting” (F) and “after meal - postprandial” (PP) blood sugar levels (measured in mg/dL) of 20 individuals. For a few individuals, one of the two blood sugar levels was not measured as per the doctor’s recommendation. Assume that (F, PP) measurements are i.i.d. bivariate normal with unknown means, variances and correlation.
The data is imported from an OpenDocument spreadsheet, data.ods.
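A minimal sketch of this import, assuming the readODS package (the code further below also relies on dplyr, tibble and knitr):

```r
library(readODS)  # provides read_ods() for OpenDocument spreadsheets
library(dplyr)    # bind_rows() and the %>% pipe, used below
library(tibble)   # tibble() and add_row(), used below
library(knitr)    # kable(), used below

d = read_ods("data.ods")  # 20 rows of F and PP blood sugar levels
```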
We order the data set so that observations with the same pattern of missing values are grouped together (the corresponding split into complete and incomplete rows is sketched after the table).
F | PP |
---|---|
78 | 128 |
89 | 118 |
93 | 134 |
76 | 117 |
85 | 130 |
84 | 122 |
86 | 131 |
73 | 137 |
97 | 119 |
88 | 123 |
79 | 135 |
91 | NA |
77 | NA |
95 | NA |
92 | NA |
81 | NA |
NA | 127 |
NA | 129 |
NA | 125 |
NA | 121 |
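The code below refers to `da` (fully observed pairs) and `db` (rows with a missing value) without defining them; these names are taken from their later use. A minimal sketch of the split:

```r
cc = complete.cases(d)  # TRUE where both F and PP are observed
da = d[cc, ]            # the 11 fully observed pairs
db = d[!cc, ]           # the 9 rows with one missing measurement
```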
The distribution is considered to be bivariate normal. Thus the distribution parameters are \(\mu_X\), \(\mu_Y\), \(\sigma^2_X\), \(\sigma^2_Y\) and \(\rho\), where \(X\) denotes F and \(Y\) denotes PP.
The initial parameter estimates, computed from the 11 complete pairs, are (in the order \(\mu_X, \mu_Y, \sigma^2_X, \sigma^2_Y, \rho\)): 84.3636364, 126.7272727, 47.6528926, 49.1404959, -0.311683.
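These are presumably produced by the helper `estbvn` used in the iteration below, which is not shown in the source; a minimal sketch of one natural implementation (moment estimates with divisor \(n\), i.e. the MLEs for a bivariate normal sample):

```r
# Moment (maximum likelihood) estimates of the bivariate normal parameters,
# returned as theta = c(muX, muY, sigma2X, sigma2Y, rho)
estbvn = function(d) {
  n = nrow(d)
  muX = mean(d[[1]])
  muY = mean(d[[2]])
  s2X = sum((d[[1]] - muX)^2) / n  # MLE variance, divisor n
  s2Y = sum((d[[2]] - muY)^2) / n
  rho = sum((d[[1]] - muX) * (d[[2]] - muY)) / (n * sqrt(s2X * s2Y))
  c(muX, muY, s2X, s2Y, rho)
}

theta = estbvn(da)  # interim estimates from the complete cases
```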
We use the interim estimates thus obtained to infer the missing data values.
We make use of the conditional expectations of the bivariate normal distribution:
\[ E(Y \mid x) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X), \qquad E(X \mid y) = \mu_X + \rho \frac{\sigma_X}{\sigma_Y} (y - \mu_Y) \]
imptmissed = function(d, t) {
  # t = c(muX, muY, sigma2X, sigma2Y, rho)
  db1 = d[ !is.na(d[[1]]), ]  # F observed, PP missing
  db2 = d[ !is.na(d[[2]]), ]  # PP observed, F missing
  # impute PP by E(Y | x) and F by E(X | y)
  db1[ , 2] = t[2] + ( t[5] * sqrt(t[4]) * (db1[[1]] - t[1]) ) / sqrt(t[3])
  db2[ , 1] = t[1] + ( t[5] * sqrt(t[3]) * (db2[[2]] - t[2]) ) / sqrt(t[4])
  res = bind_rows(db1, db2)
  return(res)
}
d2 = bind_rows( da, imptmissed(db, theta) )  # complete rows plus imputed rows
The imputed data set is as follows:
F | PP |
---|---|
78.00000 | 128.0000 |
89.00000 | 118.0000 |
93.00000 | 134.0000 |
76.00000 | 117.0000 |
85.00000 | 130.0000 |
84.00000 | 122.0000 |
86.00000 | 131.0000 |
73.00000 | 137.0000 |
97.00000 | 119.0000 |
88.00000 | 123.0000 |
79.00000 | 135.0000 |
91.00000 | 124.6268 |
77.00000 | 129.0579 |
95.00000 | 123.3608 |
92.00000 | 124.3103 |
81.00000 | 127.7919 |
84.27993 | 127.0000 |
83.66607 | 129.0000 |
84.89379 | 125.0000 |
86.12150 | 121.0000 |
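As a check, the first imputed PP value follows directly from the conditional expectation evaluated at the interim estimates:
\[ E(PP \mid F = 91) = 126.7273 + (-0.3117) \, \frac{\sqrt{49.1405}}{\sqrt{47.6529}} \, (91 - 84.3636) \approx 124.6268 \]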
We continue the process, alternately computing interim estimates and imputing the missing values. We hope the sequence of estimates converges; let us observe:
# log the parameter estimates at every iteration, starting from the initial theta
thetalog = tibble(
  muX = theta[1],
  muY = theta[2],
  sigma2X = theta[3],
  sigma2Y = theta[4],
  rho = theta[5]
)
n_iter = 15
for ( i in seq(n_iter) ) {
  theta = estbvn(d2)  # M-step: re-estimate parameters from the completed data
  thetalog = thetalog %>% add_row(
    muX = theta[1],
    muY = theta[2],
    sigma2X = theta[3],
    sigma2Y = theta[4],
    rho = theta[5]
  )
  d2 = bind_rows(da, imptmissed(db, theta))  # E-step: re-impute missing values
}
kable(thetalog)
muX | muY | sigma2X | sigma2Y | rho |
---|---|---|---|---|
84.36364 | 126.7273 | 47.65289 | 49.14050 | -0.3116830 |
85.14806 | 126.2574 | 29.44717 | 40.65642 | -0.3758238 |
85.27806 | 126.1378 | 30.73413 | 40.63218 | -0.4205278 |
85.30227 | 126.1021 | 31.26110 | 40.69501 | -0.4353826 |
85.30641 | 126.0899 | 31.43959 | 40.71755 | -0.4399703 |
85.30689 | 126.0854 | 31.49526 | 40.72451 | -0.4413570 |
85.30682 | 126.0838 | 31.51250 | 40.72655 | -0.4417789 |
85.30674 | 126.0833 | 31.51789 | 40.72715 | -0.4419092 |
85.30670 | 126.0830 | 31.51959 | 40.72733 | -0.4419501 |
85.30668 | 126.0830 | 31.52014 | 40.72738 | -0.4419631 |
85.30667 | 126.0830 | 31.52032 | 40.72740 | -0.4419673 |
85.30667 | 126.0829 | 31.52038 | 40.72740 | -0.4419687 |
85.30666 | 126.0829 | 31.52040 | 40.72740 | -0.4419692 |
85.30666 | 126.0829 | 31.52041 | 40.72740 | -0.4419693 |
85.30666 | 126.0829 | 31.52041 | 40.72741 | -0.4419694 |
85.30666 | 126.0829 | 31.52041 | 40.72741 | -0.4419694 |
Notice that each of the estimate sequences converged within 15 iterations. We take the limits of these sequences as our estimates according to the EM algorithm.
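Instead of a fixed number of iterations, one could stop once successive estimates change by less than a tolerance; a minimal sketch of such a stopping rule (the tolerance 1e-8 is an arbitrary choice):

```r
# iterate until the largest change in any parameter falls below tol
tol = 1e-8
repeat {
  theta_new = estbvn(d2)
  d2 = bind_rows(da, imptmissed(db, theta_new))
  delta = max(abs(theta_new - theta))  # largest change in this iteration
  theta = theta_new
  if (delta < tol) break
}
```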
Thus our parameter estimates are:
muX | muY | sigma2X | sigma2Y | rho |
---|---|---|---|---|
85.30666 | 126.0829 | 31.52041 | 40.72741 | -0.4419694 |
Therefore, using an iterative method, one can compute the Maximum Likelihood estimates of the parameters even when some data points are missing.