Handling Missing Data in R

Expectation Maximisation Algorithm for Bivariate Normal data

Author

Aman Das

Published

September 4, 2023

In this article we compute Maximum-Likelihood Estimate for the parameters of a Bivariate Normal distribution.

We also learn to handle missing data points in our data set using Expectation Maximization Algorithm.

Theory

Likelihood

Say \(\vec{X} \sim f( \vec{x} \mid \theta)\) .

Say we have a sample of \(X\): \(\vec{x}\)

Then likelihood function \(L(\theta \mid \vec{x})\) is a measure of how likely the occurrence of the sample is for a given theta.

\[ L(\theta \mid \vec{x}) = f(\vec{x} \mid \theta) \]

Often log-likelihood \(\log(L(\theta \mid \vec{x}))\) denoted by \(l(\theta \mid \vec{x})\)) is used due to convenience.

Estimate

Say \(\vec{X_1}, \vec{X_2} \ldots \vec{X_n} \sim f(\vec{x} \mid \theta)\) .

But \(\theta\) is not known. We need to find an estimate of \(\theta\) based on the known sample data.

Thus the estimate is some function \(T(\vec{X_1}, \vec{X_2} \ldots \vec{X_n}) = \hat{\theta}\) .

There are various ways to judge the usefulness of our estimated \(\hat{\theta}\).

Maximum Likelihood Estimate

The estimate \(\hat{\theta}\) such that likelihood \(L(\hat{\theta} \mid \vec{x})\) is maximzised is useful in many scenarios. We call such estimates as Maximum Likelihood Estimate (MLE of \(\hat{\theta_{MLE}}\)).

MLE may be computed using a combination of log-likelihood, Algebra, Calculus and other techniques.

Expectation Maximization Algorithm

Sometimes, there are missing data points in \(\vec{x_i}\). In such cases we may not be able to compute the MLE analytically. Thus a numerical approach called Expectation Maximization Algorithm is used.

Data Wrangling

Import Dataset

The data is on “fasting” (F) and “after meal - postprandial” (PP) blood sugar levels (measured in mg/dL) of 20 individuals. For a few individuals, one of the two blood sugar levels was not measured as per the doctor’s recommendation. Assume that (F, PP) measurements are i.i.d. bivariate normal with unknown means, variances and correlation.

Data is imported from an Open Document Foundation Spreadsheet data.ods.

Code
d = read_ods("data.ods")

Reformat Dataset

Order the dataset logically so that data points of similar missing data are together.

Code
da = d[ complete.cases(d), ]
db = d[ !complete.cases(d), ]
dba = db[ complete.cases(db[ , 1]), ]
dbb = db[ complete.cases(db[ , 2]), ]

l1 = length(da)
l2 = length(dba)
l3 = length(dbb)

d = bind_rows(da, dba, dbb)
db = bind_rows(dba, dbb)
kable(d)
F PP
78 128
89 118
93 134
76 117
85 130
84 122
86 131
73 137
97 119
88 123
79 135
91 NA
77 NA
95 NA
92 NA
81 NA
NA 127
NA 129
NA 125
NA 121

Expectation Maximisation Algorithm

OLS estimate for known dataset

The distribution is considered to be bivariate normal. Thus the distribution parameters are \(\mu_X\), \(\mu_Y\), \(\sigma^2_X\), \(\sigma^2_Y\) and \(\rho\).

Code
estbvn = function(d) {
  td = t(d)
  
  X = td[1, ]
  Y = td[2, ]
  
  l = length(X)
  
  mux = mean(X)
  muy = mean(Y)
  
  s2x = mean( (X-mux) * (X-mux) )
  s2y = mean( (Y-muy) * (Y-muy) )
  s2xy = mean( (X-mux) * (Y-muy) )
  rho = s2xy / sqrt(s2x * s2y)
  
  theta = c(mux, muy, s2y, s2x, rho)
  return(theta)
}

theta = estbvn(da)

The OLS estimates of the parameters are: 84.3636364, 126.7272727, 47.6528926, 49.1404959, -0.311683

Imputing Missing values

We use the interim estimates thus obtained to infer the missing data values.

We make use of the following result:

\[ E(Y \mid x) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X) \]

Code
imptmissed = function(d, t) {
  db1 = d[ complete.cases(db[ , 1]), ]
  db2 = d[ complete.cases(db[ , 2]), ]
  
  db1[ , 2] = t[2] + ( ( t[5] * sqrt(t[4]) * (db1[ , 1] - t[1]) ) / sqrt(t[3]) )
  db2[ , 1] = t[1] + ( ( t[5] * sqrt(t[3]) * (db2[ , 2] - t[2]) ) / sqrt(t[4]) )
  
  res = bind_rows(db1, db2)
  return(res)
}

d2 = bind_rows( da, imptmissed(db, theta) )

The Imputed data set is as follows:

Code
kable(d2)
F PP
78.00000 128.0000
89.00000 118.0000
93.00000 134.0000
76.00000 117.0000
85.00000 130.0000
84.00000 122.0000
86.00000 131.0000
73.00000 137.0000
97.00000 119.0000
88.00000 123.0000
79.00000 135.0000
91.00000 124.6268
77.00000 129.0579
95.00000 123.3608
92.00000 124.3103
81.00000 127.7919
84.27993 127.0000
83.66607 129.0000
84.89379 125.0000
86.12150 121.0000

Iteration

We continue the process by computing the interim estimates and imputing missing data repeatedly.

We pray that the data converges. Let us observe:

Code
thetalog = tibble(
  muX = theta[1],
  muY = theta[2],
  sigma2X = theta[3],
  sigma2Y = theta[4],
  rho = theta[5]
)

l = 15
for ( i in seq(l) ) {
  theta = estbvn(d2)
  
  thetalog = thetalog %>% add_row(
    muX = theta[1],
    muY = theta[2],
    sigma2X = theta[3],
    sigma2Y = theta[4],
    rho = theta[5]
  )
  
  d2 = bind_rows(da, imptmissed(db, theta))
}

kable(thetalog)
muX muY sigma2X sigma2Y rho
84.36364 126.7273 47.65289 49.14050 -0.3116830
85.14806 126.2574 29.44717 40.65642 -0.3758238
85.27806 126.1378 30.73413 40.63218 -0.4205278
85.30227 126.1021 31.26110 40.69501 -0.4353826
85.30641 126.0899 31.43959 40.71755 -0.4399703
85.30689 126.0854 31.49526 40.72451 -0.4413570
85.30682 126.0838 31.51250 40.72655 -0.4417789
85.30674 126.0833 31.51789 40.72715 -0.4419092
85.30670 126.0830 31.51959 40.72733 -0.4419501
85.30668 126.0830 31.52014 40.72738 -0.4419631
85.30667 126.0830 31.52032 40.72740 -0.4419673
85.30667 126.0829 31.52038 40.72740 -0.4419687
85.30666 126.0829 31.52040 40.72740 -0.4419692
85.30666 126.0829 31.52041 40.72740 -0.4419693
85.30666 126.0829 31.52041 40.72741 -0.4419694
85.30666 126.0829 31.52041 40.72741 -0.4419694

Notice that each the estimate sequences converged within 15 iterations. We take the limit of these sequences to be our Estimates according to the EM algorithm.

Thus our parameter estimates are:

muX muY sigma2X sigma2Y rho
85.30666 126.0829 31.52041 40.72741 -0.4419694

Conclusion

Therefore, using an iterative method, one can compute the Maximum Likelihood estimates for parameters, even if some data points are missing.