aman9das - Handling Missing Data in R

In this article we compute Maximum-Likelihood Estimate for the parameters of a Bivariate Normal distribution.

We also learn to handle missing data points in our data set using Expectation Maximization Algorithm.

Theory

Likelihood

Say \(\vec{X} \sim f( \vec{x} \mid \theta)\) .

Say we have a sample of \(X\): \(\vec{x}\)

Then likelihood function \(L(\theta \mid \vec{x})\) is a measure of how likely the occurrence of the sample is for a given theta.

\[ L(\theta \mid \vec{x}) = f(\vec{x} \mid \theta) \]

Often log-likelihood \(\log(L(\theta \mid \vec{x}))\) denoted by \(l(\theta \mid \vec{x})\)) is used due to convenience.

Estimate

Say \(\vec{X_1}, \vec{X_2} \ldots \vec{X_n} \sim f(\vec{x} \mid \theta)\) .

But \(\theta\) is not known. We need to find an estimate of \(\theta\) based on the known sample data.

Thus the estimate is some function \(T(\vec{X_1}, \vec{X_2} \ldots \vec{X_n}) = \hat{\theta}\) .

There are various ways to judge the usefulness of our estimated \(\hat{\theta}\).

Maximum Likelihood Estimate

The estimate \(\hat{\theta}\) such that likelihood \(L(\hat{\theta} \mid \vec{x})\) is maximzised is useful in many scenarios. We call such estimates as Maximum Likelihood Estimate (MLE of \(\hat{\theta_{MLE}}\)).

MLE may be computed using a combination of log-likelihood, Algebra, Calculus and other techniques.

Expectation Maximization Algorithm

Sometimes, there are missing data points in \(\vec{x_i}\). In such cases we may not be able to compute the MLE analytically. Thus a numerical approach called Expectation Maximization Algorithm is used.

Data Wrangling

Import Dataset

The data is on “fasting” (F) and “after meal - postprandial” (PP) blood sugar levels (measured in mg/dL) of 20 individuals. For a few individuals, one of the two blood sugar levels was not measured as per the doctor’s recommendation. Assume that (F, PP) measurements are i.i.d. bivariate normal with unknown means, variances and correlation.

Data is imported from an Open Document Foundation Spreadsheet data.ods.

Code

d = read_ods("data.ods")

Reformat Dataset

Order the dataset logically so that data points of similar missing data are together.

Code

da = d[ complete.cases(d), ]
db = d[ !complete.cases(d), ]
dba = db[ complete.cases(db[ , 1]), ]
dbb = db[ complete.cases(db[ , 2]), ]

l1 = length(da)
l2 = length(dba)
l3 = length(dbb)

d = bind_rows(da, dba, dbb)
db = bind_rows(dba, dbb)
kable(d)

F	PP
78	128
89	118
93	134
76	117
85	130
84	122
86	131
73	137
97	119
88	123
79	135
91	NA
77	NA
95	NA
92	NA
81	NA
NA	127
NA	129
NA	125
NA	121

Expectation Maximisation Algorithm

OLS estimate for known dataset

The distribution is considered to be bivariate normal. Thus the distribution parameters are \(\mu_X\), \(\mu_Y\), \(\sigma^2_X\), \(\sigma^2_Y\) and \(\rho\).

Code

estbvn = function(d) {
  td = t(d)
  
  X = td[1, ]
  Y = td[2, ]
  
  l = length(X)
  
  mux = mean(X)
  muy = mean(Y)
  
  s2x = mean( (X-mux) * (X-mux) )
  s2y = mean( (Y-muy) * (Y-muy) )
  s2xy = mean( (X-mux) * (Y-muy) )
  rho = s2xy / sqrt(s2x * s2y)
  
  theta = c(mux, muy, s2y, s2x, rho)
  return(theta)
}

theta = estbvn(da)

The OLS estimates of the parameters are: 84.3636364, 126.7272727, 47.6528926, 49.1404959, -0.311683

Imputing Missing values

We use the interim estimates thus obtained to infer the missing data values.

We make use of the following result:

\[ E(Y \mid x) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X) \]

Code

imptmissed = function(d, t) {
  db1 = d[ complete.cases(db[ , 1]), ]
  db2 = d[ complete.cases(db[ , 2]), ]
  
  db1[ , 2] = t[2] + ( ( t[5] * sqrt(t[4]) * (db1[ , 1] - t[1]) ) / sqrt(t[3]) )
  db2[ , 1] = t[1] + ( ( t[5] * sqrt(t[3]) * (db2[ , 2] - t[2]) ) / sqrt(t[4]) )
  
  res = bind_rows(db1, db2)
  return(res)
}

d2 = bind_rows( da, imptmissed(db, theta) )

The Imputed data set is as follows:

Code

kable(d2)

F	PP
78.00000	128.0000
89.00000	118.0000
93.00000	134.0000
76.00000	117.0000
85.00000	130.0000
84.00000	122.0000
86.00000	131.0000
73.00000	137.0000
97.00000	119.0000
88.00000	123.0000
79.00000	135.0000
91.00000	124.6268
77.00000	129.0579
95.00000	123.3608
92.00000	124.3103
81.00000	127.7919
84.27993	127.0000
83.66607	129.0000
84.89379	125.0000
86.12150	121.0000

Iteration

We continue the process by computing the interim estimates and imputing missing data repeatedly.

We pray that the data converges. Let us observe:

Code

thetalog = tibble(
  muX = theta[1],
  muY = theta[2],
  sigma2X = theta[3],
  sigma2Y = theta[4],
  rho = theta[5]
)

l = 15
for ( i in seq(l) ) {
  theta = estbvn(d2)
  
  thetalog = thetalog %>% add_row(
    muX = theta[1],
    muY = theta[2],
    sigma2X = theta[3],
    sigma2Y = theta[4],
    rho = theta[5]
  )
  
  d2 = bind_rows(da, imptmissed(db, theta))
}

kable(thetalog)

muX	muY	sigma2X	sigma2Y	rho
84.36364	126.7273	47.65289	49.14050	-0.3116830
85.14806	126.2574	29.44717	40.65642	-0.3758238
85.27806	126.1378	30.73413	40.63218	-0.4205278
85.30227	126.1021	31.26110	40.69501	-0.4353826
85.30641	126.0899	31.43959	40.71755	-0.4399703
85.30689	126.0854	31.49526	40.72451	-0.4413570
85.30682	126.0838	31.51250	40.72655	-0.4417789
85.30674	126.0833	31.51789	40.72715	-0.4419092
85.30670	126.0830	31.51959	40.72733	-0.4419501
85.30668	126.0830	31.52014	40.72738	-0.4419631
85.30667	126.0830	31.52032	40.72740	-0.4419673
85.30667	126.0829	31.52038	40.72740	-0.4419687
85.30666	126.0829	31.52040	40.72740	-0.4419692
85.30666	126.0829	31.52041	40.72740	-0.4419693
85.30666	126.0829	31.52041	40.72741	-0.4419694
85.30666	126.0829	31.52041	40.72741	-0.4419694

Notice that each the estimate sequences converged within 15 iterations. We take the limit of these sequences to be our Estimates according to the EM algorithm.

Thus our parameter estimates are:

muX	muY	sigma2X	sigma2Y	rho
85.30666	126.0829	31.52041	40.72741	-0.4419694

Conclusion

Therefore, using an iterative method, one can compute the Maximum Likelihood estimates for parameters, even if some data points are missing.