Correlation with missing values. Is least squares an acceptable option?


I have been tasked with finding a correlation matrix for a lot of variables. Many of them have missing values. I read here that pairwise deletion may not be the best way of dealing with this situation, but if I use listwise deletion I am afraid I'll end up with too little data.

I am wondering if I could estimate the missing values using least squares, and calculate the correlation using the now complete data set.


Posted 2017-11-24T11:23:16.053


What's the missing ratio? There are many other ways to impute data (k-NN, mode, median). Have you visualized your data? – Mephy – 2017-11-24T11:56:15.067

For pairwise deletion, use the 95% bound closest to 0 to check whether the correlation is worth pursuing, then use MLR or a caret classification model to see whether the missingness is random. – ran8 – 2017-11-30T06:04:50.013



Yes, you "can" use least squares to do that. However, you would probably need some parametric assumptions to do so, which in turn might require some knowledge of the correlation matrix...

To go back to your original problem, I would actually argue against your reference about pairwise deletion. The weird behavior its author identifies mostly happens because the estimator is computed on a data set that is too small. The empirical estimator of Pearson's correlation matrix is NOT guaranteed to be positive-definite: with complete data it is positive semi-definite but can be singular, and with pairwise deletion it can even have negative eigenvalues. Also, there is a pattern in that example's missing data: entries of $X_3$ are missing only when $X_1 \neq X_2$. Let us instead assume that entries are missing purely at random.
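To make the positive-definiteness concern concrete, here is a minimal Python sketch on a toy data set. The data are made up for illustration (they are not the example from the reference), and the helper names `pearson` and `pairwise_cor` are hypothetical. Each pair of variables is observed together on its own three rows, and the resulting pairwise-deletion matrix has a negative determinant, so it cannot be positive semi-definite.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with no missing values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy data, constructed for illustration only; None marks a missing value.
rows = [
    (1, 1, None), (2, 2, None), (3, 3, None),   # only X1 and X2 observed
    (1, None, 1), (2, None, 2), (3, None, 3),   # only X1 and X3 observed
    (None, 1, 3), (None, 2, 2), (None, 3, 1),   # only X2 and X3 observed
]

def pairwise_cor(rows, i, j):
    """Correlation of columns i and j using only rows where both are present."""
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])

r12 = pairwise_cor(rows, 0, 1)   # +1 on its three rows
r13 = pairwise_cor(rows, 0, 2)   # +1 on its three rows
r23 = pairwise_cor(rows, 1, 2)   # -1 on its three rows

# Determinant of [[1, r12, r13], [r12, 1, r23], [r13, r23, 1]].
det = 1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2
print(r12, r13, r23, det)
```

Note that this toy case is both tiny and highly structured in its missingness, which is the kind of setting where the problem shows up most easily.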

The estimator you compute using $\hat\Sigma$ := `cor(X, use="pairwise.complete.obs")` for a data matrix $X$ still makes a lot of sense. The biggest difference from the complete-data case is that some entries of $\hat\Sigma$ are now computed on less data; but they still estimate the quantity you wish to estimate.
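For readers who want to see what "computed on less data" means entry by entry, the following is a plain-Python sketch of a pairwise-complete correlation. It is not R's implementation; the function name `pairwise_cor_matrix` and the example data are assumptions made for illustration. Besides the correlations, it returns how many rows contributed to each entry.

```python
import math

def pairwise_cor_matrix(data):
    """Return (cor, counts) for a list of rows, where None marks a missing value.

    cor[i][j] is the Pearson correlation of columns i and j over the rows
    where both are observed; counts[i][j] is the number of such rows.
    """
    d = len(data[0])
    cor = [[float("nan")] * d for _ in range(d)]
    counts = [[0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            pairs = [(r[i], r[j]) for r in data
                     if r[i] is not None and r[j] is not None]
            n = len(pairs)
            counts[i][j] = counts[j][i] = n
            if n < 2:
                continue  # correlation undefined, entry stays NaN
            mx = sum(x for x, _ in pairs) / n
            my = sum(y for _, y in pairs) / n
            sxy = sum((x - mx) * (y - my) for x, y in pairs)
            sxx = sum((x - mx) ** 2 for x, _ in pairs)
            syy = sum((y - my) ** 2 for _, y in pairs)
            if sxx > 0 and syy > 0:
                cor[i][j] = cor[j][i] = sxy / math.sqrt(sxx * syy)
    return cor, counts

# Hypothetical example: three variables, five rows, some values missing.
data = [
    (1.0, 2.0, None),
    (2.0, 4.1, 1.0),
    (3.0, 5.9, None),
    (4.0, None, 3.5),
    (5.0, 10.2, 2.0),
]
cor, counts = pairwise_cor_matrix(data)
print(counts)
```

Looking at `counts` next to `cor` makes it easy to spot entries that rest on very few complete pairs, such as the (2, 3) entry here.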

What is less clear is how to use $\hat\Sigma$ in the desired application: the fact that some entries were computed on less data means that some entries might be less reliable than others. However, notice that even in the classical case, some entries are less reliable than others anyway (this depends on the underlying distribution).
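As a rough guide to how reliability varies across entries, one can use a standard large-sample approximation. It rests on assumptions that are not part of the argument above: variables $i$ and $j$ are bivariate normal, values are missing purely at random, and $n_{ij}$ denotes the number of rows in which both are observed. Then \begin{align*} \mathrm{Var}\big(\hat\Sigma_{i,j}\big) \approx \frac{(1-\rho_{i,j}^2)^2}{n_{ij}}, \end{align*} where $\rho_{i,j}$ is the true correlation. Under this approximation, an entry becomes less reliable as $n_{ij}$ shrinks, and even with equal $n_{ij}$ an entry whose true correlation is near 0 is noisier than one whose true correlation is near $\pm 1$.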

To understand the impact of using all available pairwise data to compute $\hat\Sigma$ in the presence of missing data, consider the two following situations:

Situation 1: You have a complete $n \times d$ data matrix $X$ and compute the empirical correlation with $\hat\Sigma^{(1)}$ := `cor(X)`.

Situation 2: You have a $[nd(d-1)/2] \times d$ data matrix $X$ in which you have $n$ pairwise observations for each of the $d(d-1)/2$ pairs of variables, each pair on its own set of rows, and NA everywhere else. In this case you compute the empirical correlation with $\hat\Sigma^{(2)}$ := `cor(X, use="pairwise.complete.obs")`.

In both situations, you have the same number of non-NA pairs (although more non-NA entries in the second case). However, for $(i_1,j_1) \neq (i_2,j_2)$, you have \begin{align*} \mathrm{cor}(\hat\Sigma^{(1)}_{i_1,j_1},\hat\Sigma^{(1)}_{i_2,j_2}) \neq 0 \end{align*} assuming your variables are indeed related, and \begin{align*} \mathrm{cor}(\hat\Sigma^{(2)}_{i_1,j_1},\hat\Sigma^{(2)}_{i_2,j_2}) = 0, \end{align*} because in the second situation each entry is computed from its own set of rows (assuming the rows are independent). Somehow, I tend to think the zero correlation between the estimates might be a good thing more than a bad one...
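The layout of Situation 2 can be sketched in Python as follows. The function names `make_situation_2` and `complete_rows` are hypothetical, and the values are independent standard normals used purely as placeholders; the point is the missingness pattern, not the values. The check at the end confirms that no row contributes to more than one entry of the pairwise-deletion estimate.

```python
import random
from itertools import combinations

def make_situation_2(n, d, seed=0):
    """Build an [n * d(d-1)/2] x d data set in which each block of n rows
    observes exactly one pair of variables; everything else is None."""
    rng = random.Random(seed)
    rows = []
    for pair in combinations(range(d), 2):
        for _ in range(n):
            row = [None] * d
            for k in pair:
                row[k] = rng.gauss(0.0, 1.0)  # placeholder value
            rows.append(row)
    return rows

def complete_rows(rows, i, j):
    """Indices of the rows where both column i and column j are observed."""
    return {idx for idx, r in enumerate(rows)
            if r[i] is not None and r[j] is not None}

n, d = 5, 4
rows = make_situation_2(n, d)
pairs = list(combinations(range(d), 2))
used = {pair: complete_rows(rows, *pair) for pair in pairs}

# True if any two distinct pairs of variables share a complete row.
overlap = any(used[p] & used[q] for p, q in combinations(pairs, 2))
print(len(rows), overlap)
```

With independent rows, disjoint row sets mean the entries are computed from independent samples, which is where the zero correlation between estimates comes from.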

All this to say, pairwise deletion is not necessarily a bad way to proceed.


Posted 2017-11-24T11:23:16.053
