What would be the best way to impute data?


Other than just filling in with the mean of a feature, what other methods are there that can work well? I am trying to decide whether to use a denoising autoencoder, impute with the mean, or use any other method that performs well enough and is quick to implement.

Also, is it necessary to make any assumptions about the underlying distribution? The denoising autoencoder seems attractive here because you do not need to make any explicit assumptions.

Ganesh Anand

Posted 2016-09-13T10:21:50.330

Reputation: 11



You can treat the missing feature as the target variable of a sub-problem and build a classifier (e.g., a linear model, an SVM, etc.) for it.
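A minimal sketch of this idea, using only numpy and hypothetical toy data: the column with missing values is treated as the target of a linear regression fitted on the complete rows, and the fit is then used to fill in the incomplete rows. (A classifier would play the same role for a categorical feature.)

```python
import numpy as np

# Hypothetical toy data: column 2 is a linear function of the others,
# with 20% of its values knocked out.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] - X[:, 1]
missing = rng.random(100) < 0.2
X[missing, 2] = np.nan

# Fit a linear model for the missing column on the complete rows.
obs = ~missing
A = np.column_stack([X[obs, 0], X[obs, 1], np.ones(obs.sum())])
coef, *_ = np.linalg.lstsq(A, X[obs, 2], rcond=None)

# Predict the missing entries from the observed features.
A_miss = np.column_stack([X[missing, 0], X[missing, 1], np.ones(missing.sum())])
X[missing, 2] = A_miss @ coef
```

With real data you would of course not recover the missing values exactly; this only works to the extent that the missing feature is predictable from the others.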

Ryan Zotti

Posted 2016-09-13T10:21:50.330

Reputation: 3 849

But how would you do it if the feature that is missing is not the same everywhere? The data is such that the feature vectors are about 1900 elements long, and most of them have a missing value in some sample or another. – Ganesh Anand – 2016-09-14T18:13:14.077


Let's say that we have a perfect algorithm for imputing data. You give it a dataset with missing values for some features and it predicts them.

If I had such a data imputation algorithm, I could use it as a classifier: treat the class label as just another feature, hide it, and let the algorithm fill it in. The fact that we can reduce classification to data imputation means that data imputation is at least as hard as classification.

As Ryan said, you can indeed do so. Imputation algorithms could be evaluated as classification algorithms, but this is not very common. One of the reasons is that the rule by which you hide data from the imputation algorithm is subjective and might affect the results significantly.

Whatever method you use, you assume that the data behave in some way and that values are missing according to some rule.

It is also an important question, since errors in the imputation (and you will probably have some) will affect the process down the road.

I prefer to avoid imputation and let the prediction algorithm cope with the missing data; there are many such algorithms. A different approach is to use a few imputation methods, choose a few simple classifiers (so that their complexity won't have too large an influence on the results), and compare the predictions. In many cases there is no big difference. When there is a big difference, you will of course prefer to proceed to a more detailed analysis using the winner. Before that, however, try to understand why it has such a big advantage. That might lead to interesting insights about your data.
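The comparison described above can be sketched as follows, on hypothetical synthetic data. Two imputation strategies (column mean and a zero constant) are compared by how a deliberately simple classifier, here a nearest-centroid rule written from scratch, performs on the imputed data; all names and the data-generating process are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: two classes, class-dependent feature means,
# with 10% of entries missing at random.
rng = np.random.default_rng(1)
n = 200
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 4)) + y[:, None]
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

def impute_mean(X):
    # Fill each missing entry with its column's mean over observed values.
    out = X.copy()
    col_mean = np.nanmean(out, axis=0)
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = col_mean[cols]
    return out

def impute_zero(X):
    # Fill every missing entry with the constant 0.
    out = X.copy()
    out[np.isnan(out)] = 0.0
    return out

def centroid_accuracy(X, y):
    # Deliberately simple classifier: nearest centroid, 50/50 split.
    half = len(y) // 2
    Xtr, ytr, Xte, yte = X[:half], y[:half], X[half:], y[half:]
    centroids = np.stack([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    d = np.linalg.norm(Xte[:, None, :] - centroids[None], axis=2)
    return (d.argmin(axis=1) == yte).mean()

results = {name: centroid_accuracy(fn(X_missing), y)
           for name, fn in [("mean", impute_mean), ("zero", impute_zero)]}
print(results)
```

If the two accuracies are close, the cheaper imputation is probably good enough; a large gap is the signal to dig into why one method wins.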


Posted 2016-09-13T10:21:50.330

Reputation: 2 463


There is no such thing as "the best way to impute data". The best method will always depend on your specific application and model. Just remember the No Free Lunch Theorem.

There are many ways in which you can impute values. You can ignore the rows containing missing values, you can impute values based on the other rows (mean, using classification, etc.), or you can even replace the missing values by a constant. Which method to use will depend on how much effort you want to put into the imputation algorithm and the final results.
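The three options listed above can be sketched in a few lines of numpy, on a hypothetical toy array:

```python
import numpy as np

X = np.array([[1.0,    2.0],
              [np.nan, 3.0],
              [4.0,    np.nan],
              [5.0,    6.0]])

# Option 1: ignore (drop) rows containing missing values.
X_drop = X[~np.isnan(X).any(axis=1)]

# Option 2: impute based on the other rows, here each column's mean.
col_mean = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_mean, X)

# Option 3: replace missing values with a constant.
X_const = np.where(np.isnan(X), 0.0, X)
```

Dropping rows is only safe when few rows are affected; the mean and constant fills keep all rows but distort the column distributions.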

As usual, I'd start with a simple method (the average, as you suggested) and increase the complexity of the imputation only if the results are not satisfactory.

Pablo Suau

Posted 2016-09-13T10:21:50.330

Reputation: 1 507