How can I predict labels for data with incomplete features, using a model trained on data with more features?



Suppose I have a model that was trained on a dataset containing the features (f1, f2, f3, f4, f5, f6). However, my test dataset contains only a subset of those features, (f1, f2, f3). How can I predict the labels of the test entries when f4, f5 and f6 are missing?

Dae-Young Park

Posted 2020-07-23T10:44:28.770

Reputation: 31



Assuming that you have access to the training data set, you could use an autoencoder-style network to estimate what the features f4, f5, f6 'could be' for the test data set. To do this, train the network on the training data set with (f1, f2, f3) as the inputs and (f1, f2, f3, f4, f5, f6) as the target outputs. The network then effectively learns to map any input sample with (f1, f2, f3) to (f1, f2, f3, f4, f5, f6). You can then pass your test data through the network and feed its output to your original model. This only helps to the extent that f4, f5, f6 can actually be inferred from f1, f2, f3.
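A minimal sketch of this idea, using only NumPy and a hand-written one-hidden-layer network rather than a deep-learning framework. The synthetic data, the names `fit_completion_net` and `complete`, and all hyperparameters are assumptions made for illustration, not part of the original answer. The sketch also keeps the observed (f1, f2, f3) and takes only the estimated (f4, f5, f6) from the network output, which is one possible design choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): f4..f6 depend on f1..f3.
f123 = rng.normal(size=(600, 3))
mixing = rng.normal(size=(3, 3))
f456 = np.tanh(0.5 * f123 @ mixing) + 0.05 * rng.normal(size=(600, 3))
data = np.hstack([f123, f456])            # columns f1..f6
train, test = data[:500], data[500:]
X_test_partial = test[:, :3]              # the test set only has f1..f3


def fit_completion_net(x_in, y_out, hidden=16, lr=0.05, epochs=3000, seed=0):
    """Fit a one-hidden-layer tanh network x_in -> y_out with mean squared error."""
    r = np.random.default_rng(seed)
    w1 = r.normal(scale=0.5, size=(x_in.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = r.normal(scale=0.5, size=(hidden, y_out.shape[1]))
    b2 = np.zeros(y_out.shape[1])
    for _ in range(epochs):
        h = np.tanh(x_in @ w1 + b1)               # hidden activations
        err = (h @ w2 + b2 - y_out) / len(x_in)   # d(loss)/d(prediction)
        g_h = (err @ w2.T) * (1.0 - h ** 2)       # backprop through tanh
        w2 -= lr * (h.T @ err)
        b2 -= lr * err.sum(axis=0)
        w1 -= lr * (x_in.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return w1, b1, w2, b2


def complete(params, x_in):
    """Map (f1, f2, f3) to a full six-feature row."""
    w1, b1, w2, b2 = params
    out = np.tanh(x_in @ w1 + b1) @ w2 + b2
    # Keep the observed f1..f3; use the network only for f4..f6.
    return np.hstack([x_in, out[:, 3:]])


# Inputs are f1..f3, targets are all six features, as described above.
params = fit_completion_net(train[:, :3], train)
completed = complete(params, X_test_partial)

# Compare against the trivial baseline of filling in training means.
mse_net = np.mean((completed[:, 3:] - test[:, 3:]) ** 2)
mse_mean = np.mean((train[:, 3:].mean(axis=0) - test[:, 3:]) ** 2)
print(f"completion MSE: {mse_net:.4f}, mean-fill MSE: {mse_mean:.4f}")
```

The `completed` array has the same six columns as the training data, so it can be passed to the already trained model. Comparing against the mean-fill baseline on held-out training rows is a quick check of whether f4, f5, f6 are predictable from f1, f2, f3 at all.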


Posted 2020-07-23T10:44:28.770

Reputation: 111

Thank you so much! I will attempt to learn this network model – Dae-Young Park – 2020-07-28T01:00:19.150


I assume you trained your model on (f1, f2, f3, f4, f5, f6), and that your test data sometimes has only (f1, f2, f3) and sometimes has, for example, all of (f1, f2, f3, f4, f5, f6), right? Because if your test data always has only (f1, f2, f3), then isn't it better to just train a model on the available features?

So if my assumption is correct, what I would do is manipulate the training set a bit: keep some training samples with their real (f1, f2, f3, f4, f5, f6), and for the others keep (f1, f2, f3) but replace the real values of (f4, f5, f6) with, e.g., the mean of the respective feature. All training samples then still have (f1, f2, f3, f4, f5, f6), but some of them have manipulated (f4, f5, f6). Finally, when testing, apply the same manipulation to the test samples that have fewer features.

I think this way your model learns how to predict based on (f1, f2, f3) when the other features are not available, while still taking advantage of all features when they are available.

It's probably not the best approach, but it is worth trying.
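The manipulation described above could be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the classifier is a small hand-written logistic regression standing in for whatever model is actually used, and the names `mask_with_means`, `fit_logreg` and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic six-feature data with a binary label (assumption).
X = rng.normal(size=(1000, 6))
true_w = np.array([1.0, -1.0, 0.5, 0.8, -0.6, 0.3])
y = (X @ true_w > 0).astype(float)
X_train, y_train = X[:800], y[:800]
X_test, y_test = X[800:], y[800:]

MISSING = [3, 4, 5]                                 # column indices of f4, f5, f6
col_means = X_train[:, MISSING].mean(axis=0)        # computed on training data only


def mask_with_means(x, rows):
    """Copy x and replace f4..f6 of the given rows with the training means."""
    out = x.copy()
    out[np.ix_(rows, MISSING)] = col_means
    return out


def fit_logreg(x, y, lr=0.5, epochs=500):
    """Plain gradient-descent logistic regression (stand-in for the real model)."""
    xb = np.hstack([x, np.ones((len(x), 1))])       # append a bias column
    w = np.zeros(xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))
        w -= lr * xb.T @ (p - y) / len(y)
    return w


def predict(w, x):
    xb = np.hstack([x, np.ones((len(x), 1))])
    return (xb @ w > 0).astype(float)


# Training: half of the rows keep real f4..f6, the other half get the means.
masked_rows = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
w = fit_logreg(mask_with_means(X_train, masked_rows), y_train)

# Testing: rows without f4..f6 receive the same manipulation.
X_test_partial = mask_with_means(X_test, np.arange(len(X_test)))
acc_full = np.mean(predict(w, X_test) == y_test)
acc_partial = np.mean(predict(w, X_test_partial) == y_test)
print(f"accuracy with all features: {acc_full:.2f}, with f1..f3 only: {acc_partial:.2f}")
```

The means are computed on the training set and reused at test time, so that test rows with missing features look like the manipulated rows the model saw during training.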


Posted 2020-07-23T10:44:28.770

Reputation: 196

Um.. Thank you so much for your explanation – Dae-Young Park – 2020-07-28T01:09:20.837

I think what you are describing is 'imputation' of missing values – Dae-Young Park – 2020-07-28T01:12:55.520