Should feature selection be applied to the training data only, or to all of the data?


I've noticed that in examples and guides, feature selection steps (correlation-based elimination, backward/stepwise selection) are sometimes applied to the training data after splitting, but other times they are applied to all of the data.

Is there a clear answer on this? Which approach is more logical?


Posted 2020-01-26T09:56:17.463




Like any preprocessing step, feature selection must be carried out using the training data only, i.e. the choice of which features to include can depend only on the instances of the training set.

Once the selection has been made, i.e. the set of features is fixed, the test data has to be formatted with the exact same features. This step is sometimes called "applying feature selection" but it's an abuse of language: it's only about preparing the test data with the features which were previously selected during the training stage.

Performing feature selection on data that includes the test set is a mistake: the selected features would then depend on the test instances, which means the model has effectively "seen" the test set, and this invalidates any evaluation on it.
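To make this concrete, here is a minimal sketch of the correct workflow using plain NumPy and a simple correlation-based filter (the data, the 0.5 threshold, and the correlation criterion are all illustrative assumptions, not part of the question): the feature mask is computed from the training split only, then the same fixed mask is reused to prepare the test split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 5 features; only feature 0 actually drives the target.
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# 1. Split FIRST, before any feature selection.
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# 2. Select features using the training set only: keep features whose
#    absolute correlation with y exceeds a (hypothetical) threshold.
corr = np.array([abs(np.corrcoef(X_train[:, j], y_train)[0, 1])
                 for j in range(X_train.shape[1])])
selected = corr > 0.5  # boolean mask; fixed from this point on

# 3. "Apply" the selection to the test set: no new computation,
#    just reuse the mask chosen during training.
X_train_sel = X_train[:, selected]
X_test_sel = X_test[:, selected]
```

The key point is step 3: nothing about the test set influences `selected`. Computing `corr` on the full `X` instead would let the test instances leak into the choice of features.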

