How is data prepared during training, testing, and in production?


Most real-world datasets have features with missing values. Replacing a missing value with an appropriate substitute, such as the feature's mean, is considered a good feature-engineering step. Sometimes we also standardize/normalize feature columns before feeding them to a model for training.

Before modelling we also split our dataset into training and testing sets.

My first question is: how do we do feature engineering on this split dataset?

Do we use a global mean computed on the unsplit data to replace the missing values in both the training and testing sets, or should we use the local mean of each set?

Similarly, how do we normalize the training and testing sets?

A last but important question: in production we mostly receive feature values one observation at a time (think of a single row of features). How do we feature-engineer such data rows?

Eka

Posted 2020-12-16T15:08:15.560

Reputation: 211

Answers


The principle in supervised ML is quite simple: the "method" which is going to be used to predict the response variable must be fully determined from the training set and only from the training set. In other words, anything which doesn't belong to the training set cannot be used.

As a consequence, feature engineering, i.e. choosing how to prepare/represent/normalize the features, must be done using only the training set. This includes any feature selection/extraction step.

Note that once the final data preparation process is fully determined, it can and should be applied exactly the same way on the test set or in production. This means that, for instance, normalization does not involve recomputing any parameters; it reuses the ones calculated on the training set.
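As a minimal sketch of this principle (assuming scikit-learn; the toy data and parameter choices are illustrative, not part of the original answer): the imputer and scaler are fitted on the training set only, and the same fitted objects are then reused on the test set and on single rows in production.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # Toy feature matrix with missing values encoded as NaN.
    X = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [4.0, 40.0],
                  [5.0, 50.0],
                  [6.0, 60.0]])
    y = np.array([0, 1, 0, 1, 0, 1])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=0)

    # Fit the imputation and normalization parameters on the training set ONLY.
    imputer = SimpleImputer(strategy="mean").fit(X_train)
    scaler = StandardScaler().fit(imputer.transform(X_train))

    # Apply the already-fitted transformers elsewhere: no refitting.
    X_train_ready = scaler.transform(imputer.transform(X_train))
    X_test_ready = scaler.transform(imputer.transform(X_test))

    # In production, a single incoming row goes through the same fitted objects.
    new_row = np.array([[3.0, np.nan]])
    new_row_ready = scaler.transform(imputer.transform(new_row))

In scikit-learn these steps are often bundled in a Pipeline, which enforces exactly this fit-on-train / transform-everywhere discipline automatically.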


Erwan

Posted 2020-12-16T15:08:15.560

Reputation: 12 600

How would you handle outlier detection? Prune outliers from the training split, but not the test split? – HashRocketSyntax – 2021-01-06T01:08:01.690

Would you forgo stratifying splits by label altogether and just choose random samples, hoping you get lucky by stumbling upon a good distribution? To some extent, isn't a split supposed to be an accurate representation of the population? – HashRocketSyntax – 2021-01-06T01:09:28.047

@HashRocketSyntax outlier detection can be either supervised or not; in your example the outliers are removed from the training set, so I assume you're talking about the unsupervised case, whereas my answer was only about supervised learning. I don't see any problem with removing outliers in the same way from both the training set and the test set. However, the test set shouldn't be involved in removing outliers from the training set; that would be a potential source of data leakage. – Erwan – 2021-01-06T01:30:24.963
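To make the outlier point concrete, here is a hedged sketch (assuming an unsupervised detector such as scikit-learn's IsolationForest; the data are made up): the detector is fitted on the training set only, so the test set plays no role in deciding what counts as an outlier.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    X_train = rng.normal(size=(200, 2))  # stand-in training features
    X_test = rng.normal(size=(50, 2))    # stand-in test features

    # Fit the detector on the training set ONLY (no test-set leakage).
    detector = IsolationForest(random_state=0).fit(X_train)

    # predict() returns +1 for inliers and -1 for outliers; keep the inliers.
    X_train_clean = X_train[detector.predict(X_train) == 1]
    X_test_clean = X_test[detector.predict(X_test) == 1]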

Similarly, my answer doesn't talk about how the training and test sets themselves are constructed, so I don't see why you interpret it as being about stratified sampling. – Erwan – 2021-01-06T01:30:38.970