Manual feature engineering based on the output


So, I'm working on an ML model whose potential predictors are: age, a code for the person's city, their social status (married/single and so on), and the number of their children. The output, signed, is binary (0 or 1). That's the initial dataset I have.

Based on those features, the model should predict the value of signed for a given person.

I already generated predictions on unseen data. When I validated the predicted results against the real data, I got 25% accuracy, while cross-validation gives me 65% accuracy. So I thought: over-fitting.

Here is my question. I went back to the early stages of the whole process and started creating new features. For example, instead of the raw city code, which makes no sense as input to an ML model, I created classes based on the percentage of signed: the city with the higher percentage of signed (the output) gets assigned a higher value of class_city. This greatly improved the signed-class_city relationship in my correlation matrix, which makes sense. Is what I'm doing correct, or shouldn't I create features based on the output? Here is my CM:

After re-modelling with 3 features only (department_class, age and situation), I tested my model on unseen data made of 148 rows, compared to 60k rows in the training file.

The first model, with the old feature (the ID of the department), gave 25% accuracy, while the second model, with the new feature class_department, gave 71% (again on unseen data).

Note: the first model (25%) has some other features that are IDs; they might be causing such weak accuracy together with the department_ID.
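For concreteness, the class_city / class_department construction described in the question can be sketched in plain Python as below. The rows, city codes, and column layout are made up for illustration; this is the naive version computed over all available rows, which is exactly what the answers warn about:

```python
from collections import defaultdict

# Hypothetical rows of (city_code, signed); values are invented.
rows = [("A", 1), ("A", 1), ("A", 0), ("B", 0), ("B", 0), ("C", 1)]

sums = defaultdict(float)
counts = defaultdict(int)
for city, signed in rows:
    sums[city] += signed
    counts[city] += 1

# class_city = share of signed == 1 per city (the "percentage of signed").
class_city = {city: sums[city] / counts[city] for city in sums}
# class_city → {"A": 2/3, "B": 0.0, "C": 1.0}
```

Cities with a higher share of signed rows get a higher class_city value, so the feature correlates strongly with the target by construction.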


Posted 2019-03-19T14:08:18.553


This is called "target encoding." As you and the answers note, the encoding should not involve your test set(s). Also possibly problematic are rare levels in the variable. – Ben Reiniger – 2019-03-19T18:24:01.683

Thank you for your answer. My test set is untouched and unused during the whole process, if that's what you mean by not involving my test set. – Blenz – 2019-03-20T08:21:05.663



You can create features based on output values, but you should be careful in doing this.

When you use the value of class_city (based on percentage of signed for that city) for a given data point, note that this calculation cannot include the current data point, since you will not have the value of ‘signed’ during prediction.

One way to handle this is to split the total data you have into three parts: estimation, train, test. The estimation set is used only to estimate the class_city value for each city. These values can then be used in the train and test data. This way, you have the label-derived values without your model doing anything 'unfair'. For testing, you can in fact use the data from the estimation+train sets to estimate the class_city values for use in the test set. The same holds true for any unseen data: you can use the class_city values estimated from all the previous data points.
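A minimal sketch of the three-way split just described, with invented data and arbitrary split sizes (the city names, PRIOR fallback, and encode helper are all assumptions for illustration):

```python
import random

random.seed(0)  # reproducible fake data
cities = ["A", "B", "C"]
data = [(random.choice(cities), random.randint(0, 1)) for _ in range(300)]

# Estimation / train / test split (sizes are arbitrary here).
est, train, test = data[:100], data[100:200], data[200:]

PRIOR = 0.5  # fallback for cities absent from the estimation rows

def encode(rows):
    """Mean of signed per city, computed only from `rows`."""
    sums, counts = {}, {}
    for city, signed in rows:
        sums[city] = sums.get(city, 0) + signed
        counts[city] = counts.get(city, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

# class_city is estimated on the estimation set only, then applied to train...
enc = encode(est)
train_features = [(enc.get(city, PRIOR), signed) for city, signed in train]

# ...and for the test set you may refit on est + train, as described above.
enc_full = encode(est + train)
test_features = [(enc_full.get(city, PRIOR), signed) for city, signed in test]
```

The key property is that no row's own label ever contributes to the encoding value it receives at training or prediction time.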

In the context of time series data, for example, the class_city value for any data point can potentially use information from all previous data points, and should not use any information from future data points!
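For the time-ordered case, the "previous data points only" rule can be implemented as an expanding mean that is updated only after each row has been encoded. The rows and the prior of 0.5 below are assumptions for illustration:

```python
# Rows of (city, signed), ordered by time; values are invented.
rows = [("A", 1), ("B", 0), ("A", 0), ("A", 1), ("B", 1)]

sums, counts = {}, {}
encoded = []
prior = 0.5  # fallback when a city has no history yet (an assumed default)
for city, signed in rows:
    past_mean = sums[city] / counts[city] if counts.get(city) else prior
    encoded.append(past_mean)
    # Only now fold the current row into the running statistics,
    # so no row ever sees its own label or any future label.
    sums[city] = sums.get(city, 0) + signed
    counts[city] = counts.get(city, 0) + 1
# encoded → [0.5, 0.5, 1.0, 0.5, 0.0]
```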



Thank you for your answer. I'm not quite familiar with the concept of data points, but this makes sense! – Blenz – 2019-03-20T08:29:55.970

This last part about future data is important. It's also important not to use data from the current date, as this would also be a data leak. – Dan Carter – 2019-03-20T12:28:09.727

Yes, this is exactly what I was looking for! – Blenz – 2019-03-22T17:01:45.567


No, you should not do this: it causes a data leak. Data leaks happen when the data you use to train a machine learning algorithm contains the information you are trying to predict.

It gives your model information about your test data during training. It will make your test scores overly optimistic and make the model worse at generalizing to totally unseen data.
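A tiny demonstration of why the naive version leaks, taken to the extreme of a unique ID per row (the data is invented; with real city codes the effect is milder but the mechanism is the same):

```python
# With a unique ID per row, the naive "mean of signed per group" feature
# is literally a copy of the label, so any model scores perfectly on it
# during validation and learns nothing useful for truly new rows.
rows = [(f"id_{i}", i % 2) for i in range(6)]  # unique ID, label 0/1

means = {}
for rid, signed in rows:
    means.setdefault(rid, []).append(signed)
encoded = [sum(means[rid]) / len(means[rid]) for rid, _ in rows]

labels = [signed for _, signed in rows]
# encoded == labels here: the feature has copied the target.
```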

Good resource on data leakage

Simon Larsson


Thanks for the answer. What would you recommend when IDs are given as input in a dataset (as with department_code)? What's a common approach for that kind of data to improve my model? – Blenz – 2019-03-19T15:34:23.407

Glad it helped! That is really a new question, not directly related to this one, and not really for the comments. If you feel that the question you asked here has been answered, mark this as correct. If you have new things you want answered, you should open a new question. – Simon Larsson – 2019-03-19T16:17:20.360

Your comment makes sense, but my results contradict your answer. I'm waiting for maybe a better answer; if none comes, I'll mark yours as correct (again, it makes sense that I shouldn't give away information about the values I want to predict in the features). Note: I'm posting my results with the old feature and the new feature on unseen data! – Blenz – 2019-03-19T16:27:07.493

How does the result contradict my answer? You got a higher score on both the train and test sets, which is what you expect from data leakage. Or is there something I am missing? I am only addressing "shouldn't I create features based on the output?". – Simon Larsson – 2019-03-19T16:38:26.203

Are you saying that if I change the test dataset, I will get a completely different result, probably lower? If not, isn't the whole purpose of this to get higher accuracy on unseen data? Correct me if I'm wrong; thanks in advance! – Blenz – 2019-03-19T16:46:44.593

Yes, that is what I am saying. Data leakage gives you an overly optimistic test score. – Simon Larsson – 2019-03-19T16:53:53.220