So, I'm working on an ML model whose potential predictors are: age, a code for the person's city, their social status (married/single and so on), and their number of children. The output, 'signed', is binary (0 or 1). That's the initial dataset I have.
The goal is to predict the value of 'signed' for each person based on those features.
I already generated predictions on unseen data. When I validated the predicted results against the real data, I got 25% accuracy, while cross-validation gives me 65% accuracy. So I thought: over-fitting.
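To make the comparison concrete, here is a minimal sketch (assuming scikit-learn, with made-up stand-in data since the real dataset isn't shown) of computing both numbers side by side: the cross-validation estimate on the training data and the score on a held-out set. A large gap between the two is what usually signals over-fitting or a distribution shift between the training file and the unseen data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data: 4 numeric columns playing the role of age, city,
# status and number of children; 'signed' depends on the first column.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression()
# 5-fold cross-validation accuracy, estimated on the training split only.
cv_acc = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy").mean()
# Accuracy on the held-out split, the analogue of the "unseen data" score.
test_acc = model.fit(X_tr, y_tr).score(X_te, y_te)
```

With a healthy model the two scores should be close; 65% vs 25% is far beyond normal fold-to-fold noise.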
Here is my question. I went back to the early stages of the whole process and started creating new features. For example, instead of the raw city code (which makes no sense as input to an ML model), I created classes based on the percentage of 'signed': a city with a higher percentage of 'signed' (the output) gets assigned a higher value of class_city. This greatly improved the relationship between 'signed' and class_city in my correlation matrix, which makes sense. Is what I'm doing correct, or shouldn't I create features based on the output? Here is my CM:
After re-modelling with only 3 features (department_class, age and situation), I tested my model on unseen data of 148 rows, compared to 60k rows in the training file.
The first model, with the old feature (the department ID), gave 25% accuracy, while the second model, with the new feature class_department, gave 71% (again on unseen data).
Note: the first model (the 25% one) also has some other ID features; they might be contributing to such weak accuracy along with department_ID.
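For reference, here is a sketch of how a class_city / class_department-style feature could be built without leaking the output: the per-category mean of 'signed' is computed on the training rows only, then mapped onto other data, with unseen categories falling back to the global training mean. Column names ('city', 'signed') are assumptions matching the post; this is pandas-based and not necessarily how the original feature was built.

```python
import pandas as pd

def target_encode(train: pd.DataFrame, other: pd.DataFrame,
                  col: str = "city", target: str = "signed"):
    """Replace a categorical column with the training-set mean of the target."""
    global_mean = train[target].mean()
    # Per-category mean of 'signed', computed on the training rows only.
    means = train.groupby(col)[target].mean()
    train_enc = train[col].map(means)
    # Categories never seen in training fall back to the global mean.
    other_enc = other[col].map(means).fillna(global_mean)
    return train_enc, other_enc

# Tiny illustration with made-up data.
train = pd.DataFrame({"city": ["A", "A", "B", "B"], "signed": [1, 1, 0, 1]})
test = pd.DataFrame({"city": ["A", "C"], "signed": [0, 1]})
tr_enc, te_enc = target_encode(train, test)
```

The key point is that 'means' is fitted on the training split alone; if the percentages are computed over all the data (training plus the 148 validation rows), the feature has already seen the answers and the 71% figure would be optimistic.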