I am working on a classification project in which some features are linked and I'm not sure how to handle them.
Here is a simplified version of my project:
- There are different jobs, and several people working on those jobs.
- Each person is well suited to their job.
- Replacing someone can only have a negative effect or no effect.
Person1 can be replaced by person2 on a given job, and the goal is to predict whether the replacement has a negative impact or not.
Each person has their own properties, such as weight, age, height, IQ, ...
A job also has properties, such as manualJob, location, temperature, ...
When I list my features, I get something like this:
(Person1 = P1, Person2 = P2, and the data has been normalized)

       P1_weight  P2_weight  P1_IQ  P2_IQ  manualJob  tempJob  neg_impact
    0       0.25       0.50   0.25   0.25          1     0.25           1
    1       0.75       0.25   0.50   0.25          0     0.50           0
    2       0.50       0.75   0.75   0.50          1     0.25           1
    ...

There should be a strong interaction between the P1_weight and P2_weight features (and likewise between the IQ features), which we want to capture in order to predict the neg_impact target.
1. The change between P1_weight and P2_weight is important, but can a standard model such as RandomForest capture the link between those two features? The same question applies to the other person properties (P1_IQ and P2_IQ, P1_height and P2_height, ...).
2. I'm afraid that if I collapse P1_weight and P2_weight into a single difference feature (P1_weight - P2_weight), I will lose some information. For example, P1_weight is probably correlated with the manualJob feature; if I delete P1_weight, would that information be lost?
3. I was thinking that I could preprocess those linked features with out-of-fold predictions, and use those predictions as inputs alongside the remaining original features (manualJob, tempJob, ...). Is that a good idea? Which kind of model would be best for this preprocessing step, to capture the correlation between the linked features?
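
To make points 1 and 2 concrete, here is a minimal sketch of the alternative I am considering: keeping the raw P1/P2 columns and adding the differences next to them, instead of replacing them. It assumes pandas; the helper name `add_pair_differences` and the `_diff` column suffix are made up for illustration, and the three rows are just the sample rows from my table.

```python
import pandas as pd

# The three sample rows from the table above (values already normalized).
df = pd.DataFrame(
    {
        "P1_weight": [0.25, 0.75, 0.50],
        "P2_weight": [0.50, 0.25, 0.75],
        "P1_IQ": [0.25, 0.50, 0.75],
        "P2_IQ": [0.25, 0.25, 0.50],
        "manualJob": [1, 0, 1],
        "tempJob": [0.25, 0.50, 0.25],
        "neg_impact": [1, 0, 1],
    }
)


def add_pair_differences(frame, properties):
    """Return a copy of frame with one P1 - P2 difference column per property.

    Hypothetical helper: the original P1_* and P2_* columns are kept, so
    whatever relates them to the job features is still available to the model.
    """
    out = frame.copy()
    for prop in properties:
        out[f"{prop}_diff"] = out[f"P1_{prop}"] - out[f"P2_{prop}"]
    return out


features = add_pair_differences(df, ["weight", "IQ"])
print(features[["P1_weight", "P2_weight", "weight_diff", "IQ_diff"]])
```

With this layout the model sees both the explicit change (`weight_diff`) and the absolute levels (`P1_weight`, `P2_weight`), at the cost of some redundant columns.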
NB: my dataset contains only ~1000 samples.
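
For point 3, this is roughly what I have in mind, written as a sketch rather than a finished pipeline. It assumes scikit-learn. The data is random placeholder data with about the same size as my problem, the synthetic target is invented only so the code runs, and the choice of `LogisticRegression` for the first stage and `RandomForestClassifier` for the final model is an assumption on my part, as are the names `pair_cols`, `job_cols` and `pair_score`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
n = 1000

# Placeholder data only: random values standing in for the real features.
X = pd.DataFrame(
    rng.random((n, 6)),
    columns=["P1_weight", "P2_weight", "P1_IQ", "P2_IQ", "manualJob", "tempJob"],
)
X["manualJob"] = (X["manualJob"] > 0.5).astype(int)
# Invented target so the example runs: 1 when P2 is clearly lighter than P1.
y = ((X["P1_weight"] - X["P2_weight"]) > 0.2).astype(int)

pair_cols = ["P1_weight", "P2_weight", "P1_IQ", "P2_IQ"]
job_cols = ["manualJob", "tempJob"]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stage 1: out-of-fold probabilities from the linked person features only.
# Each row's prediction comes from a model fitted on the other folds.
oof = cross_val_predict(
    LogisticRegression(), X[pair_cols], y, cv=cv, method="predict_proba"
)[:, 1]

# Stage 2: job features plus the out-of-fold score as one extra column.
X_stage2 = X[job_cols].copy()
X_stage2["pair_score"] = oof

final_model = RandomForestClassifier(n_estimators=100, random_state=0)
final_model.fit(X_stage2, y)
print(X_stage2.shape)
```

Note that this only shows how the stacked features would be built; to evaluate the second stage honestly, the whole two-stage procedure would need its own outer cross-validation loop.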