What should be the criteria to select features using correlation factors between features?


For the Titanic dataset, I have done some feature engineering (one-hot encoded the features) and have now plotted a heatmap to view the correlation between the different features.

  1. I don't understand what to do with these values. Say two features are highly correlated: e.g., in the image, Name_title_mr and sex_0 (i.e., male) have a correlation coefficient of 0.84. Does that mean I should drop Name_title_mr, since sex_0 is a very important feature? (We know from experience that sex is very important in the Titanic dataset, but is there any way to see this just by looking at the heatmap?)

  2. Another doubt: how would I know that I can simply add two features together, like sibsp and parch? I have seen many kernels that create a single feature, no_of_family_members, by adding sibsp and parch.

Would adding the two features together be helpful? And should I drop the individual features sibsp and parch in that case?

  3. sibsp_1 and sibsp_0 (sibsp_1 is when sibsp=2 and sibsp_0 is when sibsp=1) seem to be highly negatively correlated. Should I take some action on this as well?

Tushar Seth

Posted 2019-08-17T08:11:00.553

Reputation: 123

Question was closed 2019-08-18T20:45:35.530



There are several questions here, but let me try to answer the overall issue: how do I know which features to drop or add to make the best model?

  • There is no rule that says “always drop all but one feature if a correlation of X exists”. I usually start by looking at pairs whose absolute correlation is above 0.8, but you should test on a per-dataset basis.

  • Feature engineering (creating new features or eliminating existing ones) is the art side of analytics. Take into account what you know about the dataset and what you can learn from data analysis and experimentation. Often not all of the experiments are captured in kernels (especially those without a positive impact), so the final posted solution can look like magic to people reviewing it.
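As a rough illustration of the 0.8 starting point, here is a minimal sketch that lists feature pairs whose absolute correlation exceeds a threshold. The helper name `correlated_pairs` is hypothetical, and the data is synthetic toy data that only borrows the column names from the question; it is not the real Titanic set. The output is a list of candidates to experiment with, not a decision about what to drop.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |r| > threshold.

    Hypothetical helper for illustration; uses Pearson correlation via df.corr().
    """
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    # and the diagonal (a feature with itself) is skipped.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Synthetic stand-in for a one-hot encoded frame (assumption: NOT real Titanic data).
rng = np.random.default_rng(0)
n = 500
sex_0 = rng.integers(0, 2, size=n)  # 1 = male in this toy encoding
# The "Mr" title matches sex_0, except that about 10% of rows are forced to 0.
name_title_mr = np.where(rng.random(n) < 0.1, 0, sex_0)

df = pd.DataFrame(
    {
        "sex_0": sex_0,
        "Name_title_mr": name_title_mr,
        "sibsp": rng.integers(0, 4, size=n),  # independent noise columns
        "parch": rng.integers(0, 3, size=n),
    }
)

pairs = correlated_pairs(df, threshold=0.8)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

For each flagged pair, the practical test is to fit the model with both features, then with each one removed, and keep whichever variant scores best on held-out data. The same experiment applies to a combined feature such as sibsp + parch: compare the model with the sum, with the individual columns, and with both.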


Posted 2019-08-17T08:11:00.553

Reputation: 214