What can be done with highly correlated variables (correlation > .95 or < -.95)?



I believe we can remove highly correlated variables based on feature importance, or perhaps with PCA, etc.

Is there anything else we can do with highly correlated variables?

Thanks in advance!

Vivek Ananthan

Posted 2020-02-07T12:56:03.360

Reputation: 205



An alternative to the one provided by @Kasra is dimensionality reduction. It's another way of solving your multicollinearity problems, while avoiding deleting variables more or less arbitrarily.

You can use simpler, linear techniques such as PCA, or more complex non-linear techniques such as autoencoders. t-SNE is a non-linear technique that is typically used for visualization; I do not recommend using it on a training set.
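As a minimal sketch of the PCA route (assuming scikit-learn is available; the data and variable names here are made up for illustration), two near-duplicate features collapse into a single principal component, so keeping enough components to explain most of the variance removes the redundancy without hand-picking which column to drop:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 200 samples, two nearly identical features plus one independent one.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,                                       # feature 1
    x + rng.normal(scale=0.05, size=200),    # feature 2, correlation with feature 1 ~ 1
    rng.normal(size=200),                    # feature 3, independent
])

# A float n_components keeps just enough components to explain
# that fraction of the total variance.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # fewer columns than X
```

Note that the components are linear combinations of the original features, so some interpretability is traded away in exchange for decorrelated inputs.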



Reputation: 4 928


I think merging such correlated features into a new one would also be a good idea. That way we would not lose any information.

For example, summing the values of the correlated features and taking their average would be a very basic option.
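The averaging idea can be sketched in a few lines of pandas (the column names `f1`, `f2`, and `f_merged` are hypothetical):

```python
import pandas as pd

# Toy frame with two highly correlated columns, f1 and f2.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [1.1, 2.1, 2.9, 4.2],
    "other": [10.0, 20.0, 30.0, 40.0],
})

# Replace the pair with their element-wise mean, then drop the originals.
df["f_merged"] = df[["f1", "f2"]].mean(axis=1)
df = df.drop(columns=["f1", "f2"])
print(df.columns.tolist())  # ['other', 'f_merged']
```

Averaging only makes sense when the features are on comparable scales; otherwise standardize them first, or the larger-scale feature will dominate the merged column.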

vipin bansal


Reputation: 1 322


You need to remove them. Redundant features only increase computation time and add model complexity with no benefit, which makes interpreting the model/analysis harder. And if there are many of them, removing them prunes your vector space and improves the density of information across its dimensions (this helps, e.g., in finding nearest neighbors).
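A common way to do this removal automatically is to scan the absolute correlation matrix and drop one column from each pair above the question's .95 threshold. A minimal sketch with pandas (the helper name and data are hypothetical):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Drop one column from each pair whose absolute Pearson
    correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once,
    # and a column is never compared with itself.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(42)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=100),  # near-duplicate of a
    "c": rng.normal(size=100),                      # independent
})
print(drop_highly_correlated(df).columns.tolist())  # drops 'b', keeps 'a' and 'c'
```

Which member of a pair gets dropped here is just column order; if you care which one survives, sort the columns by feature importance first so the more informative feature is kept.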

Kasra Manshaei


Reputation: 5 323

In the situation of two highly correlated features, simply choosing one does the same job. The question mentioned .95, which means they are practically the same. Plus, a merging strategy would need to be chosen. What is your idea for merging? – Kasra Manshaei – 2020-02-07T17:44:43.727