Detecting redundancy with Pearson correlation in continuous features

2

1

I have a set of variables that I want to use for a regression or a classification problem. Having computed the correlation matrix of these variables, I discovered that some pairs of them have Pearson correlation values as high as 1.

  1. Does this mean that these variables hold redundant information for the learner?
  2. Is it safe to remove one of them without risking information loss? If yes, how do I choose which one to remove?

MedAli

Posted 2016-03-12T17:44:12.350

Reputation: 245

Answers

4

If the correlation between two features $x_1$ and $x_2$ is 1, that means you can write $x_1 = c\cdot x_2 + a$ for some constants $c > 0$ and $a$. The only new information is those two constants; once you know them, every value of one feature can be recovered from the other. I highly doubt there is anything a machine learning algorithm can learn from this, and for some algorithms this kind of correlation between features can hurt performance quite a bit. I would test it, but I would say it's very likely you can remove one of the two, and which one you remove is not going to matter.
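
To make this concrete, here is a minimal sketch using NumPy on synthetic data. The helper `drop_perfectly_correlated` and the tolerance `tol` are my own illustrative choices, not a standard API, and the sketch assumes no feature is constant (a constant column has an undefined correlation).

```python
import numpy as np

def drop_perfectly_correlated(X, tol=1e-8):
    """Return the column indices to keep, dropping any column whose absolute
    Pearson correlation with an already kept column is within tol of 1.
    Hypothetical helper for illustration; assumes no constant columns."""
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < 1.0 - tol for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
x2 = rng.normal(size=200)
x1 = 3.0 * x2 + 5.0           # exact affine function of x2, so correlation is 1
x3 = rng.normal(size=200)     # unrelated feature
X = np.column_stack([x1, x2, x3])

r = np.corrcoef(x1, x2)[0, 1]        # Pearson correlation of x1 and x2
c, a = np.polyfit(x2, x1, deg=1)     # recover the two constants c and a
kept = drop_perfectly_correlated(X)  # x2 is dropped because x1 came first
```

Here the first column of each perfectly correlated pair is kept, which is an arbitrary choice; as said above, it should not matter which one goes.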

Jan van der Vegt

Posted 2016-03-12T17:44:12.350

Reputation: 8 538

2

Yes and Yes.

Variance Inflation Factor is a common method for addressing your concerns.

vif answer

https://en.m.wikipedia.org/wiki/Variance_inflation_factor
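
As a rough sketch of the idea, the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all the other features. The function name `vif` and the synthetic data below are my own assumptions for illustration; this uses plain NumPy least squares rather than any particular library's implementation.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (illustrative sketch).
    Each column is regressed on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        # A perfectly explained feature has an unbounded VIF.
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.05 * rng.normal(size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)               # independent feature
vifs = vif(np.column_stack([x1, x2, x3]))
```

The two nearly identical features get very large VIFs while the independent one stays close to 1. Cutoffs such as 5 or 10 are often quoted as rules of thumb, but treat them as conventions rather than hard limits.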

Correlation-based feature selection is another approach.

https://en.wikipedia.org/wiki/Feature_selection#Correlation_feature_selection
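
A small sketch of the merit score described on that page, $\text{Merit} = k\,\bar{r}_{cf} / \sqrt{k + k(k-1)\,\bar{r}_{ff}}$, where $\bar{r}_{cf}$ is the mean feature-target correlation and $\bar{r}_{ff}$ the mean feature-feature correlation within a subset of $k$ features. Using absolute Pearson correlations and the name `cfs_merit` are my assumptions here; the data is synthetic.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Merit of a feature subset: rewards correlation with the target and
    penalises correlation among the features themselves (illustrative sketch)."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf  # no feature pairs, so the penalty term vanishes
    pairs = [abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
             for pos, i in enumerate(subset) for j in subset[pos + 1:]]
    r_ff = np.mean(pairs)
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = x1 + 0.01 * rng.normal(size=500)    # redundant near-copy of x1
x3 = rng.normal(size=500)
y = x1 + x3 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

merit_without_copy = cfs_merit(X, y, [0, 2])
merit_with_copy = cfs_merit(X, y, [0, 1, 2])
```

Adding the redundant copy lowers the merit, which is how this criterion steers a search away from keeping both members of a highly correlated pair.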

Chris

Posted 2016-03-12T17:44:12.350

Reputation: 221