Correlation vs Multicollinearity

I have been taught to check the correlation matrix before applying any algorithm. I have a few questions around this:

1. Pearson Correlation is for numerical variables only.

What if we have to check the correlation between a continuous and categorical variable?

2. I read an answer where Peter Flom mentioned that there can be scenarios where the correlation is not significant, but the two variables can still be multicollinear. How is that possible?

3. Is removing the variable the only solution?

I was asked in an interview: if we remove one of each pair of correlated variables, how can multicollinearity still arise, given that pandas.corr() checks the correlation between all variables?

4. How is multicollinearity different from correlation?

Great answer by Leevo, just let me point out one thing: perfect multicollinearity means that one variable is an exact linear function of another. Say you have $$x_1$$ and $$x_2$$, where $$x_2 = \gamma x_1$$. This causes various problems, as discussed in this post.

The main takeaway (to put it simply) is that $$x_1$$ and $$x_2$$ basically carry the same information (just "scaled" by $$\gamma$$ in the case of $$x_2$$). So there is no benefit in including both. In fact, including both is a problem, since multicollinearity will "confuse" the model: there is no unique effect of $$x_1$$ and $$x_2$$, considered jointly, on some outcome $$y$$.

Look at the following situation (R code):

y = c(5,2,9,10)
x1 = c(2,4,6,8)                           ### = 2   * x2
x2 = c(1,2,3,4)                           ### = 0.5 * x1
cor(x1, x2, method = c("pearson"))


The correlation between $$x_1$$ and $$x_2$$ equals 1 (naturally, since one is an exact linear function of the other). Now when I fit a simple OLS linear regression:

lm(y~x1+x2)


The result is:

Coefficients:
(Intercept)           x1           x2
        1.0          1.1           NA


The second term has been dropped by R (due to perfect multicollinearity).

We can run a regression on each term separately:

Call:
lm(formula = y ~ x1)

Coefficients:
(Intercept)           x1
        1.0          1.1


...and...

Call:
lm(formula = y ~ x2)

Coefficients:
(Intercept)           x2
        1.0          2.2


Now you can see that the coefficient $$\beta_2$$ is simply $$2\beta_1$$, because $$x_1 = 2 x_2$$. So there is nothing to learn from including both $$x_1$$ and $$x_2$$, since they carry no additional information.

Basically the same problem can occur when the correlation between $$x_1$$ and $$x_2$$ is very high. See some more discussion in this post. Thus, given strong correlation, one should be cautious about including both variables. The reason is that in this case your model cannot really tell apart the effects of $$x_1$$ and $$x_2$$ on some outcome $$y$$, so you may end up with weak predictions (among other problems).
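To check for this in practice, a standard diagnostic is the variance inflation factor, $$\mathrm{VIF}_j = 1/(1 - R_j^2)$$, where $$R_j^2$$ comes from regressing predictor $$j$$ on all the others. Here is a minimal sketch in Python (numpy only; the data is simulated purely for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.
    VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared of
    regressing column j on all other columns (with intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)   # strongly correlated with x1
x3 = rng.normal(size=100)              # unrelated to x1 and x2
print(vif(np.column_stack([x1, x2, x3])))
```

A common rule of thumb treats VIF above roughly 5–10 as a sign of problematic multicollinearity; here the first two columns blow up while the third stays near 1.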

I'll go through your questions one by one:

What if we have to check the correlation between a continuous and categorical variable?

One option is to use the point-biserial correlation. You can read more here. That's not the only option, of course; you can find a good series of examples here.
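As a small sketch (Python with numpy; the group/score data below is made up), the point-biserial correlation is literally Pearson r computed against the 0/1-coded binary categorical variable:

```python
import numpy as np

# Hypothetical data: a binary group label and a continuous measurement.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
score = np.array([2.1, 1.8, 2.5, 2.0, 3.6, 3.9, 3.2, 3.5])

# Point-biserial formula: (M1 - M0) / s * sqrt(p * (1 - p)),
# with group means M1, M0, population std s, and share of 1s p.
m1, m0 = score[group == 1].mean(), score[group == 0].mean()
p = group.mean()
r_pb = (m1 - m0) / score.std() * np.sqrt(p * (1 - p))

# Identical to Pearson r on the 0/1-coded variable:
r_pearson = np.corrcoef(group, score)[0, 1]
print(r_pb, r_pearson)
```

The two numbers agree to floating-point precision, which is why a 0/1-coded binary variable can go straight into an ordinary correlation matrix.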

Removing the variable is the only solution?

No, it's not: you can use dimensionality-reduction techniques to "summarize" multicollinear variables. That's what I usually do to control multicollinearity, and I strongly prefer it to removing a variable arbitrarily. The most common technique is Principal Component Analysis, but the list is really endless. Other very common techniques are t-SNE and, if you are into neural networks, autoencoders.
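A rough sketch of the PCA route (Python, numpy only; the nearly collinear columns are simulated for illustration): the first principal component absorbs almost all of the shared variance, so it can stand in for both variables.

```python
import numpy as np

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)   # nearly collinear with x1

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                 # center before PCA

# PCA via SVD: rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)
print(explained)                        # first component dominates

# Replace the two collinear columns by a single feature:
pc1 = Xc @ Vt[0]
```

Feeding `pc1` to the regression instead of both columns keeps (almost all of) the information while removing the collinearity.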

How is multicollinearity different from correlation?

Correlation measures the association between two variables. This association can be either very noisy or not. Two variables can be highly correlated but their scatterplot can be very "spread out".

Multicollinearity, instead, is a stronger concept. It happens when two variables are (almost) exactly linearly associated, so that the variation of one explains the variation of the other in great detail. That is a problem for regressions, since a small change in the data can completely mess up the estimates of your parameters. That doesn't happen with all correlated variables.

Of course there is some connection between the two. Two variables that are highly multicollinear must by definition be highly correlated as well, but they are not the same. Most importantly, multicollinearity is a problem for the reliability of your model, while correlation is not.
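That distinction can be made concrete with a toy simulation (Python, numpy; the data is invented). Both pairs below are strongly correlated, but only the second is perfectly collinear, which shows up as a rank-deficient design matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
x1 = rng.normal(size=500)

noisy = 2 * x1 + rng.normal(size=500)   # correlated, scatterplot "spread out"
exact = 2 * x1                          # perfectly collinear, no scatter

print(np.corrcoef(x1, noisy)[0, 1])     # high, but clearly below 1
print(np.corrcoef(x1, exact)[0, 1])     # 1 (up to floating-point error)

# Perfect multicollinearity makes the design matrix rank-deficient:
print(np.linalg.matrix_rank(np.column_stack([x1, noisy])))  # 2
print(np.linalg.matrix_rank(np.column_stack([x1, exact])))  # 1
```

The rank-1 case is exactly the situation where R dropped the `x2` coefficient above: the design matrix carries only one column's worth of information.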

Thanks, Leevo. About the last point, I am still confused. Do you mean that collinearity gives a quantifiable measurement of the change in one variable that is brought about by a unit change in another variable? – Payal Bhatia – 2019-08-07T06:37:25.033

To my understanding, multicollinearity happens when the variables are "way too similar". That's probably an excess of correlation, to such an extent that the parameters of your regressions are tilted and not reliable enough. – Leevo – 2019-08-07T06:39:49.410

Leevo, can you cite an example or elaborate on the point where you say two variables can be highly correlated but their scatterplot can be very "spread out"? – GadaaDhaariGeek – 2019-08-21T11:00:31.940

Pearson Correlation is for numerical variables only.

Ans: Not strictly; a binary categorical variable coded as 0/1 works too.

What if we have to check the correlation between a continuous and categorical variable? The Pearson r coefficient (which, applied to a 0/1-coded binary variable, is the point-biserial correlation).

I read an answer where Peter Flom mentioned that there can be scenarios where correlation is not significant but two variables can be multicollinear?

Ans: Peter is correct.

Removing the variable is the only solution?

No. It depends on your problem and specific objectives.