Is the Pearson coefficient a good indicator of dependency between variables?

3

1

I was once asked how I would calculate the correlation between two time series. Since I am new to data science, I answered: "I would just calculate the Pearson correlation coefficient". That wasn't a good answer: as the figure below demonstrates, the dependency between two variables may be nonlinear, in which case the Pearson coefficient can be close to $0$ (parabola, circle). I have seen people on Kaggle who always start with a correlation matrix and discard variables that are not correlated.

My question is: is the Pearson coefficient always a good measure of correlation between variables, and should we always rely on it? Can you give a practical example where some variable in your problem was important, but Pearson indicated hardly any correlation?

Figure: artificially generated data

WoofDoggy

Posted 2018-05-30T09:08:53.913

Reputation: 303

Answers

1

As you show, the Pearson coefficient is not a good general measure of how variables depend on each other, because it only captures linear relationships. A better measure is distance correlation. A nice property of distance correlation is that a value of 0 implies independence. An easy example that can happen in real life is a variable that is the square of another variable, as you show in your first picture. In this case, if the original variable is distributed symmetrically around zero, the Pearson correlation will be 0, but the distance correlation will be around 0.5. I think it is a mistake for Kagglers to select variables based on Pearson correlations alone.
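The parabola case can be checked numerically. The block below is a minimal sketch using only the standard library: it assumes an evenly spaced, symmetric grid for x and y = x², and it uses the simple biased (V-statistic) sample estimator of distance correlation. The helper names `pearson`, `_centered_distances` and `distance_correlation` are illustrative choices, not part of any particular package.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def _centered_distances(v):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    # d is symmetric, so the column means equal the row means.
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation (biased V-statistic estimator)."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))


# Assumed toy data: symmetric grid on [-1, 1] and an exact parabola.
x = [i / 100 for i in range(-100, 101)]
y = [v * v for v in x]

r = pearson(x, y)
dcor = distance_correlation(x, y)
print(f"Pearson: {r:.3f}, distance correlation: {dcor:.3f}")
```

On this grid the Pearson coefficient vanishes up to rounding, while the distance correlation comes out at roughly 0.5. The pure-Python estimator is O(n²) in time and memory, so for real data a vectorised implementation would be the more practical choice.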

David Masip

Posted 2018-05-30T09:08:53.913

Reputation: 5 101

Thanks for the answer @David Masip! The example with quadratic dependency works if I have both arms of the parabola. If only one arm is visible (let's say I only have positive values), then Pearson would be close to $1$. Distance correlation may be a good way to decide whether two variables are independent. – WoofDoggy – 2018-05-30T11:14:21.220

I totally agree! – David Masip – 2018-05-30T12:26:53.210

2

Can you give a practical example where some variable in your problem was important, but Pearson indicated hardly any correlation?

Sure: whenever there is an underlying nonlinear relation between two random variables. Pearson correlation measures linear relations only, so a low value does not rule out a nonlinear one.

pcko1

Posted 2018-05-30T09:08:53.913

Reputation: 3 275

Nice @pcko1. So should I perform some additional tests for nonlinear relations afterwards? – WoofDoggy – 2018-05-30T11:15:24.430

Absolutely. I would personally check also the distance correlation, as suggested above :) – pcko1 – 2018-05-30T11:22:23.387

1

If you are considering time series only, you may have the option of running a linear regression model, treating one variable as dependent.

If you can get a good R² and a well-behaved residual plot, using whatever transformation is needed to make the model linear, then you may be able to conclude that the two variables are related.
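As a rough illustration of the transformation idea, the sketch below fits a simple least-squares line twice: once on the raw predictor and once on its square. It assumes synthetic data (a noisy parabola with a fixed random seed) and assumes you already suspect a quadratic form; `r_squared` is a hypothetical helper written for this example.

```python
import random


def r_squared(x, y):
    """R^2 of an ordinary least-squares fit y ~ intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Assumed toy data: y = x^2 plus small Gaussian noise, seeded for repeatability.
rng = random.Random(0)
x = [i / 100 for i in range(-100, 101)]
y = [v * v + rng.gauss(0.0, 0.05) for v in x]

r2_raw = r_squared(x, y)                       # regress y on x
r2_squared = r_squared([v * v for v in x], y)  # regress y on x^2

print(f"R^2 on x: {r2_raw:.3f}, R^2 on x^2: {r2_squared:.3f}")
```

The fit on the raw predictor explains almost none of the variance, while the fit on the squared predictor explains most of it. This only helps when a suitable transformation is found, so a poor fit by itself says nothing about independence.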

Elliot

Posted 2018-05-30T09:08:53.913

Reputation: 971

Nonlinear relations are not exhibited by such analysis though. – pcko1 – 2018-05-30T09:30:14.363

What do you mean by "exhibited"? – Elliot – 2018-05-30T09:34:17.860

Demonstrated, revealed, etc. Having white-noise residuals in your correlation analysis does not mean that there aren't nonlinear relations between the variables. Therefore, you cannot infer from this technique that there is no dependency, which is what the OP asked about. – pcko1 – 2018-05-30T09:37:44.930

True. But if you find white-noise residuals with a good R² for a polynomial regression, or for a regression on transformed data, then it shows that the relation between the two variables is well captured by your coefficients or transformation, and the analysis therefore exhibits it. – Elliot – 2018-05-30T09:57:27.840