I was once asked how I would calculate the correlation between two time series. Since I am new to data science, I answered: "I would just calculate the Pearson correlation coefficient". That wasn't a good answer: as demonstrated in the figure below, the dependency between two variables may not be linear, in which case the Pearson coefficient can be close to $0$ (parabola, circle). I have seen people on Kaggle always start with a correlation matrix and discard data that are not correlated.
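To make the parabola and circle cases concrete, here is a minimal sketch, assuming NumPy is available. The data are synthetic and noise-free, and the helper name `pearson` is just something I made up for illustration. In both nonlinear cases $y$ clearly depends on $x$, yet the coefficient comes out at approximately $0$, while the straight line gives approximately $1$.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays (hypothetical helper)."""
    return float(np.corrcoef(x, y)[0, 1])


# Symmetric grid around zero, so both arms of the parabola are present.
x = np.linspace(-1.0, 1.0, 201)

# Linear dependency as a control case.
r_line = pearson(x, 2.0 * x + 1.0)

# Quadratic dependency: y is fully determined by x, but not linearly.
r_parabola = pearson(x, x ** 2)

# Points evenly spaced on the unit circle.
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
r_circle = pearson(np.cos(t), np.sin(t))

print(f"line:     {r_line:.3f}")
print(f"parabola: {r_parabola:.3f}")
print(f"circle:   {r_circle:.3f}")
```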

My question is: is the Pearson coefficient always a good measure of correlation between variables, and should we always rely on it? Can you give a practical example where some variable in your problem was important, but Pearson indicated there was hardly any correlation?

Thanks for the answer, @David Masip! The example with quadratic dependency would be good if I had both arms of the parabola. If only one arm is visible (let's say I only have positive values), then Pearson would be close to $1$. Distance correlation may be a good way to decide whether two variables are independent. – WoofDoggy – 2018-05-30T11:14:21.220

I totally agree! – David Masip – 2018-05-30T12:26:53.210
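
The point raised in the comments above can be checked numerically. The following is only a rough sketch, assuming NumPy and noise-free synthetic data. The function `distance_correlation` is a hand-written illustration of the sample (biased, V-statistic) distance correlation for 1-D inputs, not a library API, and it builds full pairwise distance matrices, so it is only suitable for small samples.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation of two 1-D arrays (illustrative, O(n^2) memory)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Pairwise absolute distances within each variable.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    dcov2 = (A * B).mean()   # squared distance covariance
    dvar_x = (A * A).mean()  # squared distance variance of x
    dvar_y = (B * B).mean()  # squared distance variance of y

    # Guard against tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar_x * dvar_y)))


def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])


# Both arms of the parabola: Pearson misses the dependency.
x_full = np.linspace(-1.0, 1.0, 201)
r_full = pearson(x_full, x_full ** 2)
dcor_full = distance_correlation(x_full, x_full ** 2)

# Only the positive arm: the relationship is monotone, so Pearson is high.
x_half = np.linspace(0.0, 1.0, 201)
r_half = pearson(x_half, x_half ** 2)
dcor_half = distance_correlation(x_half, x_half ** 2)

print(f"both arms: pearson={r_full:.3f}, dcor={dcor_full:.3f}")
print(f"one arm:   pearson={r_half:.3f}, dcor={dcor_half:.3f}")
```

With both arms present, Pearson is approximately $0$ while the distance correlation is clearly positive; with one arm only, both measures are high.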