Is there any logic to adding a threshold to see if two variables are related?


I have two variables $X$ and $Y$ observed as pairs $(x, y)$, and I want to see if there is a relationship between them. I can quantify this by computing the correlation coefficient.

However, I found that by selecting an arbitrary subset of the data (e.g. $\{(x, y) \mid x > k\}$ for some threshold $k$), I can get a higher correlation coefficient and an apparently stronger result. Is doing so mathematically sound? To put it simply, I have no a priori reason to believe that certain data points are "more important" than others.

Jack Black

Posted 2018-08-03T05:36:44.813

Reputation: 23



No, it's not sound. You're doing data dredging.

Try this thought experiment. Generate random points, so there is no actual relationship between $x$ and $y$. Now compute three correlation coefficients: one over all the points, one over the points with $x > k$, and one over the points with $x < k$. Almost always, one of the two subset values will be larger in magnitude than the correlation over all the points. So for any data set you can pick a threshold that increases the correlation coefficient, even when the points are pure noise.
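The thought experiment is easy to run. Here is a minimal sketch in Python with NumPy; the threshold $k$ is chosen arbitrarily as the median of $x$, and the seed is just for reproducibility:

```python
# Random (x, y) pairs with no real relationship, then correlations on the
# full sample and on the two threshold subsets.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)  # independent of x by construction

k = np.median(x)  # arbitrary threshold for illustration
r_all = np.corrcoef(x, y)[0, 1]
r_above = np.corrcoef(x[x > k], y[x > k])[0, 1]
r_below = np.corrcoef(x[x <= k], y[x <= k])[0, 1]

print(f"all points: r = {r_all:+.3f}")
print(f"x > k:      r = {r_above:+.3f}")
print(f"x <= k:     r = {r_below:+.3f}")
# Typically max(|r_above|, |r_below|) exceeds |r_all|, even though the
# data are pure noise: subsetting lets you pick a stronger-looking result.
```

Rerunning with different seeds shows the same pattern: one of the two subsets usually beats the full-sample correlation by chance alone.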

Thus, your procedure introduces selection bias.

This is similar to the problem of p-hacking.


Posted 2018-08-03T05:36:44.813

Reputation: 2 721


I think you may be seeing a sample-size effect on your correlation coefficient: restricting to $x > k$ shrinks the sample, and sample correlations computed from fewer points fluctuate more widely around zero. You may find the following references useful:
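The sample-size effect can be sketched numerically: for independent variables, the spread of the sample correlation coefficient grows as the sample shrinks, so small subsets readily show large correlations by chance. A hypothetical simulation (sample sizes and trial count chosen only for illustration):

```python
# For independent x and y, track the largest |r| seen across many draws
# at several sample sizes; smaller n yields larger spurious correlations.
import numpy as np

rng = np.random.default_rng(1)

def max_abs_corr(n, trials=2000):
    """Largest |r| observed over many draws of n independent (x, y) pairs."""
    rs = []
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rng.normal(size=n)
        rs.append(abs(np.corrcoef(x, y)[0, 1]))
    return max(rs)

for n in (10, 50, 200):
    print(f"n = {n:4d}: max |r| over 2000 trials = {max_abs_corr(n):.3f}")
```

The maximum spurious $|r|$ shrinks steadily as $n$ grows, which is why a thresholded subset can look more correlated than the full data.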


Posted 2018-08-03T05:36:44.813

Reputation: 3 728