Pearson correlation method using absolute values and relative values


I have a dataset with election results and crime rates per city. For each variable I have an absolute value (e.g. total votes, total crimes) and a relative value (e.g. percentage share of votes).

I want to calculate the correlation coefficient for some of these variables, but I am unsure which values to use: the relative ones or the absolute ones.

First I calculated z-scores for the absolute values and then calculated the correlation in Excel. To corroborate the results, I also used pandas.DataFrame.corr() and pearsonr from scipy.stats in Python.

For example, if I use absolute values I get a positive correlation between candidate 1 and candidate 2:

from scipy.stats import pearsonr

# Absolute vote counts per city for each candidate
x = df['Abs Cand 1'].tolist()
y = df['Abs Cand 2'].tolist()

print(pearsonr(x, y))
(0.95209664861187004, 0.0)

However, if I use the relative values I get a negative correlation:

x = df['Rel Cand 1'].tolist()
y = df['Rel Cand 2'].tolist()

print(pearsonr(x, y))
(-0.99704737036262991, 0.0)

I was confused when I saw both results, and I would appreciate some guidance on why they differ.

Thanks in advance!


Posted 2016-10-05T19:03:58.390


Are candidate1 and candidate2 the votes for each candidate? In this case the absolute values are probably positively correlated because votes simply increase for both candidates with the size of the city, and the relative values are negatively correlated because candidate1 = 100% - candidate2. Wouldn't you want to know the correlation between crime rates and votes for one of the candidates? – oW_ – 2016-10-05T19:08:32.730

@oW_ Actually, that's the main idea: crime rates vs. votes. But I got stuck when I saw those differences. For example, when comparing crime rates and votes, which kind of value would be best? – estebanpdl – 2016-10-05T19:33:51.413

Turning values into percentages is simply multiplying the values with a scalar. That should not change the correlation coefficient. – oW_ – 2016-10-05T19:44:48.500

That makes sense. Thanks, @oW_. So, in order to get a correlation coefficient for different elections, relative values would be a better choice, since absolute values are related to the size of the city. Would that be right? – estebanpdl – 2016-10-05T20:22:35.863



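In general, the correlation coefficient is "invariant to separate changes in location and scale in the two variables". That is, corr(aX + b, cY + d) = corr(X, Y) for any constants b, d and any positive constants a, c. In particular, you can mix relative with absolute values, as long as the relative values are obtained by dividing the whole variable by one fixed total.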
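
As a quick check of this invariance, here is a minimal sketch on synthetic data. The numbers and the helper name pearson_r are made up for illustration and are not taken from the question's dataset. Multiplying a whole column by one positive constant and adding an offset should leave Pearson's r unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic, made-up data: two positively related variables.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)


def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays (hypothetical helper)."""
    return np.corrcoef(a, b)[0, 1]


r_original = pearson_r(x, y)

# One global change of scale and location per variable, e.g. expressing a
# whole column as a percentage of a single fixed total.
r_rescaled = pearson_r(100.0 * x + 3.0, 0.01 * y - 7.0)

print(r_original, r_rescaled)  # the two values agree up to rounding error
```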

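However, that only works if you scale each variable globally, that is, with the same constant for every data point. Dividing each city's votes by that city's own total applies a different factor to every data point, so the invariance no longer applies. If this were a county-wide election, you could scale the city values by the county population.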
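
The effect described in the question can be reproduced with a small simulation. This is only a sketch under assumed conditions: the city sizes, vote shares, and variable names below are invented, with a small share going to other options so that the two candidates' percentages do not sum exactly to 100%. Absolute counts for both candidates grow with city size, while per-city shares compete with each other:

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
n_cities = 500

# Made-up city sizes (total votes cast), spanning several orders of magnitude.
total_votes = rng.lognormal(mean=10.0, sigma=1.5, size=n_cities)

# Made-up vote shares: candidate 1, a small remainder, and candidate 2.
share_1 = rng.uniform(0.35, 0.60, size=n_cities)
share_other = rng.uniform(0.00, 0.05, size=n_cities)
share_2 = 1.0 - share_1 - share_other

# Absolute counts are the shares multiplied by a different factor per city.
abs_1 = total_votes * share_1
abs_2 = total_votes * share_2

r_abs = np.corrcoef(abs_1, abs_2)[0, 1]      # driven mostly by city size
r_rel = np.corrcoef(share_1, share_2)[0, 1]  # driven by share_1 + share_2 being close to 1

print(r_abs, r_rel)
```

In this setup the absolute correlation comes out strongly positive and the relative one strongly negative, which matches the pattern in the question without either result being wrong: they answer different questions.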

But it sounds like your crime rates are on a per-city level. In this case you should scale the votes on a per-city level as well to make them comparable. This will change the correlation coefficient and give a different result than with absolute values. I think using percentages is more intuitive in your case.
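
A rough sketch of that per-city scaling with pandas follows. The five rows are invented numbers, and every column name except 'Abs Cand 1' and 'Rel Cand 1' is an assumption about how the data might be laid out:

```python
import pandas as pd

# Made-up numbers for five hypothetical cities; 'Population', 'Total crimes'
# and 'Total votes' are assumed column names, not taken from the question.
df = pd.DataFrame({
    "Population":   [12000, 250000, 43000, 900000, 67000],
    "Total crimes": [90, 4100, 310, 21000, 520],
    "Total votes":  [7000, 130000, 25000, 410000, 38000],
    "Abs Cand 1":   [3900, 52000, 13500, 150000, 20500],
})

# Put both variables on the same per-city footing.
df["Crime rate"] = df["Total crimes"] / df["Population"]
df["Rel Cand 1"] = df["Abs Cand 1"] / df["Total votes"]

# Series.corr uses the Pearson method by default.
r_relative = df["Crime rate"].corr(df["Rel Cand 1"])
r_absolute = df["Total crimes"].corr(df["Abs Cand 1"])  # city size affects both columns

print(r_relative, r_absolute)
```

Comparing r_relative with r_absolute on real data shows how much of the absolute correlation is simply shared city size.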


Posted 2016-10-05T19:03:58.390
