How to get the correlation between two categorical variables, and between a categorical and a continuous variable?

I am building a regression model and I need to calculate the following to check for correlations:

  1. Correlation between two multi-level categorical variables
  2. Correlation between a multi-level categorical variable and a continuous variable
  3. VIF (variance inflation factor) for multi-level categorical variables

I believe it's wrong to use the Pearson correlation coefficient for the above scenarios, because Pearson correlation only works for two continuous variables.

Please answer the following questions:

  1. Which correlation coefficient works best for the above cases?
  2. VIF calculation only works for continuous data, so what is the alternative? (See the sketch after this list.)
  3. What assumptions do I need to check before using the correlation coefficient you suggest?
  4. How do I implement them in SAS & R?
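For question 2, a widely used option is the generalized VIF (GVIF) of Fox & Monette, which the vif() function in the car package reports automatically when a model contains factor terms. A minimal sketch, assuming the car package is installed; the data frame and variable names below are made up for illustration:

library(car)

# hypothetical data: numeric response y, numeric predictor x, 3-level factor g
df = data.frame(y = rnorm(100),
                x = rnorm(100),
                g = factor(sample(letters[1:3], 100, replace=T)))

fit = lm(y ~ x + g, data=df)
vif(fit)  # for factor terms, reports GVIF and GVIF^(1/(2*Df))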

GeorgeOfTheRF

Posted 2014-08-03T13:07:24.143

Reputation: 1 608

I'd say CV.SE is a better place for questions about more theoretical statistics like this. If not, I'd say that the answer to your questions depends on context. Sometimes it makes sense to flatten multiple levels into dummy variables; other times it's worth modeling your data according to a multinomial distribution, etc.

– ffriend – 2014-08-03T14:00:11.460

Are your categorical variables ordered? If so, this can influence the type of correlation you want to look for. – nassimhddd – 2014-08-06T06:58:25.637

Do you mean the p-value is the same as the correlation coefficient r? – Ayo Emma – 2017-03-31T11:48:27.217

The solution above with ANOVA for categorical vs. continuous is good. Small hiccough: the smaller the p-value, the better the "fit" between the two variables, not the other way around. – myudelson – 2017-09-27T14:52:39.337

I have to face the same problem in my research, but I couldn't find the correct method to solve this issue. So if you can, please be kind enough to give me the references you have found. – user89797 – 2015-10-14T08:04:25.360

Answers

Two Categorical Variables

Checking whether two categorical variables are independent can be done with the Chi-Squared test of independence.

The idea of the test: if the two variables are independent, the expected count in each cell of the contingency table is the product of the corresponding row and column totals divided by the grand total. We then check how far the observed counts deviate from these expected counts.
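To make the expected counts concrete, here is a minimal sketch in R, using the same 2x2 table as the example below:

# expected count for cell (i, j) under independence:
#   (row total i) * (column total j) / grand total
tbl = matrix(c(55, 45, 20, 30), nrow=2, byrow=T)
outer(rowSums(tbl), colSums(tbl)) / sum(tbl)

# chisq.test computes the same matrix internally:
chisq.test(tbl, correct=F)$expected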

There is also Cramér's V, a measure of association that follows from this test.

Example

Suppose we have two variables:

  • gender: male and female
  • city: Blois and Tours

We observed the following data:

     Gender
City  M  F
   B 55 45
   T 20 30

Are gender and city independent? Let's perform a Chi-Squared test. Null hypothesis: they are independent; the alternative hypothesis is that they are dependent in some way.

Under the null hypothesis, we compute the expected counts from the marginal totals (row total times column total, divided by the grand total). So our expected values are the following:

     Gender
City  M  F
   B 50 50
   T 25 25

So we run the chi-squared test; the resulting p-value tells us how consistent the observed counts are with independence - the smaller the p-value, the stronger the evidence that the two variables are related.

To compute Cramér's V, we first find the normalizing factor chi-squared-max, which for a table with r rows and c columns is n * min(r - 1, c - 1) (for a 2x2 table this is just the sample size n), divide the chi-squared statistic by it, and take the square root:

V = sqrt(chi^2 / chi^2_max) = sqrt(chi^2 / (n * min(r - 1, c - 1)))
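A small helper implementing this for an arbitrary r x c table (a sketch; for the 2x2 tables below min(r - 1, c - 1) = 1, so V reduces to sqrt(chi2 / n)):

# Cramér's V for an r x c contingency table
cramers_v = function(tbl) {
  chi2 = chisq.test(tbl, correct=F)$statistic
  sqrt(as.numeric(chi2) / (sum(tbl) * (min(dim(tbl)) - 1)))
}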

R

# observed counts; rows = City (B, T), columns = Gender (M, F)
tbl = matrix(data=c(55, 45, 20, 30), nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))

# correct=F turns off Yates' continuity correction
chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)

Here the p-value is 0.08 - quite small, but still not enough to reject the hypothesis of independence at the usual 0.05 level. (Note that the p-value itself is not a correlation coefficient.)

We also compute V:

sqrt(chi2$statistic / sum(tbl))  # for a 2x2 table, chi2_max = n = sum(tbl)

And get 0.14 (V ranges from 0 to 1; the smaller V is, the weaker the association).

Consider another dataset:

    Gender
City  M  F
   B 51 49
   T 24 26

For this dataset, the same computation gives the following:

tbl = matrix(data=c(51, 49, 24, 26), nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))

chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)

sqrt(chi2$statistic / sum(tbl))

The p-value is 0.72, which is far from significant, and V is 0.03 - very close to 0.

Categorical vs Numerical Variables

For this type we typically perform a one-way ANOVA test: we calculate the between-group variance and the within-group variance and then compare them via the F statistic (the ratio of the between-group mean square to the within-group mean square).

Example

We want to study the relationship between the fat absorbed by donuts and the type of fat used to produce them (the example is taken from here):

type1 type2 type3 type4
  164   178   175   155
  172   191   193   166
  168   197   178   149
  177   182   171   164
  156   185   163   170
  195   177   176   168

Is there any dependence between the variables? We conduct an ANOVA test and see that the p-value is just 0.007 - so we reject the null hypothesis of equal group means: the type of fat and the amount of absorbed fat are related.

R

# absorbed fat for six batches of donuts per fat type
t1 = c(164, 172, 168, 177, 156, 195)
t2 = c(178, 191, 197, 182, 185, 177)
t3 = c(175, 193, 178, 171, 163, 176)
t4 = c(155, 166, 149, 164, 170, 168)

val = c(t1, t2, t3, t4)
# gl() builds the grouping factor: 4 levels, 6 observations each
fac = gl(n=4, k=6, labels=c('type1', 'type2', 'type3', 'type4'))

aov1 = aov(val ~ fac)
summary(aov1)

Output is

            Df Sum Sq Mean Sq F value  Pr(>F)   
fac          3   1636   545.5   5.406 0.00688 **
Residuals   20   2018   100.9                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So here, too, we can take the p-value as evidence of a relationship between the variables (again, the smaller the p-value, the stronger the evidence; it is not itself a correlation coefficient).
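If an actual effect-size measure is wanted (the analogue of Cramér's V for this case), one common choice, not part of the original answer, is eta-squared: the share of the total variance explained by the grouping, computed from the ANOVA table above:

# eta-squared = SS_between / SS_total
ss = summary(aov1)[[1]][['Sum Sq']]
ss[1] / sum(ss)  # about 0.45: fat type explains ~45% of the variance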

Alexey Grigorev

Posted 2014-08-03T13:07:24.143

Reputation: 2 460

Fantastic answer by @Alexey. I read up on polychoric/polyserial correlations online after reading your comment. They are techniques for estimating the correlation between two latent variables from two observed variables. I don't think that is what you asked for, and it is not comparable to Alexey's answer. – KarthikS – 2016-10-03T05:22:17.220

Your first example is NOT about categorical vs. categorical; rather, it is categorical vs. numerical. In fact, you are looking at city against the number of males (females, respectively), which is numerical. Categorical vs. categorical would be, say, city vs. colour of the eyes or shapes or anything else, but by no means the number of representatives of a gender. – gented – 2017-03-13T18:35:22.113

Thanks Alexey for the details. Based on more research, I found out about polyserial and polychoric correlation. How is your approach better than these? Please explain. – GeorgeOfTheRF – 2014-08-17T09:58:57.173

I'm not aware of these things, sorry. – Alexey Grigorev – 2014-08-17T14:04:27.123

@AlexeyGrigorev If our data is not normally distributed, should Kruskal-Wallis be used instead of one-way ANOVA? Thanks in advance. – ebrahimi – 2018-08-29T19:28:33.703

Besides, could you please let me know whether, if the categorical feature is a binary one, the Mann-Whitney U test should be used? – ebrahimi – 2018-08-29T19:40:00.630

The Mann-Whitney U test is a non-parametric test. It can be used when the data are not normally distributed and the sample is too small to apply an independent-samples t-test (with the categorical variable being binary). – Loochie – 2020-03-01T15:55:03.527
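For reference, a minimal sketch of the two non-parametric tests mentioned in these comments, reusing val, fac, t1 and t2 from the answer above:

# Kruskal-Wallis: non-parametric alternative to one-way ANOVA (multi-level factor)
kruskal.test(val ~ fac)

# Mann-Whitney U (wilcox.test in R): two groups, i.e. a binary factor
wilcox.test(t1, t2)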