What measures can I use to find correlation between categorical features and binary label?

2

1

For analyzing numerical features, we have correlation. What measures do we have to analyse the relevance of a categorical feature to the target value? If there isn't a direct measure, how can we achieve this?

I know of the chi-squared test, but I can't find an implementation of it for categorical values. Another option is to label-encode the categories as numbers, but that imposes an arbitrary ordering in which higher-valued labels appear more important.

ProgramSpree

Posted 2019-05-08T05:07:30.093

Reputation: 33

Answers

3

Whether two categorical variables are independent can be checked with the chi-squared test of independence, which is a hypothesis test. If A and B are two categorical variables, the hypotheses are:

H0: A and B are independent

HA: A and B are not independent

We create a contingency table that counts each combination of outcomes of the two variables. If the null hypothesis (H0) is true, each cell count should be close to the value implied by the row and column totals alone, that is, proportional to the product of its row total and its column total. We then check how far the observed counts are from those expected counts.

For example, suppose we have two variables in the dataset:

  • Obesity: Obese, Not obese
  • Marital Status: Married, Cohabiting, Dating

We observe the following data, where Oij is the observed count in cell (i,j):

             |    dating   |   married    |  cohabiting  |  Total       |
  -----------|------------:|:------------:|:------------:|:------------:|
  Obese      |      81     |     147      |   103        |   331        | 
  Not obese  |      359    |     277      |   326        |   962        |
  Total      |      440    |     424      |   429        |   1293       |  

Expected counts, i.e. the counts we would expect if H0 were true:

For a cell (i,j): Eij = row i total * column j total / table total

             |    dating   |   married    |  cohabiting  |
  -----------|------------:|:------------:|:------------:|
  Obese      |      113    |     109      |   110        | 
  Not obese  |      327    |     315      |   319        |
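
As a rough cross-check of the expected counts, here is a minimal Python sketch that uses only the standard library. The function name `expected_counts` and the list-of-lists layout are illustrative choices, not part of the original answer; the numbers are the observed counts from the table above.

```python
# Observed counts (rows: Obese, Not obese; columns: dating, married, cohabiting).
observed = [
    [81, 147, 103],
    [359, 277, 326],
]


def expected_counts(table):
    """Return E[i][j] = row i total * column j total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    # Print the unrounded expectations to two decimals.
    print([round(e, 2) for e in row])
```

The expected counts in each row and column add up to the same totals as the observed counts, which is a quick sanity check on the calculation.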

χ2 statistic calculation

Under independence, the observed cell counts should match the expected counts up to small deviations caused by sampling variability, so we compute the expected values under H0 and check how far the observed values are from them.

We use the standardized squared difference for that and compute the chi-squared statistic, which under H0 approximately follows a χ2 distribution with df=(n−1)⋅(m−1), where n and m are the numbers of categories of the first and second variable respectively. Here df=(2−1)⋅(3−1)=2.

\begin{equation} \chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \end{equation}

χ2 value comes out to be 30.829
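
The sum above can be sketched in plain Python as follows. This is only an illustration under the same assumptions as the text (no continuity correction, which is also what R's `chisq.test` applies for tables larger than 2x2); the function name `chi_square_statistic` is hypothetical.

```python
# Observed counts (rows: Obese, Not obese; columns: dating, married, cohabiting).
observed = [
    [81, 147, 103],
    [359, 277, 326],
]


def chi_square_statistic(table):
    """Return (statistic, df) for Pearson's chi-squared test of independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, o_ij in enumerate(row):
            # Expected count under H0 for cell (i, j).
            e_ij = row_totals[i] * col_totals[j] / grand_total
            statistic += (o_ij - e_ij) ** 2 / e_ij

    # Degrees of freedom: (rows - 1) * (columns - 1).
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


stat, df = chi_square_statistic(observed)
print(round(stat, 3), df)
```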

We can use R to find the p-value

tbl = matrix(data=c(81, 147, 103, 359, 277, 326), nrow=2, ncol=3, byrow=T)
dimnames(tbl) = list(Obesity=c('Obese', 'Not obese'), Marital_status=c('Dating', 'Married','Cohabiting'))
chi_res = chisq.test(tbl)
chi_res
        Pearson's Chi-squared test
data:  tbl
X-squared = 30.829, df = 2, p-value = 2.021e-07

Since the p-value is below 0.05, we reject the null hypothesis and conclude that obesity and marital status are not independent.
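
The p-value reported by R can be cross-checked without a statistics library, because the upper tail of a χ2 distribution with an even number of degrees of freedom has a closed form. The sketch below assumes that special case only (it does not handle odd df), takes the statistic 30.829 and df = 2 from the output above, and uses a hypothetical helper name `chi2_sf_even_df`.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable with even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only covers positive, even df")
    half = x / 2.0
    term = 1.0   # k = 0 term of the sum
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Statistic and degrees of freedom taken from the R output above.
p_value = chi2_sf_even_df(30.829, 2)
print(f"{p_value:.3e}")
```

For df = 2 the sum has a single term, so the p-value reduces to exp(−χ2/2).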

There is also Cramér's V, a measure of association strength that is derived from the statistic of this test.

Putting the values into the formula, in R:

sqrt(chisq.test(tbl)$statistic / (sum(tbl) * min(dim(tbl) - 1 )))

0.1544
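
Written out, the R expression above corresponds to the following formula, where N is the table total (1293) and n and m are the numbers of rows and columns (2 and 3):

\begin{equation} V = \sqrt{\frac{\chi^2}{N \cdot \min(n-1,\, m-1)}} = \sqrt{\frac{30.829}{1293 \cdot 1}} \approx 0.1544 \end{equation}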

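So we can say there is a weak association between obesity and marital status. Cramér's V ranges from 0 to 1 and carries no sign, so it measures the strength of the association, not its direction.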

I hope I am clear with the explanation.

Curiousdeveloper

Posted 2019-05-08T05:07:30.093

Reputation: 46

Everything looks good! But shouldn't the degree of freedom be 5? The degree of freedom is k-1 where k is the number of observations. So in this example df = 6 - 1 =5. Please correct me if I am wrong. – user 3317704 – 2020-02-04T22:13:51.680