We can check whether two categorical variables are independent with the chi-squared test of independence, which is a hypothesis test. Let A and B be two categorical variables; then our hypotheses are:

H0: A and B are independent

HA: A and B are not independent

We create a contingency table that counts each combination of outcomes of the two variables. If the null hypothesis (H0) is true, the cell counts should be close to the counts we would expect from the marginal totals alone; the test then measures how far the observed counts are from those expected counts.

**For Example**
Suppose we have two variables in the dataset

- Obesity: Obese, Not obese
- Marital Status: Married, Cohabiting, Dating

We observe the following data, where Oij denotes the observed count in cell (i, j):

|           | Dating | Married | Cohabiting | Total |
|-----------|-------:|--------:|-----------:|------:|
| Obese     |     81 |     147 |        103 |   331 |
| Not obese |    359 |     277 |        326 |   962 |
| Total     |    440 |     424 |        429 |  1293 |

Expected counts calculation, i.e. the counts we would expect if H0 were true.

Eij for a cell (i, j) is

Eij = (row i total × column j total) / table total
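For instance, plugging the marginal totals into this formula for the (Obese, Dating) cell gives:

\begin{equation}
E_{11} = \frac{331 \times 440}{1293} \approx 112.6
\end{equation}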

|           | Dating | Married | Cohabiting |
|-----------|-------:|--------:|-----------:|
| Obese     |  112.6 |   108.5 |      109.8 |
| Not obese |  327.4 |   315.5 |      319.2 |
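The whole expected-count table is just the outer product of the row and column totals divided by the grand total. A quick NumPy sketch (variable names are my own):

```python
import numpy as np

# Observed counts: rows = (Obese, Not obese), columns = (Dating, Married, Cohabiting)
obs = np.array([[81, 147, 103],
                [359, 277, 326]])

# Expected counts under H0: outer product of the margins over the grand total
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
expected = np.outer(row_totals, col_totals) / obs.sum()
print(np.round(expected, 1))
```

Note that the expected counts sum to the same grand total as the observed counts.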

**χ² Statistic Calculation**

Assuming independence, we would expect the observed cell counts to be close to the expected counts, with small deviations due to sampling variability,
so we calculate the expected values under H0 and check how far the observed values are from them.

We use the standardized squared difference for that and calculate the chi-squared statistic, which under H0 follows a χ² distribution with df = (n−1)⋅(m−1), where n and m are the numbers of categories of the first and second variable respectively.
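In this example the first variable (obesity) has n = 2 categories and the second (marital status) has m = 3, so

\begin{equation}
\mathrm{df} = (2-1)\cdot(3-1) = 2
\end{equation}

Note that the degrees of freedom depend on the numbers of categories of each variable, not on the total number of cells.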

\begin{equation}
\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\end{equation}
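Written out term by term with the observed and expected counts from the tables above (with the expected counts rounded to one decimal, so the sum is slightly off the exact value):

\begin{equation}
\chi^2 = \frac{(81-112.6)^2}{112.6} + \frac{(147-108.5)^2}{108.5} + \frac{(103-109.8)^2}{109.8} + \frac{(359-327.4)^2}{327.4} + \frac{(277-315.5)^2}{315.5} + \frac{(326-319.2)^2}{319.2} \approx 30.8
\end{equation}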

The χ² value comes out to be 30.829.

We can use R to find the p-value

```r
tbl <- matrix(c(81, 147, 103, 359, 277, 326), nrow = 2, ncol = 3, byrow = TRUE)
dimnames(tbl) <- list(Obesity = c('Obese', 'Not obese'),
                      Marital_status = c('Dating', 'Married', 'Cohabiting'))
chi_res <- chisq.test(tbl)
chi_res
#>  Pearson's Chi-squared test
#>
#> data:  tbl
#> X-squared = 30.829, df = 2, p-value = 2.021e-07
```
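As a cross-check, SciPy's `chi2_contingency` gives the same statistic, degrees of freedom, and p-value (a sketch; SciPy applies Yates' continuity correction only to 2×2 tables, so the default settings match R's Pearson test here):

```python
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[81, 147, 103],
                [359, 277, 326]])

# Returns the statistic, p-value, degrees of freedom, and expected counts
chi2, p, dof, expected = chi2_contingency(obs)
print(chi2, dof, p)  # statistic ≈ 30.83, df = 2, p ≈ 2e-07
```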

Since the p-value < 0.05, we reject the null hypothesis and conclude that obesity and marital status are not independent.

There is also Cramér's V, a measure of association that follows from this test: V = √(χ² / (N · min(r−1, c−1))), where N is the table total and r and c are the numbers of rows and columns.

Putting the values into the formula, in R:

```r
sqrt(chisq.test(tbl)$statistic / (sum(tbl) * min(dim(tbl) - 1)))
#> 0.1544
```
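The same computation in Python, wrapped in a small helper (the function name is my own):

```python
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[81, 147, 103],
                [359, 277, 326]])

def cramers_v(table):
    """Cramér's V: sqrt(chi2 / (N * min(r - 1, c - 1))). Helper name is mine."""
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * min(r - 1, c - 1)))

print(round(cramers_v(obs), 4))  # ≈ 0.1544
```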

So we can say there is a weak association between obesity and marital status. (Cramér's V is non-negative, so it measures strength only; it carries no sign or direction.)

I hope the explanation is clear.

Everything looks good! But shouldn't the degree of freedom be 5? The degree of freedom is k-1 where k is the number of observations. So in this example df = 6 - 1 =5. Please correct me if I am wrong. – user 3317704 – 2020-02-04T22:13:51.680