Can I use a chi-square test to remove a particular variable from the model?



I have 5 variables - $v_{1}$, $v_{2}$, $v_{3}$, $v_{4}$, and $v_{5}$. All are categorical variables.

I conducted a $\chi^{2}$ (chi-square) test to understand the relationship between these variables. The test for $v_{3}$ and $v_{4}$ gives a p-value of 0, which I understand to mean that these two variables are dependent on each other. Should I remove one of these variables from further analysis?

What if the same $v_{3}$ and $v_{4}$ are independent of other variables, like $v_{1}$, $v_{2}$, and $v_{5}$?


Posted 2015-08-09T12:05:42.797

Reputation: 707


I suggest you read about multicollinearity.

– Aleksandr Blekh – 2015-08-10T04:50:12.297

What type of analysis do you intend to apply to the variables? This will affect the answer, as will the size of your data set. In some circumstances, highly correlated variables may contain useful discriminative information. Ultimately, run your analysis with the variable removed and with it left in, and see how the results differ. – image_doctor – 2015-08-10T08:24:15.340

@image_doctor - Thank you very much for your advice. My intention in doing the chi-square test was to check for collinearity between the variables. I got confused between the purposes of collinearity checks and the chi-square test, hence I raised the question. Thanks again. – Arun – 2015-08-10T10:36:31.123



What you described is the filter method, one of the three families of feature selection methods (the others being wrapper and embedded methods). The most common mistake is to treat feature selection as an independent step that is run once on the whole data set, and only afterwards decide which model to use. A more appropriate approach is to include the feature selection process inside your accuracy test (for example, within each cross-validation fold) and to try different kinds of feature selection. The best guide for feature selection is domain knowledge. If you don't have it, start with one of the three methods, such as the filter method.

What you need to do is compute the chi-square statistic or the mutual information between each of your features and your label column. This shows which features have little effect on your prediction, and you can remove those.
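As a rough illustration of that filter step, here is a minimal sketch using only the Python standard library. The toy data, the column names `v1` and `v2`, and the helper functions are all made up for the example. It scores each feature against the label with the Pearson chi-square statistic, Cramér's V (an effect size that, unlike a p-value, does not shrink to 0 just because the sample is large), and mutual information.

```python
from collections import Counter
from math import log, sqrt


def chi_square_stat(xs, ys):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n  # count expected under independence
            observed = joint.get((r, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(xs, ys):
    """Effect size in [0, 1]: 0 means no association, 1 a perfect one."""
    k = min(len(set(xs)), len(set(ys))) - 1
    return sqrt(chi_square_stat(xs, ys) / (len(xs) * k)) if k > 0 else 0.0


def mutual_information(xs, ys):
    """Mutual information in nats, estimated from observed frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    return sum(
        (count / n) * log(count * n / (row[x] * col[y]))
        for (x, y), count in joint.items()
    )


# Hypothetical toy data: v1 tracks the label exactly, v2 is unrelated to it.
label = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
features = {
    "v1": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "v2": ["a", "b", "a", "b", "a", "b", "a", "b"],
}

scores = {
    name: {
        "chi2": chi_square_stat(values, label),
        "cramers_v": cramers_v(values, label),
        "mutual_info": mutual_information(values, label),
    }
    for name, values in features.items()
}

# Rank features from most to least associated with the label.
ranking = sorted(scores, key=lambda name: scores[name]["cramers_v"], reverse=True)
print(ranking, scores)
```

Features at the bottom of the ranking are candidates for removal. The same functions can be applied to a pair of features (say $v_{3}$ and $v_{4}$) to measure how redundant they are with each other, which is a different question from how useful each one is for predicting the label.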

You can find brief articles about feature selection here and here.


Posted 2015-08-09T12:05:42.797

Reputation: 3 340


I agree with the answer by Tasos. Another approach you may wish to take is log-linear analysis of your categorical variables. This is useful whenever you are dealing with three or more categorical variables. Put simply, it is essentially regression applied to the cell counts of the contingency table, and it is very simple to employ.
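To give a feel for what that involves, here is a minimal sketch of the simplest log-linear model, the mutual-independence model, whose fitted cell counts have a closed form (the product of the marginal proportions times the sample size). The function name, the toy data, and the variable ordering are assumptions made for the illustration. A real analysis would normally fit and compare several nested models with interaction terms in a statistics package, typically as a Poisson regression on the cell counts.

```python
from collections import Counter
from math import log


def independence_deviance(rows):
    """Deviance (G^2) and degrees of freedom of the mutual-independence
    log-linear model. `rows` holds one tuple of category values per observation."""
    n = len(rows)
    n_vars = len(rows[0])
    cells = Counter(rows)
    margins = [Counter(r[j] for r in rows) for j in range(n_vars)]

    g2 = 0.0
    # Empty cells contribute nothing to G^2, so only observed cells are visited.
    for combo, observed in cells.items():
        expected = n
        for j, value in enumerate(combo):
            expected *= margins[j][value] / n
        g2 += 2 * observed * log(observed / expected)

    # df = (number of cells - 1) - (free parameters of the main effects)
    n_cells = 1
    for m in margins:
        n_cells *= len(m)
    df = n_cells - 1 - sum(len(m) - 1 for m in margins)
    return g2, df


# Hypothetical toy data: v3 and v4 always agree, v5 is unrelated to both.
v3 = ["a", "a", "a", "a", "b", "b", "b", "b"]
v4 = ["a", "a", "a", "a", "b", "b", "b", "b"]
v5 = ["x", "y", "x", "y", "x", "y", "x", "y"]

g2, df = independence_deviance(list(zip(v3, v4, v5)))
print(f"G^2 = {g2:.3f} on {df} degrees of freedom")
```

A deviance that is large relative to its degrees of freedom says the independence model fits poorly, so some interaction (here between $v_{3}$ and $v_{4}$) is needed. Adding interaction terms one at a time and watching the deviance drop is how log-linear analysis locates which variables are associated.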


Posted 2015-08-09T12:05:42.797

Reputation: 204