How to automate ANOVA in Python

3

3

I am at the dimensionality reduction phase of my model. I have a list of categorical columns and I want to find the correlation between each column and my continuous SalePrice column. Below is the list of column names:

categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
                       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
                       'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
                       'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir']

Because it's categorical vs. continuous, I've read that ANOVA is the best way to go, but I have never used it before and couldn't find a concise implementation of it in Python. I want to loop through the list and output the correlation between each column in it and the SalePrice column.

Andros Adrianopolos

Posted 2019-07-14T14:43:53.253

Reputation: 322

Answers

1

I am not sure ANOVA is the best or easiest way to find the correlation between these categorical features and your target. See this great post, which proposes several other methods alongside ANOVA. If you still want to use an ANOVA test or the Kruskal-Wallis H test, you need to know how it gives you that notion of correlation (it compares the variance within the groups defined by the categorical variable to the overall variance). It is nicely explained in that post:

ANOVA estimates the variance of the continuous variable that can be explained through the categorical variable. One needs to group the continuous variable using the categorical variable, measure the variance in each group, and compare it to the overall variance of the continuous variable. If the variance after grouping falls significantly, it means that the categorical variable can explain most of the variance of the continuous variable, and so the two variables likely have a strong association. If the variables have no correlation, then the variance in the groups is expected to be similar to the original variance.
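To make the quoted idea concrete, here is a minimal sketch that computes the share of variance explained by the grouping (often called eta squared). The helper name `eta_squared` and the toy data are made up for illustration; only the column names come from the question.

```python
import pandas as pd

def eta_squared(df, cat_col, target_col):
    """Share of the target's variance explained by grouping on cat_col (0 to 1)."""
    data = df[[cat_col, target_col]].dropna()
    grand_mean = data[target_col].mean()
    # Total variation of the continuous variable around its overall mean.
    ss_total = ((data[target_col] - grand_mean) ** 2).sum()
    groups = data.groupby(cat_col)[target_col]
    # Variation of the group means around the overall mean, weighted by group size.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    return ss_between / ss_total if ss_total > 0 else float("nan")

# Toy data (hypothetical values): CentralAir separates the prices well, Street does not.
toy = pd.DataFrame({
    "CentralAir": ["Y", "Y", "Y", "N", "N", "N"],
    "Street": ["Pave", "Grvl", "Pave", "Grvl", "Pave", "Grvl"],
    "SalePrice": [200, 210, 220, 100, 110, 120],
})

print(eta_squared(toy, "CentralAir", "SalePrice"))
print(eta_squared(toy, "Street", "SalePrice"))
```

A value near 1 means the within-group variance is small compared to the overall variance; a value near 0 means grouping barely changes it.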

Once you understand how it works, implementing and automating it is not difficult. In fact, both scipy and statsmodels provide ANOVA. Check this post out, where they demonstrate in detail how to perform an ANOVA test on an actual dataset and estimate the correlation between a categorical variable and a continuous target. It is just a matter of putting these pieces together and adapting them a bit to work with your own dataframe.
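As a rough sketch of what that automation could look like, the loop below runs a one-way ANOVA per column with `scipy.stats.f_oneway`. The function name `anova_by_column` and the toy dataframe are hypothetical; it assumes the raw (not one-hot-encoded) categorical columns and simply drops rows with missing values, which may or may not be what you want for columns like Fence or MiscFeature.

```python
import pandas as pd
from scipy import stats

def anova_by_column(df, categorical_columns, target="SalePrice"):
    """Run a one-way ANOVA of `target` against each categorical column."""
    rows = []
    for col in categorical_columns:
        data = df[[col, target]].dropna()
        # One array of target values per level of the categorical column.
        samples = [group[target].to_numpy() for _, group in data.groupby(col)]
        if len(samples) < 2:
            continue  # ANOVA needs at least two groups to compare
        f_stat, p_value = stats.f_oneway(*samples)
        rows.append({"feature": col, "F": f_stat, "p_value": p_value,
                     "n_levels": len(samples)})
    result = pd.DataFrame(rows, columns=["feature", "F", "p_value", "n_levels"])
    # Smallest p-value first: strongest evidence that group means differ.
    return result.sort_values("p_value").reset_index(drop=True)

# Toy data (hypothetical values) standing in for the real dataframe.
toy = pd.DataFrame({
    "CentralAir": ["Y", "Y", "Y", "N", "N", "N"],
    "Street": ["Pave", "Grvl", "Pave", "Grvl", "Pave", "Grvl"],
    "SalePrice": [200, 210, 220, 100, 110, 120],
})

result = anova_by_column(toy, ["CentralAir", "Street"])
print(result)
```

With the real data you would pass your own dataframe and `categorical_columns`. Swapping `stats.f_oneway` for `stats.kruskal` gives the Kruskal-Wallis H test with the same calling pattern.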

TwinPenguins

Posted 2019-07-14T14:43:53.253

Reputation: 3 728

What would you suggest for this if not ANOVA? – Andros Adrianopolos – 2019-07-15T08:20:03.943

Did you see that post? There were some suggestions. I personally do not have good experience with correlation between categorical and numerical variables; usually I end up training a model, e.g. a GBT, and looking at dependency plots such as SHAP values to draw correlation-like conclusions. – TwinPenguins – 2019-07-15T20:16:55.547

I did but it just gave a list of suggestions. I have decided to go with a 1-way ANOVA using Python but now I'm trying to figure out how to do that right after one-hot encoding my categorical variables. – Andros Adrianopolos – 2019-07-16T04:03:33.697

OK, this brings me to ask you why you do one-hot encoding, your ML model or..?! One-hot encoding is one of my no-go methods, of course depending on what model you wanna pick. Check this recent great post benchmarking alternative categorical encoding methods: https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8?gi=bbab8453212e.

– TwinPenguins – 2019-07-16T05:25:10.873

Why are you against OHE? – Andros Adrianopolos – 2019-07-16T08:40:18.557

Also noticed this post about correlation, https://dzone.com/articles/correlation-between-categorical-and-continuous-var-1, which is similar to one of those above.

– TwinPenguins – 2019-07-16T12:44:08.243

About OHE: Well, it is not personal :), it is just not suitable in many situations, but it is widely used since it is the easiest!! If you have high-cardinality categorical variables, for example, it adds one column per level to your feature space, and this has a significant impact on training a tree-based model like RandomForest or GBT. And you have a problem if one of the levels is not in the test set. If you search for pitfalls of OHE, you will learn that one needs to take extra care when using it. – TwinPenguins – 2019-07-16T12:47:36.423
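A small sketch of the mismatched-levels pitfall mentioned in the last comment, using `pandas.get_dummies`. The Heating values below are made up for illustration, and aligning with `reindex` is just one possible workaround, not the only one.

```python
import pandas as pd

# Hypothetical train/test splits whose Heating levels do not match.
train = pd.DataFrame({"Heating": ["GasA", "GasW", "Wall"]})
test = pd.DataFrame({"Heating": ["GasA", "Floor"]})

train_encoded = pd.get_dummies(train, columns=["Heating"], dtype=int)
test_raw = pd.get_dummies(test, columns=["Heating"], dtype=int)

# Encoded separately, the two frames end up with different columns.
print(list(train_encoded.columns))
print(list(test_raw.columns))

# Align test to the training columns: levels missing from test become 0,
# and levels seen only in test ("Floor") are dropped.
test_encoded = test_raw.reindex(columns=train_encoded.columns, fill_value=0)
print(test_encoded)
```

After alignment, the row with the unseen level is encoded as all zeros, so a model trained on the training columns can still score it.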