I am at the dimensionality reduction phase of my model. I have a list of categorical columns and I want to find the correlation between each column and my continuous `SalePrice` column. Below is the list of column names:

```
categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir']
```

Because it's categorical vs. continuous, I've read that ANOVA is the best way to go, but I have never used it before and couldn't find a concise implementation of it in Python. I want to loop through the list and output the correlation between each element and the `SalePrice` column.
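
For reference, the kind of loop I have in mind might look roughly like the sketch below. It is only an illustration under assumptions: it uses `scipy.stats.f_oneway` for the one-way ANOVA and reports eta squared (between-group sum of squares over total sum of squares) as a correlation-like strength, since ANOVA itself yields an F statistic and a p-value rather than a correlation. The helper name `anova_by_column` and the toy data are made up for the example, and rows with a missing category are simply dropped, which may not be appropriate if missing values carry meaning in your data.

```python
import pandas as pd
from scipy import stats


def anova_by_column(df, categorical_columns, target="SalePrice"):
    """Run a one-way ANOVA of `target` against each categorical column.

    Hypothetical helper for illustration; returns one row per column,
    sorted by eta squared (share of target variance explained by the column).
    """
    rows = []
    for col in categorical_columns:
        data = df[[col, target]].dropna()  # assumption: drop missing values
        # One array of target values per category level.
        groups = [g[target].to_numpy() for _, g in data.groupby(col)]
        if len(groups) < 2:
            continue  # ANOVA needs at least two levels
        f_stat, p_value = stats.f_oneway(*groups)
        # Eta squared = between-group sum of squares / total sum of squares.
        grand_mean = data[target].mean()
        ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
        ss_total = ((data[target] - grand_mean) ** 2).sum()
        rows.append({"column": col, "F": f_stat, "p_value": p_value,
                     "eta_squared": ss_between / ss_total})
    out = pd.DataFrame(rows, columns=["column", "F", "p_value", "eta_squared"])
    return out.sort_values("eta_squared", ascending=False).reset_index(drop=True)


# Tiny made-up frame, only to show the call; not real housing data.
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Grvl", "Pave", "Grvl"],
    "CentralAir": ["Y", "N", "Y", "N", "Y", "N"],
    "SalePrice": [200, 210, 100, 110, 205, 95],
})
result = anova_by_column(toy, ["Street", "CentralAir"])
print(result)
```

On the real data the call would be `anova_by_column(df, categorical_columns)`, with `df` being the DataFrame that holds the columns listed above. Note that this works on the raw categorical columns, so no one-hot encoding is needed beforehand.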

What would you suggest for this if not ANOVA? – Andros Adrianopolos – 2019-07-15T08:20:03.943

Did you see that post? There were some suggestions. I personally do not have good experience with correlation between categorical and numerical variables; usually I end up training a model, e.g. a GBT, and looking at dependency plots like SHAP values to draw correlation-like conclusions. – TwinPenguins – 2019-07-15T20:16:55.547

I did but it just gave a list of suggestions. I have decided to go with a 1-way ANOVA using Python but now I'm trying to figure out how to do that right after one-hot encoding my categorical variables. – Andros Adrianopolos – 2019-07-16T04:03:33.697

OK, this brings me to ask why you are doing one-hot encoding: is it for your ML model, or...? One-hot encoding is one of my no-go methods, depending of course on which model you want to pick. Check this recent great post benchmarking alternative categorical encoding methods: https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8?gi=bbab8453212e – TwinPenguins – 2019-07-16T05:25:10.873

Why are you against OHE? – Andros Adrianopolos – 2019-07-16T08:40:18.557

I also noticed this post about correlation, https://dzone.com/articles/correlation-between-categorical-and-continuous-var-1, similar to one of the above. – TwinPenguins – 2019-07-16T12:44:08.243

About OHE: well, it is not personal :), it is just not suitable in many situations, but it is widely used because it is the easiest option. If you have high-cardinality categorical variables, for example, it adds one column per level and inflates your feature space considerably, which has a significant impact on training a tree-based model like RandomForest or GBT. You also have a problem if one of the levels is not in the test set. If you search for pitfalls of OHE, you will learn more about the extra care one needs to take when using it. – TwinPenguins – 2019-07-16T12:47:36.423