## Removing Categorical Features in Linear Regression

3

This is more of a design question regarding linear regression. Here is some info on our dataset:

• Our dataset has 8 features, 3 of them categorical. We want to fit our target data with linear regression.
• We have tried including all 8 features (with the categorical ones integer-encoded) and running the regression.
• We have also tried taking the categorical features out and running a separate linear regression for each possible combination of the 3 categorical features. That of course results in a lot of regression runs (each categorical feature has 4 possible values, so 4³ = 64 to be exact).

The latter approach gave us better results: fixing the categorical features and creating a new dataset for each combination turned out to give better estimates.

What does this tell us about our data?

• Do the categorical features make the data non-linear, so they should indeed be taken out? For example, is there a complex relationship between categorical feature A and categorical feature B that linear regression can't capture?
• If I have to move beyond linear regression, which algorithm should I try? I would prefer one dataset with 8 features over 64 datasets with 5 features. There should be an algorithm that can capture this model.

What do you mean by 'better results'? A lower $R^2$? – Elias Strehle – 2018-03-05T17:29:06.283

You should think in scope of your research question: what am I trying to solve by using linear regression? Linear regression doesn’t care about the form of the predictor variables as long as your residuals are approximately iid Normally distributed. – Jon – 2018-03-05T21:08:52.340

What's wrong with ANOVA? It's considered a subset of multiple regression where the categorical factors are evaluated, depending on the specification, either against an intercept or against the base value of the factor. – DJohnson – 2018-03-08T17:08:01.303

I think it depends... if the question focuses on continuous variables, then you can discard the categorical features. But discarding features with no valid statistical reason is wrong. – mnm – 2020-09-07T00:21:51.720

2

I think using linear regression here is not a good option because:

1. It performs well mainly on numeric variables; categorical ones must first be converted to binary dummies.
2. It cannot handle missing data (you would typically have to drop those records).
3. A category that never appears in training cannot be predicted. For example, if a variable has 4 categories and the random training sample happens to contain only 3 of them, a test record with the 4th category will throw an error. So be careful how the data is divided between train and test.
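A quick sketch of that third point, assuming scikit-learn (category values are made up): `OneHotEncoder` with `handle_unknown="ignore"` sidesteps the error by encoding an unseen category as all zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training data happens to contain only 3 of the 4 categories.
train = np.array([["a"], ["b"], ["c"], ["a"]])
test = np.array([["d"]])  # the 4th category appears only at test time

# handle_unknown="ignore" encodes an unseen category as all zeros
# instead of raising an error at predict time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
print(enc.transform(test).toarray())  # -> [[0. 0. 0.]]
```

The all-zero row means the model gets no signal for the unseen category, but at least prediction does not crash.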

Now, what are the other algorithms which are available:

1. Random Forest (some implementations cannot handle a categorical variable with more than ~52 categories; in your case that shouldn't be an issue)
2. XGBoost
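As a hedged sketch of the tree-ensemble route, with scikit-learn's `RandomForestRegressor` standing in for either option and entirely synthetic data mirroring the question's 5 numeric + 3 categorical features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# 5 numeric features plus 3 integer-coded categorical features (4 levels each);
# all names and relationships here are made up for illustration.
X_num = rng.normal(size=(n, 5))
X_cat = rng.integers(0, 4, size=(n, 3))
X = np.hstack([X_num, X_cat])
# Target depends on one numeric feature and, non-linearly, on one category.
y = X_num[:, 0] + np.where(X_cat[:, 0] == 2, 3.0, 0.0) + rng.normal(scale=0.1, size=n)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # in-sample R^2
```

Trees split on thresholds, so they can isolate the `category == 2` effect that a single linear coefficient on the integer code cannot.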

Do let me know if you need any additional explanations.

1

Sounds like you have a lot of complex categorical variables in your model. Here's what I would do to see which ones are significant and which are not. For each of the 3 categorical variables, you only need 3 binary variables to represent the 4 options: if all 3 binary variables are 0, the fourth category is implied, which simplifies the model a little. Here's what I would do:

1) Run a regression model for each categorical variable using its binary variables. You'll have 3 models in total.

2) Run these models with backwards stepwise regression. Analyze the three models to look for similarities or patterns; maybe something will jump out at you.

3) After all three models have been run with this selection method, run a final regression with backwards stepwise selection using only the significant variables from the previous runs.

This will leave you with a final model (and results) without cluttering the regression with all 64 combinations of the categorical variables. If the significant variables from this are too cumbersome, maybe only discuss or highlight the most significant independent variables or trim it down some other way.
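Classic backwards stepwise selection drops variables by p-value and is usually done in a stats package; as a rough scikit-learn analogue, `RFE` eliminates features by coefficient magnitude instead (synthetic data, made-up feature layout):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# 9 columns, e.g. 3 categorical variables x 3 dummies each (hypothetical layout).
X = rng.normal(size=(200, 9))
# Only columns 0 and 4 actually drive the target.
y = 2 * X[:, 0] - 1.5 * X[:, 4] + rng.normal(scale=0.1, size=200)

# RFE repeatedly drops the weakest feature (smallest |coefficient|),
# a rough analogue of backwards stepwise selection.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(np.flatnonzero(selector.support_))  # -> [0 4]
```

Note this selects by coefficient size, not statistical significance; for genuine p-value-based stepwise selection a package like statsmodels is the usual tool.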

Good luck, let us know how it goes!

1

Encoding categorical variables as integers is generally bad for linear regression, because the model interprets the codes as ordered magnitudes: it assumes the effect of category 2 is twice that of category 1, and so on, which is not necessarily true. It isn't surprising that you got bad results.

A better approach is to encode your categories with dummy variables. Let's say your categorical variables are C1, C2, and C3, each taking values from 1 to 4. Then we can have twelve 0/1 dummy variables, one for each possible value of each categorical variable. For any input, exactly three dummy variables will be 1 (one per categorical variable) and the rest will be 0.

Your linear regression now looks like:

$$\hat{y}=a_1*d_{C11}+a_2*d_{C12}+a_3*d_{C13}+a_4*d_{C14} + a_5*d_{C21}+a_6*d_{C22}+a_7*d_{C23}+a_8*d_{C24} + a_9*d_{C31}+a_{10}*d_{C32}+a_{11}*d_{C33}+a_{12}*d_{C34} + a_{13}*x_1+a_{14}*x_2+a_{15}*x_3+a_{16}*x_4+a_{17}*x_5$$

where $x_1$ through $x_5$ are your numerical inputs.

If a given input has C1=1, C2=4, and C3=3, for example, then the dummy terms would reduce to:

$\hat{y}=a_1+a_8+a_{11}+a_{13}*x_1+a_{14}*x_2+a_{15}*x_3+a_{16}*x_4+a_{17}*x_5$

It's also possible to do the same thing with 64 dummy variables for each possible combination of the categorical variables as you were doing, but in a single linear regression as above.
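A minimal sketch of this single combined regression, assuming scikit-learn and randomly generated data (the 12 dummy columns plus 5 numeric columns give the 17 coefficients above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 400
cats = rng.integers(0, 4, size=(n, 3))   # C1, C2, C3, coded 0..3
nums = rng.normal(size=(n, 5))           # x1 .. x5

# 12 dummy columns (4 per categorical variable) + 5 numeric columns = 17 features.
dummies = np.hstack([(cats[:, j, None] == np.arange(4)).astype(float) for j in range(3)])
X = np.hstack([dummies, nums])

# Synthetic target built from random true coefficients plus small noise.
y = dummies @ rng.normal(size=12) + nums @ rng.normal(size=5) + rng.normal(scale=0.05, size=n)

r2 = LinearRegression().fit(X, y).score(X, y)
print(r2)  # close to 1 on this synthetic data
```

Each row of `dummies` has exactly three 1s, one per categorical variable, matching the equation above.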

If you are still not getting good results with linear regression, then consider using Gradient Boosting Regression Trees.
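For example, with scikit-learn's `GradientBoostingRegressor` on synthetic data (XGBoost or LightGBM would work similarly):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem with 17 features, as in the model above.
X, y = make_regression(n_samples=500, n_features=17, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print(gbr.score(X_te, y_te))  # held-out R^2
```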

what do the a variables stand for? Also is this one-hot encoding? – mLstudent33 – 2019-04-22T10:46:09.093

Also wouldn't there be 3 dummy variables that are "1" and 9 that are "0" for C1, C2 and C3 each with 4 categories? Ie. if there were C1, C2, C3 and C4, we would end up with four dummy variables that are "1" with the rest "0" – mLstudent33 – 2019-04-22T11:12:49.333

Yes, the features here are the concatenated one-hot encodings of the categorical variables. The reason I have 17 variables is because the original question specifies 3 categorical variables each taking 4 possible values and 5 numerical variables, so 3*4+5 = 17. – Imran – 2019-04-23T21:54:01.337

1

What I would do to optimise the performance of linear regression:

1. One-hot encode the categorical features
2. Scale the data (subtract the mean, divide by the standard deviation) so that high-variance features don't dominate the next step
3. Use PCA to reduce the dimensionality of the scaled data
4. Train the regression model on the reduced dataset
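These steps can be sketched as a scikit-learn `Pipeline` (column indices, sizes, and the data itself are made up; the scaler runs before PCA here, which is the usual order):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300
# Columns 0-4 numeric, columns 5-7 categorical with 4 levels each (hypothetical).
X = np.hstack([rng.normal(size=(n, 5)), rng.integers(0, 4, size=(n, 3))])
y = X[:, 0] + (X[:, 5] == 1) * 2.0 + rng.normal(scale=0.1, size=n)

pre = ColumnTransformer(
    [
        ("num", StandardScaler(), [0, 1, 2, 3, 4]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), [5, 6, 7]),
    ],
    sparse_threshold=0,  # force dense output so PCA can consume it
)
pipe = Pipeline([
    ("pre", pre),
    ("pca", PCA(n_components=12)),  # 17 encoded columns reduced to 12
    ("reg", LinearRegression()),
])
pipe.fit(X, y)
r2 = pipe.score(X, y)
print(r2)
```

Packaging the steps in a `Pipeline` also ensures the scaler and PCA are fit only on training data when you cross-validate.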

If you have enough data (say, more than 10k examples), you could even train a neural network to capture complex relationships between features that linear regression wouldn't.

PCA is a feature extraction method and obfuscates the original features. Furthermore, one-hot encoding will blow up the number of columns to analyse. – mnm – 2020-09-07T00:15:37.403

1

We have tried including all of our 8 features (categorical ones being encoded in integer) and doing the linear regression.

If it is not dummy encoding and your categories have no natural order, that is wrong. For example, mapping [apples, bananas, strawberries] -> [0, 1, 2] will be incorrect for almost all tasks.
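A small numeric demonstration of why, assuming scikit-learn: with codes [0, 1, 2] the model is forced to fit a straight line through the category means, while dummy variables can fit them exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Category means are not monotone in the integer codes:
# apples=0 -> 5, bananas=1 -> 1, strawberries=2 -> 4.
codes = np.repeat([0, 1, 2], 100)
y = np.array([5.0, 1.0, 4.0])[codes]

# Integer encoding: one coefficient, forced linear trend over the codes.
codes_2d = codes.reshape(-1, 1)
r2_int = LinearRegression().fit(codes_2d, y).score(codes_2d, y)

# Dummy encoding: one coefficient per category, fits the means exactly.
dummies = (codes[:, None] == np.arange(3)).astype(float)
r2_dummy = LinearRegression().fit(dummies, y).score(dummies, y)
print(r2_int, r2_dummy)  # integer encoding fits poorly; dummies fit perfectly
```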

We have also tried taking out the categorical features and running the linear regression algorithm for each possible combination of our 3 categorical features. That of course yields in a lot of regression runs (each category has 4 possible values; so 64 to be exact).

This also needs revising: if some of your combinations contain very few cases, you cannot trust the results for them.

So,

Our dataset has 8 features; 3 of them being categorical. We are willing to perform linear regression to fit our target data.

Do dummy encoding. If you see a strict relationship between categories, you can also add an interaction feature on top of the dummy encoding:

| category A | category B |
| --- | --- |
| 1 | 1 |
| 2 | 0 |
| 2 | 3 |
| 1 | 1 |
| ... | ... |

Possibly category A and category B are strongly correlated when both of them take value 1. Create a new feature for this case.
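A minimal pandas sketch of such an engineered feature (column names and the flagged pattern are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 1], "B": [1, 0, 3, 1]})
# Hypothetical interaction feature: flag the rows where both
# categories take the suspected joint value (A == 1 and B == 1).
df["A1_and_B1"] = ((df["A"] == 1) & (df["B"] == 1)).astype(int)
print(df)
```

The new column can then be fed to the regression alongside the dummy variables.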

If I am forced to step out of linear regression, which algorithm should I try to apply? I would prefer to have one dataset with 8 features instead of having 64 datasets with 5 features. There should be an algorithm that can capture this model.

Forests and XGBoost, as mentioned earlier. For these you do not need one-hot or dummy encoding. By the way, even a simple decision tree may give you a beautiful picture of the relationships between categories and their influence on the target variable.
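For instance, with scikit-learn's `DecisionTreeRegressor` and `export_text` on synthetic data where only the first categorical feature matters:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
cats = rng.integers(0, 4, size=(300, 3))
# Only the first categorical feature drives the target (value 2 -> high).
y = np.where(cats[:, 0] == 2, 10.0, 0.0) + rng.normal(scale=0.1, size=300)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(cats, y)
# export_text prints the learned splits, making the category/target
# relationship easy to read off.
rules = export_text(tree, feature_names=["cat1", "cat2", "cat3"])
print(rules)
```

The printed rules show the tree isolating the influential category with two threshold splits.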

Try a simple neural network after dummy and one-hot encoding too.

1

All the answers mentioned are great, but here is what I (a noob) would do:

• Go with a random forest first to get the feature importances.

• Then run hierarchical clustering and plot the dendrograms using SciPy to see which columns the model treats as close to each other.

• After that, go with CatBoost to give your model the final touch. CatBoost is very effective when you have categorical data: it will handle all of it automatically (try it if you haven't).
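CatBoost aside, the first two bullets can be sketched with scikit-learn and SciPy; the data here is synthetic, with two near-duplicate columns so the clustering has something to find:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=300)  # near-duplicate of column 0
y = X[:, 0] + rng.normal(scale=0.1, size=300)

# Step 1: feature importances from a random forest.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)

# Step 2: hierarchical clustering on feature correlations; columns 0 and 1
# should merge first. (Pass `link` to scipy.cluster.hierarchy.dendrogram
# to draw the dendrogram.)
dist = 1 - np.abs(np.corrcoef(X, rowvar=False))
link = linkage(dist[np.triu_indices(8, k=1)], method="average")
print(link[0, :2])  # first merge joins the two correlated features
```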