
This is more of a design question regarding linear regression. Here is some info on our dataset:

- Our dataset has 8 features, 3 of which are categorical. We want to use linear regression to fit our target data.
- We have tried including all 8 features (with the categorical ones encoded as integers) and running the linear regression.
- We have also tried taking out the categorical features and running the linear regression separately for each possible combination of our 3 categorical features. That of course results in a lot of regression runs (each categorical feature has 4 possible values, so 4 × 4 × 4 = 64 runs to be exact).
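To make the difference between the encodings concrete, here is a minimal sketch on synthetic data (not our real dataset). It compares integer-coded categories against dummy (one-hot) coding in an ordinary least-squares fit. The data-generating process, the coefficient values, and the helper names `one_hot` and `r_squared` are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=(n, 5))              # 5 numeric features (synthetic)
cats = rng.integers(0, 4, size=(n, 3))   # 3 categorical features, levels 0..3

# Assumed ground truth: each level has its own offset, and the offsets
# are NOT ordered the same way as the integer codes.
level_effect = np.array([0.0, 3.0, -2.0, 1.0])
beta = np.array([1.0, -0.5, 2.0, 0.3, -1.2])
y = x @ beta + level_effect[cats].sum(axis=1) + rng.normal(scale=0.1, size=n)

def one_hot(codes, n_levels=4):
    """Dummy-code every column, dropping level 0 as the baseline."""
    cols = [(codes[:, j] == k).astype(float)
            for j in range(codes.shape[1])
            for k in range(1, n_levels)]
    return np.column_stack(cols)

def r_squared(design, target):
    """R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), design])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - resid.var() / target.var()

# Integer coding: one slope per categorical feature (8 columns in total).
r2_integer = r_squared(np.column_stack([x, cats]), y)
# Dummy coding: one offset per non-baseline level (5 + 9 columns).
r2_one_hot = r_squared(np.column_stack([x, one_hot(cats)]), y)

print(f"integer-coded R^2: {r2_integer:.3f}")
print(f"one-hot R^2:       {r2_one_hot:.3f}")
```

Integer coding forces the effect of a category to be a straight line in its arbitrary code (level 2 must sit exactly between levels 1 and 3), whereas dummy coding lets each level have its own offset while everything stays in one dataset.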

The latter approach gave us better results. So, fixing the categorical features and creating a new dataset for each combination turned out to give better estimations.

What does this tell us about our data?

- Do the categorical features make the data non-linear, so that they should indeed be taken out? For example, could there be a complex relationship (an interaction) between categorical feature A and categorical feature B that a linear regression on the raw features can't capture?
- If I am forced to step out of linear regression, which algorithm should I try to apply? I would prefer to have one dataset with 8 features instead of having 64 datasets with 5 features. There should be an algorithm that can capture this model.
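One way to keep a single dataset while still letting every combination of categories behave differently is to add interaction dummies. The sketch below is an illustration on synthetic data, under the assumption that each of the 64 combinations shifts the target by its own offset while the numeric slopes are shared. The variable names (`cell`, `cell_offset`, `main_effects`, `cell_dummies`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=(n, 5))              # 5 numeric features (synthetic)
cats = rng.integers(0, 4, size=(n, 3))   # 3 categorical features, levels 0..3

# Index of the category combination each row falls into (0..63).
cell = cats[:, 0] * 16 + cats[:, 1] * 4 + cats[:, 2]

# Assumed ground truth: every combination has its own offset,
# and the numeric slopes are the same in all combinations.
cell_offset = rng.normal(scale=2.0, size=64)
beta = np.array([1.0, -0.5, 2.0, 0.3, -1.2])
y = x @ beta + cell_offset[cell] + rng.normal(scale=0.1, size=n)

def r_squared(design, target):
    """R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), design])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - resid.var() / target.var()

# Main effects only: 3 dummies per categorical feature (9 columns).
main_effects = np.column_stack(
    [cats[:, j] == k for j in range(3) for k in range(1, 4)]
).astype(float)

# Full interaction: one dummy per combination, combination 0 as baseline.
cell_dummies = (cell[:, None] == np.arange(1, 64)).astype(float)

r2_main = r_squared(np.column_stack([x, main_effects]), y)
r2_cells = r_squared(np.column_stack([x, cell_dummies]), y)

print(f"main effects only R^2:    {r2_main:.3f}")
print(f"with interaction dummies: {r2_cells:.3f}")
```

This is still a linear regression, just with 5 + 63 columns instead of 8. If the numeric slopes also differ between combinations, products of each numeric feature with the combination dummies would be needed as well, and at that point the single model is essentially the same as the 64 separate fits.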

What do you mean by 'better results'? A lower $R^2$? – Elias Strehle – 2018-03-05T17:29:06.283

You should think in scope of your research question: what am I trying to solve by using linear regression? Linear regression doesn’t care about the form of the predictor variables as long as your residuals are approximately iid Normally distributed. – Jon – 2018-03-05T21:08:52.340

What's wrong with ANOVA? It's considered a subset of multiple regression where the categorical factors are evaluated, depending on the specification, either against an intercept or against the base value of the factor. – DJohnson – 2018-03-08T17:08:01.303

I think it depends... if the focus of the question is on the continuous variables, then you can discard the categorical features. But discarding features with no valid statistical reason is wrong. – mnm – 2020-09-07T00:21:51.720