This is more of a design question regarding linear regression. Here is some info on our dataset:
- Our dataset has 8 features, 3 of which are categorical. We want to use linear regression to fit our target.
- We have tried including all 8 features (with the categorical ones encoded as integers) in a single linear regression.
- We have also tried taking out the categorical features and fitting a separate linear regression on the remaining 5 features for each possible combination of values of the 3 categorical features. That of course results in a lot of regression runs (each categorical feature has 4 possible values, so 4^3 = 64 to be exact).
The latter approach gave us better results: fixing the categorical features and creating a separate dataset for each combination gave better estimates.
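To make the two setups concrete, here is a minimal sketch on synthetic data (not our real dataset). It assumes NumPy, 5 numeric features, and a made-up ground truth in which every combination of category values has its own intercept and slopes. The names `fit_sse`, `sse_pooled`, and `sse_split` are illustrative only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, levels = 6400, 4

# Synthetic stand-in data: 5 numeric features, 3 categorical features with 4 levels each.
X_num = rng.normal(size=(n, 5))
cats = rng.integers(0, levels, size=(n, 3))

# Assumed ground truth (hypothetical): each category combination has its own
# intercept and its own slopes for the numeric features.
intercepts = rng.normal(scale=3.0, size=(levels, levels, levels))
slopes = rng.normal(size=(levels, levels, levels, 5))
a, b, c = cats.T
y = (intercepts[a, b, c]
     + np.einsum("ij,ij->i", slopes[a, b, c], X_num)
     + rng.normal(scale=0.1, size=n))

def fit_sse(X, y):
    """Ordinary least squares with an intercept; returns the sum of squared residuals."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

# Setup 1: one regression on all 8 features, categoricals integer-encoded.
sse_pooled = fit_sse(np.column_stack([X_num, cats]), y)

# Setup 2: one regression on the 5 numeric features per category combination.
combos = list(itertools.product(range(levels), repeat=3))
sse_split = 0.0
for combo in combos:
    mask = np.all(cats == combo, axis=1)
    if not mask.any():
        continue  # no rows for this combination
    sse_split += fit_sse(X_num[mask], y[mask])

print(len(combos), sse_pooled, sse_split)
```

One caveat with this kind of comparison: measured on the training data, the per-combination fit can never be worse than the pooled one, because the pooled model is a restricted special case of it. A comparison on held-out data would be needed to tell a real improvement from extra flexibility.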
What does this tell us about our data?
- Do the categorical features make the data non-linear, so that they should indeed be taken out? For example, could there be a complex relationship between categorical feature A and categorical feature B that linear regression cannot capture?
- If I am forced to step outside linear regression, which algorithm should I try? I would prefer to have one dataset with 8 features instead of 64 datasets with 5 features each. There should be an algorithm that can capture this structure.