Decision tree regression to avoid multicollinearity in a regression model?

1

I read in comments a recommendation to use decision trees instead of linear models such as neural networks when the dataset has many correlated features, the idea being to avoid multicollinearity. A similar question has already been asked, but not really answered: https://stats.stackexchange.com/questions/137573/do-classification-trees-need-to-consider-the-correlation-between-attributes

or here: In supervised learning, why is it bad to have correlated features?

https://www.quora.com/Is-multicollinearity-a-problem-in-decision-trees#:~:text=Decision%20trees%20follow%20the%20non%20parametric%20approach.&text=Though%20single%20tree%20leads%20to,robust%20to%20the%20multi%20collinearity%20.

My problem: I have a dataset of about 30 columns. Ten columns have a high correlation with the target/dependent variable. The data are numerical. I would like to build a prediction (regression) model that includes all variables if possible.

One big problem is avoiding multicollinearity.

  • Is a decision tree regression model good when 10 features are highly correlated? (If I follow the answers in the links above it seems so, but there is no really good explanation there.)
  • Is there a scientific or mathematical explanation or recommendation (to use decision tree regression)?

martin

Posted 2020-07-13T19:05:12.000

Reputation: 319

"Problem: I have a dataset of about 30 columns. 10 columns have a high correlation with the target/dependent variable." High correlation with the target has nothing to do with modeling here. – Subhash C. Davar – 2020-07-24T23:44:38.827

Correlatedness between candidate features is a good basis for a classification problem. It is not clear what prompts you to opt for a decision tree model; that is based on linear modeling (lm). Linear regression should not be confused with simple linear models that are essentially based on data with correlated features. – Subhash C. Davar – 2020-07-25T00:00:55.833

For linear models it's important to know the correlated features and to handle them with the VIF. The background of this question is that I would like to predict numerical values but include all variables, i.e. not kick out any variable based on its VIF as one would for linear models (neural net, multiple regression). In the end I would like to play with all variables (features) to see the behavior when their values change. I'm looking at this from an engineering perspective; I'm not interested in just getting a good prediction. The target is a model for "playing" with features to derive information about how a component could behave. – martin – 2020-07-25T14:22:04.130
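(For reference, the VIF check martin mentions can be done with statsmodels; below is a minimal sketch in which synthetic columns stand in for the real dataset, and x2 is deliberately an almost exact copy of x1.)

```python
# Sketch: flagging multicollinearity with the VIF (statsmodels).
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                   # an independent feature
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const"))  # rule of thumb: VIF above ~5-10 flags collinearity
```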

With that information I'm going to change/optimize the structure of a component in engineering software. The data come from machine tests; I can't see all properties in the engineering software, so I optimize with real-world data plus the software. I hope I explained it well. That's the reason why I want to know: why can you use decision tree regression without kicking out correlated features, unlike a normal regression model or neural net? – martin – 2020-07-25T14:31:20.547

In your case, stepwise regression analysis could be useful. Decision tree modeling has a different orientation and a separate purpose. – Subhash C. Davar – 2020-07-25T14:33:21.843

Would you recommend building a decision tree regression model that includes all features (even the 10 highly correlated ones) for my purpose, on the grounds that multicollinearity has no effect on a decision tree regression model? – martin – 2020-07-25T14:49:10.097

What I understand from Josh's answer is that it is possible to do that. Sorry, it's a really strange question from my point of view. – martin – 2020-07-25T14:52:16.823

I do not know much about decision tree regression. On rethinking: given your problem, you can consider one of the six variants of GAM models to cope with your issue of including all variables in your study. Do not be scared of multicollinearity. If necessary, transform your data variables to safeguard against so-called multicollinearity. – Subhash C. Davar – 2020-07-25T15:29:47.683
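(As a loose illustration of the GAM suggestion, and not necessarily one of the six variants Subhash has in mind: a minimal sketch with the third-party pyGAM library on synthetic data, where each feature gets its own smooth term so that every variable stays in the model.)

```python
# Sketch: a generalized additive model with pyGAM (pip install pygam).
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# One smooth term per feature, so every variable stays in the model.
gam = LinearGAM(s(0) + s(1)).fit(X, y)
gam.summary()  # per-term significance and effective degrees of freedom
```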

Ah yes, I know about Poisson and some of these processes. Thank you very much, I am going to learn about that. What do you mean by transforming to safeguard? Something like min-max normalization? – martin – 2020-07-25T20:28:14.550

I do not know your data types; could you send a part of it, a sample? Right now I have "standardization" in mind to arrive at the best, valid model. – Subhash C. Davar – 2020-07-26T02:02:23.313
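(For reference, the two transformations being discussed here, min-max normalization and standardization, look like this in scikit-learn; the small array is a placeholder.)

```python
# Sketch: min-max normalization vs. standardization (scikit-learn).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # placeholder data

print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each column to zero mean, unit variance
```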


Answers

4

To answer your questions directly, first:

Is a decision tree regression model good when 10 features are highly correlated?

Yes, definitely. But even better than one decision tree is many decision trees (Random Forest, or gradient boosting; XGBoost is popular). I think you'd be well served by learning about how decision trees split and how they naturally deal with collinearity. Maybe try this video. Follow the logic down to the second tier of splits, and you'll see how the correlated variables suddenly become unimportant, because they are evaluated relative to the split above them.
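To make that concrete, here is a minimal sketch, assuming scikit-learn and synthetic data, in which a near-duplicate feature is fed to a single tree and to a random forest; accuracy does not suffer, because a split simply picks one of the two twins.

```python
# Sketch: tree-based regressors are untroubled by a duplicated feature.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.01, size=500)  # near-duplicate of x1
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + x3 + rng.normal(scale=0.1, size=500)

for model in (DecisionTreeRegressor(max_depth=4, random_state=0),
              RandomForestRegressor(n_estimators=200, random_state=0)):
    score = cross_val_score(model, X, y, cv=5).mean()  # mean R^2 across folds
    print(type(model).__name__, round(score, 3))
# Accuracy is unaffected by the duplicated column: each split simply
# uses whichever of the two twins happens to score best.
```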

Is there a scientific or mathematical explanation or recommendation (to use decision tree regression)?

The mathematical explanation of why collinearity is "bad" for linear models comes down to the coefficients and how you interpret them. One side effect is that collinearity can undermine the statistical significance of a variable, and it can flip coefficients in the wrong direction. It usually doesn't affect the accuracy of the model very much, but most people want linear models so that they can interpret the coefficients, and that interpretation is exactly what collinearity wrecks. I suggest reading maybe this article to start.
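Here is a small demonstration of that coefficient pathology, a sketch on synthetic data: the true effect belongs to one feature, its near-twin carries no extra information, and ordinary least squares splits the effect between them in an arbitrary, sample-dependent way.

```python
# Sketch: collinearity scrambles linear-model coefficients but barely hurts fit.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
for trial in range(3):  # refit on three fresh samples
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.01, size=100)      # nearly identical to x1
    X = np.column_stack([x1, x2])
    y = 3.0 * x1 + rng.normal(scale=0.1, size=100)  # true effect sits on x1 only
    fit = LinearRegression().fit(X, y)
    print(trial, np.round(fit.coef_, 2), round(fit.score(X, y), 3))
# The coefficients swing wildly (and can flip sign) from sample to sample,
# while R^2 stays high: prediction is fine, interpretation is not.
```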

One of the things you mentioned, "include all variables if possible", is not really something you should be concerned with. The goal of a model is to explain the most with the least. If you force as many variables as possible into the model, you may be fooled into thinking the model is good when in fact it isn't, as testing it on new data would reveal. In fact, sometimes fewer variables will give you a better model. This is exactly the kind of problem multicollinearity causes with linear models: you can't judge very well which variables are significant and which are not. Stepwise selection doesn't work very well when there are correlated features.

In general, I think decision trees, especially Random Forests, will be a good start for you. But remember not to force all of the variables into the model just for the sake of it. Experiment with using fewer variables and with manipulating the tree structure, such as leaf size and max depth. And as always, test your model on validation data and holdout data so that you don't overfit and fool yourself into thinking it's a strong model.
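One way to act on that advice, sketched with a hypothetical parameter grid: tune leaf size and depth by cross-validation on the training portion, and keep a holdout set untouched for the final check.

```python
# Sketch: tuning tree structure by cross-validation, scoring on a holdout set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=600)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 20]},
    cv=5,  # 5-fold cross-validation on the training portion only
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(round(grid.score(X_hold, y_hold), 3))  # honest R^2 on the untouched holdout
```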

Josh

Posted 2020-07-13T19:05:12.000

Reputation: 321

Thank you very much. Funny thing, I had already viewed some of your clips on YouTube. For this project I don't reduce variables (the target is not to reduce variables). It's an engineering project; at the end I check the behavior of the independent variables with respect to the dependent variable when they are changed. With that information I would like to improve the construction of this automotive component. – martin – 2020-07-14T07:23:23.297

So in other words, I could try different decision tree models without reducing the variables. Please correct me if I'm wrong. Thank you. – martin – 2020-07-14T07:31:43.867

Yes, a decision tree will naturally "reduce" the variables by simply not using them if splitting on them doesn't provide value. So you can claim you're "using all the variables", but in fact some of them will not impact the model result at all. – Josh – 2020-07-14T15:59:22.330

FYI, a good way to measure the behavior of the independent variables (after the model is built) is permutation importance. Essentially, it takes one variable at a time and shuffles it, thereby destroying its information. Then the model is scored on the holdout set and compared to the original model. How much worse the model gets when each variable is destroyed gives you a good idea of how important that variable is. This is better than relying on the algorithm itself to provide its own importance metric. – Josh – 2020-07-14T16:01:24.757
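(For reference, scikit-learn ships this technique as sklearn.inspection.permutation_importance; a minimal sketch on synthetic data:)

```python
# Sketch: permutation importance measured on a held-out set (scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500)  # last 2 cols: noise

X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the holdout score drops.
result = permutation_importance(model, X_hold, y_hold, n_repeats=10, random_state=0)
print(np.round(result.importances_mean, 3))  # big drop = important feature
```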

Thank you very much, these steps are very interesting. I didn't know about permutation importance. – martin – 2020-07-14T19:43:31.727

No problem. One of the reasons it's great is that you can use it for any model type, so you can compare the impact for a decision tree and a regression with it. – Josh – 2020-07-14T20:01:26.003