Can I add features that are parts of another feature?



I am building a model (implementing both logistic regression and XGBoost) to understand the importance/significance of each feature in whether a customer will repurchase, i.e. what matters most for customers to repurchase (I am more interested in inference than prediction).

My feature set looks like this: [income, age, price_product, discount, product_category, delivery_charge, gender, lifestyle, delivery_time_main (P_main), P1, P2, P3, P4 and so on]

P_main is something the customer sees and that may influence their decision to repurchase or not. P1 + P2 + P3 + P4 = P_main. We want to understand how important P1, P2, P3, and P4 are; since they are stages in the process, we could then infer which stage to focus on most if we want to improve repurchases.

Can the feature set contain both the parts (P1, P2, P3, P4) and their sum (P_main) as inputs to the model? Or does that introduce multicollinearity problems? I am removing features with high multicollinearity using the variance inflation factor (VIF) and then applying Lasso regression.

P_main as a sum of other features of interest

Bhavya Geethika

Posted 2019-05-28T17:59:34.560

Reputation: 31



First, I recommend NOT binning continuous variables in any way. The information lost and the numerical problems introduced rarely make it a good decision. This is especially true in your case, where binning would sacrifice a lot of degrees of freedom in your model (which are necessary if you want statistical power, i.e. rejecting the null when you should). If you want to account for non-linearities, either use splines or, better yet, use an algorithm that doesn't assume a linear relationship between (some function of) y and your predictors.
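For example, a spline term in a logistic regression is straightforward with statsmodels/patsy; the data below is simulated purely for illustration (a made-up U-shaped age effect):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
age = rng.uniform(18, 70, n)
# Simulated non-linear (inverted-U) effect of age on repurchase probability
eta = -0.002 * (age - 45) ** 2 + 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
df = pd.DataFrame({"repurchase": y, "age": age})

# bs() from patsy builds a B-spline basis directly inside the formula,
# so the non-linearity is modeled without discarding information via bins
res = smf.logit("repurchase ~ bs(age, df=4)", data=df).fit(disp=0)
```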

Second, I would be extremely cautious about doing any sort of statistical inference (determining the importance/significance of predictors) with machine learning algorithms that were created primarily (if not entirely) for prediction rather than explanation. In other words, I would stick with logistic regression in this case, since most ML algorithms don't allow for much, if any, proper statistical inference.

You are right to consider multicollinearity in this case. Severe collinearity will adversely affect p-values and confidence intervals, and make inference hard, if not impossible. This is why you need a hypothesis first as to what could plausibly influence your target, so you can carefully specify your model and check the assumptions of your chosen model (and fix or note any violations). Simply throwing a bunch of predictors into a model won't work here, since you actually care about interpreting the model and its statistical output.

To be honest, I recommend asking on Cross Validated (stats.stackexchange.com) for more information rather than here, since this site primarily focuses on prediction, not inference.


Posted 2019-05-28T17:59:34.560

Reputation: 1 615

Ok, thank you, I'll post on CV SE. So your recommendation is not to include P1, P2, P3, P4 and P_main in the same model? Can you please clarify whether these should go into two different models then? It's unclear what the bottom-line approach you are suggesting is. – Bhavya Geethika – 2019-05-29T20:20:18.157

Do you have an a priori hypothesis or theory suggesting that P1, P2, P3, etc. could influence your target? If yes, then include them in your logistic regression. Then check the model's assumptions to make sure it is correctly specified: this includes multicollinearity, linearity between your predictors and the predicted log-odds, and outliers. Then run the appropriate statistical tests to answer your questions about variable significance/effect on y. – aranglol – 2019-05-30T01:00:02.160

If two variables are collinear, drop one. Outliers can be adjusted through appropriate transformations or possibly removed (though this can be dangerous). If predictors are not linear in the log-odds, consider transformations. Finally, this reminds me: instead of logistic regression, maybe take a look at generalized additive models? I have not used them myself, but they generalize GLMs and are extremely flexible/powerful (and, I believe, they also let you answer your inference-related questions, like generalized linear models do). – aranglol – 2019-05-30T01:00:51.797


Yes, you can create additional features based on existing features.

In general, this is more powerful for logistic regression, which has a hard time with non-linear relationships (see Chris Albon, the author of "Machine Learning with Python Cookbook").

Other tricks:

  • Binning can work wonders with Logistic Regression where you categorize a measure and thus help the algorithm to split.

    • Bagging for Logistic Regression
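As a rough sketch of the binning idea (hypothetical income values; the bin labels are made up), categorizing a continuous measure and one-hot encoding the bins looks like this with pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
income = pd.Series(rng.lognormal(10, 0.5, 8), name="income")

# Categorize the continuous measure into quartile bins...
bins = pd.qcut(income, q=4, labels=["q1", "q2", "q3", "q4"])
# ...then turn the bins into indicator columns for the model
dummies = pd.get_dummies(bins, prefix="income")
```

Note the caveat in the other answer: binning discards information, so whether this actually helps is an empirical question.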




Posted 2019-05-28T17:59:34.560

Reputation: 901

Hi, thank you. Can you please elaborate on whether I should be concerned about multicollinearity when including the parts (P1, P2, P3, P4) as well as the sum (P_main), along with other features, in the same feature set input to the model? If not, why? – Bhavya Geethika – 2019-05-28T20:10:11.707

Hi, you're right: multicollinearity is created when adding such a feature. But multicollinearity is also created with one-hot encoding. It's a trade-off which can only be answered empirically (i.e. test it). Good luck! – FrancoSwiss – 2019-05-29T08:10:52.890
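To illustrate the one-hot point above (a minimal sketch with made-up category names): the full set of dummy columns always sums to one, so it is perfectly collinear with an intercept; dropping one reference level removes that.

```python
import pandas as pd

cats = pd.Series(["electronics", "apparel", "grocery", "apparel"],
                 name="product_category")

full = pd.get_dummies(cats)   # one indicator column per category
row_sums = full.sum(axis=1)   # every row sums to exactly 1 -> collinear
                              # with an intercept (the dummy-variable trap)

reduced = pd.get_dummies(cats, drop_first=True)  # drop the reference level
```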