I am building a model (implementing both logistic regression and XGBoost) to understand the importance/significance of each feature in whether a customer is going to repurchase, i.e. what matters for customers to repurchase (I am more interested in inference than in prediction).

My feature set looks like this: [income, age, price_product, discount, product_category, delivery_charge, gender, lifestyle, delivery_time_main (P_main), P1, P2, P3, P4 and so on]

P_main is something the customer sees and that may influence their decision to repurchase, and P1 + P2 + P3 + P4 = P_main. P1, P2, P3, P4 are stages in the process, so we also want to understand how important each of them is, in order to infer which stage to focus on most if we want to improve repurchases.

Can the feature set contain the parts (P1, P2, P3, P4) as well as their sum (P_main) as inputs to the model, or does that introduce multicollinearity problems? I am currently removing features with multicollinearity using the Variance Inflation Factor and then applying Lasso regression.
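To make the concern concrete, here is a minimal sketch (column names and data are made up, not from the actual dataset) showing that the VIF check itself flags the problem: because P_main is an exact linear combination of P1, P2, P3, P4, the design matrix is rank-deficient and the VIFs for those columns blow up.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "income": rng.normal(60, 15, n),              # in thousands, illustrative
    "age": rng.integers(18, 70, n).astype(float),
    "P1": rng.gamma(2.0, 1.0, n),
    "P2": rng.gamma(2.0, 1.0, n),
    "P3": rng.gamma(2.0, 1.0, n),
    "P4": rng.gamma(2.0, 1.0, n),
})
df["P_main"] = df[["P1", "P2", "P3", "P4"]].sum(axis=1)  # exact linear combination

X = sm.add_constant(df)   # compute VIFs with an intercept in the design
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vifs)  # P1..P4 and P_main come out huge/infinite; dropping one of them fixes it
```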

Ok, thank you, I'll post on CV SE. So your recommendation is not to include P1, P2, P3, P4 and P_main in the same model? Can you please clarify whether these should then go into two different models? It's unclear what bottom-line approach you are suggesting. – Bhavya Geethika – 2019-05-29T20:20:18.157

Do you have an a priori hypothesis or theory to suggest that P1, P2, P3, etc. could influence your target? If yes, then include them in your logistic regression. Then, check the model's assumptions to make sure the model is correctly specified. This includes things like multicollinearity, linearity between your predictors and the predicted log odds, and outliers. Then run the appropriate statistical tests to answer your questions on variable significance/effect on y. – aranglol – 2019-05-30T01:00:02.160
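For reference, a minimal sketch of that kind of inference-oriented fit, assuming statsmodels and purely synthetic data (the column names just mirror the question for illustration): the summary gives per-coefficient Wald tests and confidence intervals, and exponentiated coefficients read as odds ratios.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "income": rng.normal(60, 15, n),              # in thousands, illustrative
    "age": rng.integers(18, 70, n).astype(float),
    "P1": rng.gamma(2.0, 1.0, n),
    "P2": rng.gamma(2.0, 1.0, n),
    "P3": rng.gamma(2.0, 1.0, n),
    "P4": rng.gamma(2.0, 1.0, n),
})
# Synthetic repurchase flag, loosely driven by P1 and P3 so the example has some signal
eta = -1.0 + 0.6 * X["P1"] - 0.3 * X["P3"]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(int)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)   # P_main deliberately excluded
print(model.summary())        # coefficients, standard errors, Wald z and p-values
print(np.exp(model.params))   # odds ratios, more natural for inference
```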

If two variables are collinear, drop one. Outliers can be adjusted through appropriate transformations or possibly removed (though this might be dangerous). If predictors are not linear in the log odds, consider transformations. Finally, this reminds me: instead of logistic regression, maybe take a look at generalized additive models? I have not used them before, but they generalize GLMs and are extremely flexible/powerful (and, to my knowledge, they also let you answer inference-related questions like yours, much as generalized linear models do). – aranglol – 2019-05-30T01:00:51.797
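For completeness, a quick sketch of that GAM idea, assuming the pygam package (my choice of library, not something named in the thread) and toy data; the smooth terms s(i) relax the linearity-in-the-log-odds assumption separately for each feature.

```python
import numpy as np
from pygam import LogisticGAM, s

rng = np.random.default_rng(2)
X = rng.gamma(2.0, 1.0, size=(500, 4))      # stand-ins for P1..P4, illustrative only
eta = X[:, 0] ** 2 - 2.0                    # deliberately non-linear effect of "P1"
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-eta))).astype(int)

gam = LogisticGAM(s(0) + s(1) + s(2) + s(3)).fit(X, y)
gam.summary()   # effective degrees of freedom and approximate p-values per term
# Partial dependence plots per term can then show the shape of each stage's effect
```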