I am building a model (using both logistic regression and XGBoost) to understand how important/significant each feature is in determining whether a customer repurchases. The goal is to learn what drives repurchases, so I am more interested in inference than in prediction.
My feature set looks like this: [income, age, price_product, discount, product_category, delivery_charge, gender, lifestyle, delivery_time_main (P_main), P1, P2, P3, P4 and so on]
P_main is something that the customer sees and that may influence their decision to repurchase. P1 + P2 + P3 + P4 = P_main. We also want to understand how important P1, P2, P3 and P4 are individually, since they are stages in the process, so that we can infer which stage to focus on most if we want to improve repurchases.
Can the feature set contain the parts (P1, P2, P3, P4) as well as their sum (P_main) as inputs to the model? Or does that introduce multicollinearity problems? Currently I remove multicollinear features using the Variance Inflation Factor (VIF) and then apply Lasso regression.
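To make the concern concrete, here is a minimal sketch on synthetic data (the values are made up and the variable names are only illustrative, not my real data). It checks what happens to the design matrix when the sum is included alongside its parts, using only NumPy:

```python
import numpy as np

# Synthetic stand-ins for the four stage times; values are invented.
rng = np.random.default_rng(0)
n = 500
parts = rng.uniform(0.5, 3.0, size=(n, 4))   # columns play the role of P1..P4
p_main = parts.sum(axis=1)                   # P_main = P1 + P2 + P3 + P4

# Design matrices with an intercept: parts only, and parts plus the sum.
X_parts = np.column_stack([np.ones(n), parts])
X_all = np.column_stack([X_parts, p_main])

rank_parts = np.linalg.matrix_rank(X_parts)  # full column rank expected
rank_all = np.linalg.matrix_rank(X_all)      # one column is redundant


def r_squared(y, X):
    """R^2 from an ordinary least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


# VIF for a feature is 1 / (1 - R^2), where R^2 comes from regressing that
# feature on all the others. R^2 close to 1 means the VIF blows up.
r2_main = r_squared(p_main, X_parts)

print("columns:", X_all.shape[1], "rank:", rank_all)
print("R^2 of P_main on its parts:", r2_main)
```

If this sketch reflects the real data, the matrix with all five delivery-time columns has one fewer independent column than it has columns, so the logistic regression coefficients for P_main and P1..P4 would not be separately identifiable without dropping one of them or adding a penalty.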