Feature engineering using XGBoost


I am participating in a Kaggle competition. I am planning to use the XGBoost package (in R). I read the XGBoost documentation and understood the basics.

Can someone explain how feature engineering is done using XGBoost?

An example for explanation would be of great help.

Ashish kulkarni

Posted 2016-12-11T11:43:55.263

Reputation: 131

Question was closed 2020-07-29T08:03:20.020

Feature engineering is usually done before the modelling stage (in which one can use xgboost). Why do you want to do feature engineering using xgboost? – DaL – 2016-12-11T11:55:38.413

@Dan Levin: Thanks for pointing that out. So is it that we decide on some features, use xgboost on them, and then evaluate its performance? And if not satisfied, change the features and use xgboost again? – Ashish kulkarni – 2016-12-11T14:03:34.723

Indeed, in the common scenario you first do the feature engineering and then the modelling. Note that if the modelling doesn't satisfy you, going back to feature engineering is one of many options. You might try xgboost with different parameters (e.g., the number of trees). You might try a different classification model (e.g., SVM, logistic regression). The problem might be in any of the steps: data collection, preprocessing, feature engineering, feature selection, labeling, evaluation, etc. – DaL – 2016-12-11T14:12:58.770

@Dan Levin: Thanks a lot. I was wrong with the fundamentals. Will write this in the answer. – Ashish kulkarni – 2016-12-11T14:25:40.087



It turns out that the question I asked was based on a wrong premise.

Feature engineering is done first, and then xgboost is used to build a model from the resulting features. If we are not satisfied with the model's performance, we can go back to feature engineering.
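Since the question asked for an example, here is a minimal sketch of what "feature engineering first" can look like. It is written in plain Python rather than R to stay self-contained, and the records, field names, and derived features are all invented for illustration. It only shows the general shape: raw rows go in, a numeric feature matrix comes out, and that matrix is what would later be handed to xgboost (or any other model).

```python
from datetime import datetime

# Hypothetical raw records; the field names are made up for this sketch.
raw_rows = [
    {"signup": "2016-12-09T18:30:00", "visits": 12, "purchases": 3},
    {"signup": "2016-12-11T09:05:00", "visits": 4, "purchases": 0},
    {"signup": "2016-12-05T23:50:00", "visits": 0, "purchases": 0},
]

# Column order of the engineered feature matrix.
FEATURE_NAMES = [
    "visits",
    "purchases",
    "purchase_rate",      # engineered: ratio of two raw columns
    "signup_weekday",     # engineered: extracted from the timestamp
    "signup_is_weekend",  # engineered: flag derived from the weekday
]


def engineer_features(row):
    """Turn one raw record into a numeric feature vector."""
    signup = datetime.fromisoformat(row["signup"])
    weekday = signup.weekday()  # Monday = 0 ... Sunday = 6
    # Ratio feature, guarding against division by zero.
    purchase_rate = row["purchases"] / row["visits"] if row["visits"] else 0.0
    return [
        float(row["visits"]),
        float(row["purchases"]),
        purchase_rate,
        float(weekday),
        1.0 if weekday >= 5 else 0.0,
    ]


# This matrix, not the raw records, is what the modelling step would consume.
feature_matrix = [engineer_features(row) for row in raw_rows]

for features in feature_matrix:
    print(dict(zip(FEATURE_NAMES, features)))
```

If the model built on these columns performs poorly, one option is to come back to `engineer_features`, add or change derived columns, and train again.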

Thanks Dan Levin for the explanation.

Ashish kulkarni

Posted 2016-12-11T11:43:55.263

Reputation: 131

There is potentially more that you could do. For instance, xgboost can output a variable importance analysis after training, which you can use to help assess new features that you have created. A high importance score doesn't necessarily mean an engineered feature is useful, because it is probably correlated with an existing input variable. You need to look at overall model performance too. – Neil Slater – 2016-12-11T15:52:00.490
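One rough way to act on the correlation caveat, sketched in plain Python with made-up columns: before reading much into an importance score, check how strongly the engineered feature correlates with the inputs you already had. The column names and the 0.95 threshold are assumptions chosen for illustration, and Pearson correlation only catches linear redundancy, so this complements a comparison of overall model performance rather than replacing it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical existing input columns.
existing = {
    "visits": [1, 2, 3, 4, 5, 6],
    "age": [30, 22, 41, 35, 28, 50],
}

# Hypothetical engineered feature: just a rescaled copy of "visits",
# so it carries no information the model did not already have.
engineered = [2 * v + 1 for v in existing["visits"]]

correlations = {name: pearson(col, engineered) for name, col in existing.items()}

# Assumed threshold: flag existing inputs the new feature nearly duplicates.
redundant_with = [name for name, r in correlations.items() if abs(r) > 0.95]

print(correlations)
print("nearly duplicates:", redundant_with)
```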