Adding extra variables to an XGBoost model is worsening the train and test accuracy

I am fitting a multiclass model using XGBoost and getting an accuracy of 96% on train and 95% on test, using an 80-20 train/test split. However, when I add two new features, the accuracy drops to 92% on train and 89% on test.

Doesn't XGBoost:

  1. Pick the most important variable to split a node on and leave out the rest (see the feature-importance sketch below)?
  2. Handle multicollinearity?
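
To check point 1, the trained booster can report how much each feature actually contributes to splits. A minimal sketch, assuming `bst` is the booster trained in the code further down:

```
# rank features by total gain contributed across all splits
importance = bst.get_score(importance_type='gain')  # 'weight' and 'cover' also work
for feat, gain in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(feat, gain)
```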

I have not used cross-validation. Could it be that I am still overfitting the data? (See the cross-validation sketch after the code below.)

This is the code I used:

```
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np
import xgboost as xgb

# df, labels2, and label_encoder are defined earlier
df_new_train, df_new_test, y_train, y_test = train_test_split(df, labels2, test_size=0.2)

dtrain = xgb.DMatrix(df_new_train, label=y_train)
dtest = xgb.DMatrix(df_new_test, label=y_test)

param = {
        'max_depth': 10,
        'eta': 0.01,
        'subsample': 0.6,
        'colsample_bytree': 0.5,
        #'alpha': 0.5,
        #'lambda': 0.5,
        'gamma': 10,
        'min_child_weight': 1,
        'objective': 'multi:softprob',  # per-class probabilities for multiclass training
        'num_class': 4}  # the number of classes that exist in this dataset
num_round = 1500

# the watchlist and early stopping are arguments to xgb.train, not entries in param
bst = xgb.train(param, dtrain, num_round,
                evals=[(dtrain, 'train'), (dtest, 'valid')],
                early_stopping_rounds=10)

preds = bst.predict(dtest)
preds_train = bst.predict(dtrain)

# multi:softprob returns one probability per class; argmax gives the predicted label
best_preds_train = np.asarray([np.argmax(line) for line in preds_train])
best_preds = np.asarray([np.argmax(line) for line in preds])

print(classification_report(y_test, best_preds, target_names=label_encoder.classes_))
```
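
For reference, a quick overfitting check with xgboost's built-in cross-validation; a sketch reusing `param`, `dtrain`, and `num_round` from above (the seed value is arbitrary):

```
# 5-fold CV on the training matrix; compares in-fold vs. held-out error
cv_results = xgb.cv(param, dtrain, num_boost_round=num_round, nfold=5,
                    metrics='merror', seed=42)
print(cv_results.tail())  # a wide gap between train-merror-mean and
                          # test-merror-mean points to overfitting
```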

kash

Posted 2019-11-08T05:38:20.923

Answers

One thing you can try: remove the early_stopping_rounds parameter, since the value of 10 is relatively low. XGBoost may be stopping training earlier than intended because it did not see any improvement within that window. You might also consider setting a random seed to get consistent results between attempts.
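
A minimal sketch of both suggestions, reusing `param`, `dtrain`, `dtest`, and `num_round` from the question (the seed value 42 is arbitrary):

```
# drop early_stopping_rounds entirely and fix the seed for reproducibility
# (a random_state in train_test_split would likewise make the split reproducible)
param['seed'] = 42  # consistent boosting between attempts
bst = xgb.train(param, dtrain, num_round,
                evals=[(dtrain, 'train'), (dtest, 'valid')])  # no early stopping
```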

Yohanes Alfredo

In general, when you change the data being fed into a model, you should also consider re-tuning the model parameters.

It could be that the addition of the two new features in your data set means that your existing model parameters (e.g. eta, gamma, etc.) are no longer optimal.
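
For example, a small grid search over a few of those parameters. This is a sketch only, with illustrative (not tuned) ranges, using `df_new_train` and `y_train` from the question:

```
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# illustrative grid; the values are placeholders, not recommendations
grid = {'max_depth': [4, 6, 10],
        'learning_rate': [0.01, 0.1],
        'gamma': [0, 1, 10]}
search = GridSearchCV(XGBClassifier(n_estimators=300), grid, cv=5, scoring='accuracy')
search.fit(df_new_train, y_train)
print(search.best_params_, search.best_score_)
```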

bradS
