How to determine if my GBM model is overfitting?

4

Below is a simplified example of a h2o gradient boosting machine model using R's iris dataset. The model is trained to predict sepal length.

The example yields an R² value of 0.93, which seems unrealistically high. How can I assess whether these are indeed realistic results or simply model overfitting?

library(datasets)
library(h2o)

# Start the local h2o cluster before creating any h2o frames
h2o.init()

# Get the iris dataset and convert it to an h2o frame
df <- iris
df.hex <- as.h2o(df)

# Train a GBM model to predict sepal length (column 1) from the other columns
gbm_model <- h2o.gbm(x = 2:5, y = 1, training_frame = df.hex,
                     ntrees = 100, max_depth = 4, learn_rate = 0.1)

# Check accuracy (R^2 on the training data)
perf_gbm <- h2o.performance(gbm_model)
rsq_gbm <- h2o.r2(perf_gbm)

---------->

> rsq_gbm
[1] 0.9312635

Borealis

Posted 2017-07-06T05:22:21.060

Reputation: 297

1 To check if you are overfitting, you need both a training and a test set. You can also use a validation set if you are doing optimization. I do not see any such sets here. I am assuming you are evaluating on your whole training set, in which case you are definitely overfitting. Can you show your results on a separate, previously unseen test set? – Sal – 2017-07-06T06:08:03.167

Answers

5

The term overfitting means the model is learning relationships between attributes that only exist in this specific dataset and do not generalize to new, unseen data. By looking only at the model's accuracy on the data that was used to train it, you won't be able to tell whether your model is overfitting.

To see if you are overfitting, split your dataset into two separate sets:

  • a train set (used to train the model)
  • a test set (used to test the model accuracy)

A 90% train, 10% test split is very common. Train your model on the train set and evaluate its performance on both the train and the test set, as in the sketch below. If the accuracy on the test set is much lower than the model's accuracy on the train set, the model is overfitting.
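For example, here is a minimal sketch of such a split with h2o, reusing the frame and hyperparameters from the question; the 0.8/0.2 ratio and the seed are illustrative choices, not requirements:

library(h2o)
h2o.init()

df.hex <- as.h2o(iris)

# Split into train (80%) and test (20%) frames
splits <- h2o.splitFrame(df.hex, ratios = 0.8, seed = 1234)
train <- splits[[1]]
test  <- splits[[2]]

# Train on the train set only
gbm_model <- h2o.gbm(x = 2:5, y = 1, training_frame = train,
                     ntrees = 100, max_depth = 4, learn_rate = 0.1)

# Compare R^2 on the training data vs. the held-out test data
rsq_train <- h2o.r2(h2o.performance(gbm_model, train = TRUE))
rsq_test  <- h2o.r2(h2o.performance(gbm_model, newdata = test))

A large gap between rsq_train and rsq_test is the overfitting signal described above.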

You can also use cross-validation (e.g. splitting the data into 10 folds of equal size; in each iteration one fold is used as the test set and the rest as the train set) to get an estimate that is less influenced by irregularities in a single split, as in the sketch below.
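h2o has cross-validation built in; a minimal sketch, assuming the same df.hex frame as above (nfolds = 10 and the seed are illustrative):

# 10-fold cross-validated GBM on the full frame
gbm_cv <- h2o.gbm(x = 2:5, y = 1, training_frame = df.hex,
                  ntrees = 100, max_depth = 4, learn_rate = 0.1,
                  nfolds = 10, seed = 1234)

# R^2 aggregated over the cross-validation folds
rsq_cv <- h2o.r2(gbm_cv, xval = TRUE)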

Simon Boehm

Posted 2017-07-06T05:22:21.060

Reputation: 352

2

I suggest training/testing your classifier on separate splits of the original dataset, and then printing a confusion matrix: https://topepo.github.io/caret/measuring-performance.html#

This shows how many of the 'true' classifications your classifier predicted correctly or incorrectly, and the same for the 'false' classifications. It gives you more information than accuracy alone: a model trained on data where most of the labels are 1, for example, will predict 1 most of the time and can still report reasonably high accuracy in doing so. A confusion matrix is a sanity check for this.
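A minimal sketch with caret, assuming a classification target (e.g. iris$Species) rather than the regression target in the question; the split ratio and method = "gbm" are illustrative choices:

library(caret)

# Hold out 20% of the rows as a test split
set.seed(1234)
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

# Fit a gbm classifier on the train split (requires the gbm package)
fit  <- train(Species ~ ., data = train_df, method = "gbm", verbose = FALSE)
pred <- predict(fit, newdata = test_df)

# Confusion matrix on the held-out test split
confusionMatrix(pred, test_df$Species)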

Dan Carter

Posted 2017-07-06T05:22:21.060

Reputation: 1 521

1

In my opinion, in real-world cases the nature of the problem plays the main part in how you handle overfitting and evaluate a classifier as a whole.

What I do is look at all the metrics provided but give more weight to the AUC score and the F1 score. The combination of these two can provide useful information about decision boundaries and generalization as well.
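A minimal sketch of pulling both metrics from h2o, assuming a binary target; the question's model is a regression, where AUC and F1 do not apply, so a made-up virginica/not-virginica label is used here purely for illustration:

library(h2o)
h2o.init()

# Hypothetical binary label derived from iris for illustration only
iris_bin <- iris
iris_bin$is_virginica <- as.factor(iris_bin$Species == "virginica")
bin.hex <- as.h2o(iris_bin)

# 5-fold cross-validated GBM classifier (Species itself is excluded from x)
bin_model <- h2o.gbm(x = 1:4, y = "is_virginica", training_frame = bin.hex,
                     nfolds = 5, seed = 1234)

perf <- h2o.performance(bin_model, xval = TRUE)
h2o.auc(perf)   # area under the ROC curve
h2o.F1(perf)    # F1 score across prediction thresholds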

Andreas Vrangas

Posted 2017-07-06T05:22:21.060

Reputation: 11