Is this over-fitting or something else?

1

I recently put together an entry for the House Prices Kaggle competition for beginners. I decided to try my hand at understanding and using XGBoost.

I split Kaggle's 'training' data into my own 'training' and 'testing' sets. Then I fit and tuned my model on the new training data with K-fold CV, scoring it via scikit-learn's cross_val_score using a shuffled KFold.

The average score on the training set with this cross-validation was 0.0168 (mean squared log error).

Next, with the fully tuned model, I checked its performance on the never-before-seen 'test' set (not the final test set for the Kaggle leaderboard). The score was identical after rounding.
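In case it helps, here is a simplified sketch of that workflow (the real notebook has more preprocessing and tuning; the hyperparameters and file path below are placeholders, and non-numeric columns are dropped just to keep the example short):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_log_error
from xgboost import XGBRegressor

# Kaggle's labeled training file; keep only numeric features for brevity
train = pd.read_csv("train.csv")
X = train.drop(columns=["SalePrice"]).select_dtypes(include=[np.number])
y = train["SalePrice"]

# Split Kaggle's training data into my own train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Placeholder hyperparameters, not the tuned ones from the notebook
model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=3)

# Cross-validated mean squared log error on the training portion
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = -cross_val_score(
    model, X_train, y_train, cv=cv, scoring="neg_mean_squared_log_error"
)
print("CV MSLE:", scores.mean())

# Score on the held-out split the model never saw during tuning
model.fit(X_train, y_train)
print("Holdout MSLE:", mean_squared_log_error(y_test, model.predict(X_test)))
```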

So, I patted myself on the back because I'd avoided over-fitting... or so I thought. When I made my submission to the competition, my score became 0.1359, which is a massive drop in performance. It amounts to being a solid 25 grand wrong on my house price predictions.

What could be causing this, if not overfitting?

Here is the link to my notebook, if it helps: https://www.kaggle.com/wesleyneill/house-prices-walk-through-with-xgboost

rocksNwaves

Posted 2020-05-13T00:17:37.477

Reputation: 279

Why did you make your own test set? I quickly glanced at the competition and it seems like a test set was provided. – S van Balen – 2020-05-13T09:35:45.330

@SvanBalen The test set provided is unlabeled data used for final competition evaluation, not for model selection or tuning. – rocksNwaves – 2020-05-13T18:35:53.197

Answers

1

I'm not an avid Kaggler, but I do remember a case where the evaluation set for time-related data was randomly picked (which favored nearest-neighbor approaches, since exact duplicates could exist).

I'm not sure whether there are clues about the evaluation data this time (perhaps you can tell), but a possible overfit could be time-related.

If your test set is just a random subsample of the train/test part, while the evaluation part is not randomly sampled but is instead, for instance, a holdout of the year 2011, your model can still learn rules specific to the time dimension that a random test split will never expose.

A possible way of tackling that would be to resample your test set accordingly, as in the sketch below.
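As a sketch only, assuming a year-of-sale column such as 'YrSold' is available in the training file:

```python
import pandas as pd

# Instead of a purely random split, hold out the most recent year of sales
# so the local test set mimics a time-based evaluation split.
train = pd.read_csv("train.csv")

cutoff_year = train["YrSold"].max()
train_part = train[train["YrSold"] < cutoff_year]
test_part = train[train["YrSold"] == cutoff_year]

print(len(train_part), "training rows,",
      len(test_part), "held-out rows from year", cutoff_year)
```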

S van Balen

Posted 2020-05-13T00:17:37.477

Reputation: 1,294

1

You've followed the right process. (Though it's possible there's an error somewhere, like not randomly sampling the test set.)

I think the issue is simply that you have nevertheless overfit. The Kaggle held-out test set may, by chance, just not be that similar to the provided training data. There's not a lot you can do except favor low-variance models over low-bias models in your selection process.
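As an illustration only (these values are not tuned for this competition), "favoring low variance" might mean preferring more heavily regularized XGBoost settings during model selection:

```python
from xgboost import XGBRegressor

# Illustrative settings that trade a little bias for lower variance.
low_variance_model = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.03,    # smaller steps per boosting round
    max_depth=3,           # shallower trees
    min_child_weight=5,    # require more samples per leaf
    subsample=0.8,         # row subsampling
    colsample_bytree=0.8,  # feature subsampling per tree
    reg_alpha=0.1,         # L1 regularization
    reg_lambda=1.0,        # L2 regularization
)
```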

Sean Owen

Posted 2020-05-13T00:17:37.477

Reputation: 5,987