I recently put together an entry for the House Prices Kaggle competition for beginners. I decided to try my hand at understanding and using XGBoost.
I split Kaggle's 'training' data into 'training' and 'testing'. Then I fit and tuned my model on the new training data using KFold CV and got a score with scikit's
cross_val_score using a KFold with shuffle.
the average score on the training set with this cross validation was 0.0168 (mean squared log error).
Next, with the fully tuned model, I check its performance on the never before seen 'test' set (not the final test set for the Kaggle leader board). The score is identical after rounding.
So, I pat myself on the back because I've avoided over-fitting... or so I thought. When I made my submission to the competition, my score became 0.1359, which is a massive drop in performance. It amounts to being a solid 25 grand wrong on my house price predictons.
What could be causing this, if not overfitting?
Here is the link to my notebook, if it helps: https://www.kaggle.com/wesleyneill/house-prices-walk-through-with-xgboost