As I increase the number of trees in scikit-learn's
GradientBoostingRegressor, I get more negative predictions, even though there are no negative values in my training or test sets. I have about 10 features, most of which are binary.
Some of the parameters I was tuning (a rough sketch of the sweep follows this list) were:
- the number of trees/iterations (n_estimators);
- the tree depth (max_depth);
- and the learning rate (learning_rate).
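The sweep itself looked roughly like the sketch below; make_regression stands in here for the Kaggle data (which I can't paste), and the grid values are just illustrative:

    from itertools import product

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
    y = y - y.min()  # shift targets so they are all non-negative, as in my data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for n_trees, depth, lr in product([500, 2000, 8000], [1, 3], [0.1, 0.01]):
        reg = GradientBoostingRegressor(
            n_estimators=n_trees, max_depth=depth, learning_rate=lr
        ).fit(X_train, y_train)
        pct_neg = 100 * (reg.predict(X_test) < 0).mean()
        print(f"trees={n_trees} depth={depth} lr={lr}: {pct_neg:.2f}% negative")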
The percentage of negative predictions seemed to peak at about 2%. A tree depth of 1 (stumps) seemed to produce the largest share of negative values, and that share also seemed to increase with more trees and a smaller learning rate. The dataset is from one of the Kaggle playground competitions.
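To see the effect of the number of trees directly, staged_predict can track the fraction of negative predictions after each boosting stage. A self-contained sketch of that check, again with make_regression standing in for the real data:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
    y = y - y.min()  # all targets non-negative, as in my data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    reg = GradientBoostingRegressor(
        n_estimators=8000, max_depth=1, learning_rate=0.01
    ).fit(X_train, y_train)

    # staged_predict yields the test predictions after each boosting stage
    neg_frac = np.array(
        [(pred < 0).mean() for pred in reg.staged_predict(X_test)]
    )
    print(neg_frac[999::1000])  # fraction of negative predictions every 1000 trees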
My code is something like:
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y)
    reg = GradientBoostingRegressor(
        n_estimators=8000,
        max_depth=1,
        loss="ls",  # least-squares loss; renamed "squared_error" in newer scikit-learn
        learning_rate=0.01,
    )
    reg.fit(X_train, y_train)
    ypred = reg.predict(X_test)
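and then I check the share of negative predictions with something like:

    import numpy as np

    pct_neg = 100 * np.mean(ypred < 0)  # ypred from the snippet above
    print(f"{pct_neg:.2f}% of test predictions are negative")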