## Why does Gradient Boosting regression predict negative values when there are no negative y-values in my training set?

10

1

As I increase the number of trees in scikit-learn's GradientBoostingRegressor, I get more negative predictions, even though there are no negative values in my training or testing set. I have about 10 features, most of which are binary.

Some of the parameters that I was tuning were:

• the number of trees/iterations;
• tree depth (max_depth);
• and learning rate.

The percentage of negative values seemed to max out at ~2%. A tree depth of 1 (stumps) seemed to produce the largest percentage of negative values, and this percentage also seemed to increase with more trees and a smaller learning rate. The dataset is from one of the Kaggle playground competitions.

My code is something like:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

reg = GradientBoostingRegressor(n_estimators=8000, max_depth=1, loss='ls', learning_rate=0.01)
reg.fit(X_train, y_train)
ypred = reg.predict(X_test)


1Any chance of a reproducible example with code and data? – Spacedman – 2014-06-25T07:29:35.100

2which playground competition is it? – TheAxeR – 2014-06-25T16:01:01.027


11

Remember that GradientBoostingRegressor (assuming a squared-error loss function) successively fits regression trees to the residuals of the previous stage. If the tree at stage i predicts a value larger than the target variable for a particular training example, the residual at stage i for that example will be negative, and so the regression tree at stage i+1 will face negative target values (the residuals from stage i). Since the boosting algorithm adds up all these trees to make the final prediction, this can explain why you may end up with negative predictions even though all the target values in the training set were positive, especially since you mention that this happens more often when you increase the number of trees.
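To see concretely why later trees face negative targets: with squared-error loss, the stage-0 prediction is simply the mean of y, so every below-average example already has a negative residual before the first tree is even fit. A minimal sketch on synthetic data (the shapes, seed, and hyperparameters here are arbitrary assumptions, not the asker's setup):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = rng.random(300) * 10.0  # strictly positive targets

reg = GradientBoostingRegressor(n_estimators=300, max_depth=1, learning_rate=0.01)
reg.fit(X, y)

# The stage-0 prediction is y.mean(); residuals for below-average
# examples are negative, so subsequent trees fit negative targets.
residuals = y - y.mean()
print((residuals < 0).any())  # True
```

Because the final prediction is the sum of these (possibly negative) tree outputs, nothing constrains it to stay above zero.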

This is the correct answer. – hahdawg – 2018-06-26T17:38:52.440

8

In general, regression models (of any kind) can behave arbitrarily outside the domain spanned by the training samples. In particular, they are free to assume linearity of the modeled function, so if you train a regression model on the points:

X     Y
10    0
20    1
30    2


it is reasonable to build the model f(x) = x/10 - 1, which returns negative values for x < 10.
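This is easy to verify: an ordinary least-squares fit to the three points above recovers exactly f(x) = x/10 - 1, which extrapolates to negative values below x = 10. A minimal sketch using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10.0], [20.0], [30.0]])
y = np.array([0.0, 1.0, 2.0])

model = LinearRegression().fit(X, y)
# The three points lie exactly on f(x) = x/10 - 1,
# so extrapolating below x = 10 gives negative predictions.
print(model.predict([[5.0]]))  # ≈ [-0.5]
```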

The same applies "in between" your data points: it is always possible that, due to the assumed family of functions (those that can be modeled by the particular method), you will get values "outside your training samples".

You can think about this another way: what is so special about negative values? Why do you find the existence of negative values weird (when none were provided in the training set), while you are not alarmed by the existence of, let's say, the value 2131.23? Unless developed in such a way, no model will treat negative values differently from positive ones. They are simply a natural part of the real values, attainable like any other value.

With regard to your set of questions, I think it is simply that negative values are easier to identify as anomalies because they have that "-" in front of them or clearly dip below zero on graphs. The question could just as easily be "Why does Gradient Boosting regression predict previously unseen values?". Maybe you could expand on that? It would certainly get you an upvote from me. – josh – 2016-09-16T11:15:46.613

1@lejlot -- Generally speaking, this is not true. Regression models with logistic or tanh activations are often guaranteed to have outputs within certain bounds. – user48956 – 2016-10-04T18:23:43.397

@user48956 The answer states that models "can behave in an arbitrary way"; I am not claiming that you cannot force some constraints, of course you can. The answer only states that there is no "data-dependent" constraint (unless you have a very specific model that has this built into its construction); if you add one manually as an expert, that is up to you. – lejlot – 2016-10-04T18:50:24.750

1Random forests and nearest-neighbor models cannot make predictions outside their training range, for example. – Ben Reiniger – 2021-01-14T22:24:10.780
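This bound is easy to check empirically: a random forest's prediction is an average of training targets in the leaves, so it can never leave the interval [y.min(), y.max()]. A minimal sketch on synthetic data (shapes and seed are arbitrary assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = rng.random(200) * 10.0

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = rf.predict(rng.random((100, 3)))

# Every tree prediction averages training targets,
# so the ensemble stays inside the training range.
print(pred.min() >= y.min(), pred.max() <= y.max())  # True True
```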

1

The default number of estimators is 100. Reducing the number of estimators may work.