Correcting log-bias in the output of an XGBoost model


I previously worked with GAMs, doing regression on a log-transformed variable. The log transformation introduced a negative bias in the average of the back-transformed predictions, and I corrected for this by multiplying each of the predictions $\exp(\hat y)$ by the factor

$$\langle \exp( \delta \hat y) \rangle$$

where $\delta \hat y$ are the residuals from the GAM and $\langle \cdot \rangle$ denotes the average over observations.

Now I am using XGBoost, again doing regression on a log-transformed variable $y$. The predictions $\hat y$ satisfy

$$ \frac{\sum_i \hat y_i}{\sum_i y_i} = 0.999 $$

so on the log scale it looks good overall. However, when I exp-transform both the predictions and the observations I get

$$ \frac{\sum_i \exp(\hat y_i)}{\sum_i \exp(y_i)} = 0.861 $$

which is considerably worse. I suspect that this is due to the negative bias. Is there a way to correct for this bias in XGBoost, as in a GAM/GLM?


Posted 2018-02-19T15:18:25.933

Reputation: 131



From this perspective, modern ML regression techniques are no different from classic ones. A model that is mean-unbiased on the log scale will automatically be mean-biased (too low, on average) after retransformation, by Jensen's inequality.
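To spell out the Jensen step: the notation $\varepsilon = y - \hat y$ for the log-scale residual and the symbol $\sigma^2$ for its variance are introduced here for illustration. Because $\exp$ is convex,

$$ E[\exp(y)] \ge \exp(E[y]) $$

so exponentiating an unbiased log-scale prediction underestimates the mean on the original scale. If one additionally assumes that $\varepsilon$ is independent of the features, then

$$ E[\exp(y) \mid x] = \exp(\hat y) \, E[\exp(\varepsilon)] $$

and under the further assumption of normal residuals, $E[\exp(\varepsilon)] = \exp(\sigma^2 / 2)$. Without the normality assumption, the sample average $\langle \exp(\delta \hat y) \rangle$ from the question estimates the same factor.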

Depending on the aim of your analysis, applying a retransformation bias correction factor can be useful for algorithms like XGBoost as well. (Be careful with models that impose L1/L2 penalties on the intercept, though; this is rare.)

The specific correction factor can be calculated directly from the residuals in the same way as for OLS.
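As a minimal sketch of that calculation, the snippet below computes the factor $\langle \exp(\delta \hat y) \rangle$ from log-scale residuals and applies it to the back-transformed predictions. The function name `smearing_factor` and the synthetic data are assumptions made for illustration; in practice `y_log_pred` would be the XGBoost predictions on the log scale.

```python
import numpy as np


def smearing_factor(y_log, y_log_pred):
    """Average of exp(residual), with residuals taken on the log scale."""
    resid = np.asarray(y_log) - np.asarray(y_log_pred)
    return float(np.mean(np.exp(resid)))


# Synthetic stand-in for model output (an assumption, not real data):
# log-scale predictions plus independent noise give the observed log response.
rng = np.random.default_rng(0)
n = 10_000
y_log_pred = rng.uniform(0.0, 2.0, size=n)
y_log = y_log_pred + rng.normal(0.0, 0.5, size=n)

factor = smearing_factor(y_log, y_log_pred)

# Ratio of summed predictions to summed observations on the original scale,
# before and after multiplying the predictions by the correction factor.
raw_ratio = np.exp(y_log_pred).sum() / np.exp(y_log).sum()
corrected_ratio = (factor * np.exp(y_log_pred)).sum() / np.exp(y_log).sum()

print(f"factor={factor:.3f} raw={raw_ratio:.3f} corrected={corrected_ratio:.3f}")
```

A single multiplicative factor implicitly assumes the residual distribution does not depend on the features; if the residual spread varies strongly across the data, one global factor may not be adequate.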

Michael M

Posted 2018-02-19T15:18:25.933

Reputation: 460

In OLS we calculate the correction factor from the residuals of the regression directly. To my knowledge we don't have residuals in the same way in XGB (?) – N08 – 2018-02-20T08:50:45.903

We do! They are just the differences between observed and predicted responses. If you fear overfitting, you can always use residuals from cross-validation or a hold-out data set. – Michael M – 2018-02-20T10:51:53.460
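A rough sketch of the cross-validation variant is given below: each residual comes from a model that did not see that observation during fitting. The helper names `out_of_fold_factor` and `lstsq_fit_predict` are hypothetical, the data are synthetic, and a least-squares fit stands in for XGBoost so the example needs only NumPy; any function that fits on a training fold and returns log-scale predictions for the held-out fold could be passed instead.

```python
import numpy as np


def out_of_fold_factor(X, y_log, fit_predict, n_splits=5, seed=0):
    """Correction factor mean(exp(residual)) from out-of-fold residuals."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_log)), n_splits)
    resid = np.empty(len(y_log))
    for k, test_idx in enumerate(folds):
        # Train on every fold except the k-th, predict on the k-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        pred = fit_predict(X[train_idx], y_log[train_idx], X[test_idx])
        resid[test_idx] = y_log[test_idx] - pred
    return float(np.mean(np.exp(resid)))


def lstsq_fit_predict(X_train, y_train, X_test):
    """Stand-in learner (linear least squares), used here instead of XGBoost."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 2.0, size=(2000, 1))
y_log = 1.0 + 0.5 * X[:, 0] + rng.normal(0.0, 0.5, size=2000)

oof_factor = out_of_fold_factor(X, y_log, lstsq_fit_predict)
print(f"out-of-fold correction factor: {oof_factor:.3f}")
```

In-sample residuals from a flexible model tend to be too small, which would shrink the factor toward 1; out-of-fold residuals avoid that at the cost of refitting the model once per fold.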