How is the equation for the relation between prediction error, bias, and variance defined?


If we denote the variable we are trying to predict as $Y$ and our covariates as $X$, we may assume that there is a relationship between them of the form $Y=f(X)+\epsilon$, where the error term $\epsilon$ is normally distributed with mean zero and variance $\sigma^2_\epsilon$, i.e. $\epsilon\sim\mathcal{N}(0,\,\sigma^2_\epsilon)$.

We may estimate a model $\hat{f}(X)$ of $f(X)$. The expected squared prediction error at a point $x$ is: $$Err(x)=E[(Y-\hat{f}(x))^2]$$ This error may then be decomposed into bias and variance components: $$Err(x)=(E[\hat{f}(x)]-f(x))^2+E\big[(\hat{f}(x)-E[\hat{f}(x)])^2\big]+\sigma^2_\epsilon$$ $$Err(x)=Bias^2+Variance+Irreducible\ Error$$
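One way to see the decomposition concretely is to estimate each term by Monte Carlo simulation. The setup below is entirely hypothetical and my own (true function $f=\sin$, noise level $0.3$, a deliberately underfit degree-1 polynomial as $\hat{f}$); it is a sketch, not the method from the question:

```python
# Monte Carlo check of Err(x0) = Bias^2 + Variance + sigma_eps^2
# at a single point x0. Hypothetical setup: f = sin, eps ~ N(0, 0.3^2),
# f-hat = degree-1 polynomial fit (underfit on purpose, so Bias^2 > 0).
import numpy as np

rng = np.random.default_rng(0)
sigma_eps = 0.3
f = np.sin
x0 = 2.0
n_train, n_trials = 30, 5000

# Refit f-hat on a fresh training sample each trial; record f-hat(x0).
preds = np.empty(n_trials)
for i in range(n_trials):
    X = rng.uniform(0, 3, n_train)
    Y = f(X) + rng.normal(0, sigma_eps, n_train)
    coef = np.polyfit(X, Y, deg=1)
    preds[i] = np.polyval(coef, x0)

bias2 = (preds.mean() - f(x0)) ** 2          # (E[f-hat(x0)] - f(x0))^2
variance = preds.var()                        # E[(f-hat(x0) - E[f-hat(x0)])^2]

# Direct estimate of Err(x0) = E[(Y0 - f-hat(x0))^2], with fresh noise Y0.
err = np.mean((f(x0) + rng.normal(0, sigma_eps, n_trials) - preds) ** 2)

print(bias2 + variance + sigma_eps**2)  # should be close to err
print(err)
```

The two printed numbers agree up to Monte Carlo error, which is the decomposition in action.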

I'm wondering how the last two equations are derived from the first one.


If: $$Err(x)=E[(Y-\hat{f}(x))^2]$$ Then, by adding and subtracting $f(x)$, $$Err(x)=E[(Y-f(x)+f(x)-\hat{f}(x))^2]$$ $$= E[(Y-f(x))^2] + E[(\hat{f}(x)-f(x))^2] - 2E[(Y-f(x))(\hat{f}(x)-f(x))]$$ The first term is the irreducible error, by definition. The second term can be expanded like this: $$E[(\hat{f}(x)-f(x))^2] = E[\hat{f}(x)^2]+E[f(x)^2] -2E[f(x)\hat{f}(x)]$$

$$=E[\hat{f}(x)^2]+f(x)^2-2f(x)E[\hat{f}(x)]$$

$$= E[\hat{f}(x)^2]-E[\hat{f}(x)]^2+E[\hat{f}(x)]^2+f(x)^2-2f(x)E[\hat{f}(x)]$$

$$= E\big[(\hat{f}(x)-E[\hat{f}(x)])^2\big] + (E[\hat{f}(x)]-f(x))^2$$

$$= Variance + Bias^2$$

Then the only thing left is to prove that the third term is zero. This follows from $E[Y] = f(x)$, i.e. $E[\epsilon] = 0$.

Edit

I am not so sure how to prove $$E[(Y-f(x))(\hat{f}(x)-f(x))] = 0$$ If we assume independence between $\epsilon = Y - f(x)$ and $\hat{f}(x)-f(x)$, the proof is straightforward: the expectation splits into the product of two expectations, the first of which is $0$ because $E[\epsilon]=0$. However, I am not so sure we can assume this independence.
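The independence argument can be sanity-checked numerically. In the sketch below (a hypothetical setup of my own: $f=\sin$, a linear fit as $\hat{f}$), the noise $\epsilon$ at the test point is drawn independently of the training sample that produced $\hat{f}$, so the cross term averages to roughly zero:

```python
# Numeric check that E[eps * (f-hat(x0) - f(x0))] is ~0 when eps is
# independent of the training data behind f-hat. Hypothetical setup.
import numpy as np

rng = np.random.default_rng(1)
sigma_eps = 0.3
f = np.sin
x0 = 2.0
n_train, n_trials = 30, 10000

cross = np.empty(n_trials)
for i in range(n_trials):
    X = rng.uniform(0, 3, n_train)
    Y = f(X) + rng.normal(0, sigma_eps, n_train)
    coef = np.polyfit(X, Y, deg=1)        # f-hat from a fresh sample
    eps = rng.normal(0, sigma_eps)        # fresh noise, independent of f-hat
    cross[i] = eps * (np.polyval(coef, x0) - f(x0))

print(cross.mean())  # close to 0
```

Note that the check relies on $\epsilon$ being fresh test noise; if $\hat{f}$ were evaluated on the same noisy points it was trained on, this term would not vanish.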

Could you please explain a bit more on why the third term is 0? Why $E[Y]=f(x)$ leads the third term to be 0? How to calculate the third term? – CyberPlayerOne – 2018-06-15T10:12:40.443

I have edited the answer. – David Masip – 2018-06-15T14:16:44.067

Since $\epsilon$ is the noise, I think it should be independent of the model and the data, so the independence you assumed should be valid. – CyberPlayerOne – 2018-06-15T17:35:36.310

Indeed, I think this should be it. Good work – David Masip – 2018-06-15T20:54:29.860

The third term is zero because: 1. As someone pointed out, the noise is independent of the data. 2. The noise is assumed to be normally distributed with mean zero. – Velu44 – 2019-07-11T10:11:13.350


This comes from some standard definitions, really. There is a similar question on Cross Validated SE with good answers, and related questions there are worth looking through too.

The $\sigma^2_\epsilon$ term is the irreducible error: essentially the noise that comes with the random variable itself. There may not be much of it, but we normally write it at the end of such equations.

In the context of a real-world machine learning problem, I also sometimes think of that term as accounting for information that I simply cannot explain with the data I have. For that particular project, it is as good as irreducible error.