Bias-variance decomposition: term for expected squared forecast error less irreducible error

Hastie et al. “The Elements of Statistical Learning” (2009) consider a data generating process
$$
Y = f(X) + \varepsilon
$$
with $\mathbb{E}(\varepsilon)=0$ and $\text{Var}(\varepsilon)=\sigma^2_{\varepsilon}$.

They present the following bias-variance decomposition of the expected squared forecast error at point $x_0$ (p. 223, formula 7.9):
$$
\begin{aligned}
\text{Err}(x_0) &= \mathbb{E}\left( [ Y - \hat f(x_0) ]^2 \mid X = x_0 \right) \\
&= \dots \\
&= \sigma^2_{\varepsilon} + \text{Bias}^2(\hat f(x_0)) + \text{Var}(\hat f(x_0)) \\
&= \text{Irreducible error} + \text{Bias}^2 + \text{Variance} .
\end{aligned}
$$
In my own work I do not specify $\hat f(\cdot)$ but take an arbitrary forecast $\hat y$ instead (if this is relevant).
Question: I am looking for a term for
$$
\text{Bias}^2 + \text{Variance}
$$
or, more precisely,
$$
\text{Err}(x_0) - \text{Irreducible error}.
$$

Answer

I propose reducible error. This is also the terminology adopted in paragraph 2.1.1 of James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning, a book which is basically a simplification of ESL + some very cool R code laboratories (except for the fact that they use attach, but, hey, nobody’s perfect). I’ll list below the pros and cons of this terminology.


First of all, we must recall that we assume not only that $\epsilon$ has mean 0, but also that it is independent of $X$ (see paragraph 2.6.1, formula 2.29 of ESL, 2nd edition, 12th printing). Then of course $\epsilon$ cannot be predicted from $X$, no matter which hypothesis class $\mathcal{H}$ (family of models) we choose, and how large a sample we use to learn our hypothesis (estimate our model). This explains why $\sigma^2_{\epsilon}$ is called irreducible error.

By analogy, it seems natural to call the remaining part of the error, $\text{Err}(x_0)-\sigma^2_{\epsilon}$, the reducible error. Now, this terminology may sound somewhat confusing: as a matter of fact, under the assumptions we made for the data generating process, we can prove that

$$ f(x)=\mathbb{E}[Y\vert X=x]$$
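
A short sketch of why this holds, using only $\mathbb{E}(\epsilon)=0$ and $\epsilon\perp X$:

$$
\mathbb{E}[Y\vert X=x]=\mathbb{E}[f(X)+\epsilon\vert X=x]=f(x)+\mathbb{E}[\epsilon\vert X=x]=f(x)+\mathbb{E}[\epsilon]=f(x)
$$

The third equality is the step where independence of $\epsilon$ and $X$ is used.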

Thus, the reducible error can be reduced to zero if and only if $\mathbb{E}[Y\vert X=x]\in \mathcal{H}$ (assuming of course we have a consistent estimator). If $\mathbb{E}[Y\vert X=x]\notin \mathcal{H}$, we cannot drive the reducible error to 0, even in the limit of an infinite sample size. However, it’s still the only part of our error which can be reduced, if not eliminated, by changing the sample size, introducing regularization (shrinkage) in our estimator, etc. In other words, by choosing another $\hat{f}(x)$ in our family of models.

Basically, reducible is meant not in the sense of zeroable (yuck!), but in the sense of that part of the error which can be reduced, even if not necessarily made arbitrarily small. Also, note that in principle this error can be reduced to 0 by enlarging $\mathcal{H}$ until it includes $\mathbb{E}[Y\vert X=x]$. In contrast, $\sigma^2_{\epsilon}$ cannot be reduced, no matter how large $\mathcal{H}$ is, because $\epsilon\perp X$.
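
To make this concrete, here is a rough Monte Carlo sketch. Everything specific in it is my own assumption for illustration and not from ESL or ISL: a quadratic true $f$, $X$ uniform on $[-1,1]$, Gaussian noise, polynomial least squares as the hypothesis class $\mathcal{H}$, and the hypothetical names `f` and `decompose`. It estimates $\text{Err}(x_0)$, the squared bias and the variance by refitting on many fresh training samples.

```python
import numpy as np


def f(x):
    # Hypothetical true regression function, chosen only for illustration
    return 1.0 + 2.0 * x + 3.0 * x**2


def decompose(degree, n_train, x0=0.8, sigma=1.0, n_rep=2000, seed=0):
    """Monte Carlo estimate of Err(x0), Bias^2 and Variance at x0
    for a least-squares polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_rep)
    sq_err = np.empty(n_rep)
    for r in range(n_rep):
        # Fresh training sample from Y = f(X) + eps, with eps independent of X
        x = rng.uniform(-1.0, 1.0, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        coef = np.polyfit(x, y, degree)  # best fit within H (polynomials of this degree)
        preds[r] = np.polyval(coef, x0)
        # Fresh response at x0 to measure the squared forecast error
        y0 = f(x0) + rng.normal(0.0, sigma)
        sq_err[r] = (y0 - preds[r]) ** 2
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    return {
        "err": sq_err.mean(),        # estimate of Err(x0)
        "sigma2": sigma**2,          # irreducible error (known here by construction)
        "bias2": bias2,
        "var": var,
        "reducible": bias2 + var,    # Err(x0) minus irreducible error
    }


if __name__ == "__main__":
    for degree in (1, 2):
        for n_train in (50, 500):
            res = decompose(degree, n_train)
            print(degree, n_train, {k: round(v, 3) for k, v in res.items()})
```

Under these assumptions, with `degree=1` the true $f$ is outside $\mathcal{H}$, so the reducible error should level off near the squared bias as `n_train` grows. With `degree=2` the true $f$ is inside $\mathcal{H}$ and the reducible error should shrink toward 0. In both cases the estimated total error should stay close to $\sigma^2_{\epsilon}$ plus the reducible part.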

Attribution
Source: Link, Question Author: Richard Hardy, Answer Author: DeltaIV
