Hastie et al. “The Elements of Statistical Learning” (2009) consider a data generating process

$$

Y = f(X) + \varepsilon

$$

with $\mathbb{E}(\varepsilon)=0$ and $\text{Var}(\varepsilon)=\sigma^2_{\varepsilon}$. They present the following bias-variance decomposition of the expected squared forecast error at a point $x_0$ (p. 223, formula 7.9):

$$
\begin{aligned}
\text{Err}(x_0) &= \mathbb{E}\left( [ Y - \hat f(x_0) ]^2 \mid X = x_0 \right) \\
&= \dots \\
&= \sigma^2_{\varepsilon} + \text{Bias}^2(\hat f(x_0)) + \text{Var}(\hat f(x_0)) \\
&= \text{Irreducible error} + \text{Bias}^2 + \text{Variance}.
\end{aligned}
$$

In my own work I do not specify $\hat f(\cdot)$ but take an arbitrary forecast $\hat y$ instead (if this is relevant).
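To make the decomposition concrete, here is a small Monte Carlo sketch (Python/NumPy; the choice of $f$, $\sigma_{\varepsilon}$, the sampling design, and the deliberately misspecified linear hypothesis class are all illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):               # true regression function (an arbitrary choice)
    return np.sin(x)

sigma = 0.5             # sd of the irreducible noise epsilon
x0 = 1.0                # evaluation point
n, n_sims = 30, 20000   # sample size per fit, number of simulated training sets

preds = np.empty(n_sims)
for i in range(n_sims):
    x = rng.uniform(0, 3, n)
    y = f(x) + rng.normal(0, sigma, n)
    # fit a straight line: a misspecified class, so the bias term is nonzero
    b1, b0 = np.polyfit(x, y, 1)
    preds[i] = b0 + b1 * x0

bias2 = (preds.mean() - f(x0)) ** 2   # Bias^2(f_hat(x0))
var = preds.var()                      # Var(f_hat(x0))

# Err(x0): average squared error against a fresh Y drawn at X = x0
y0 = f(x0) + rng.normal(0, sigma, n_sims)
err = np.mean((y0 - preds) ** 2)

print(f"Err(x0)                  = {err:.4f}")
print(f"sigma^2 + Bias^2 + Var   = {sigma**2 + bias2 + var:.4f}")
```

The two printed numbers agree up to Monte Carlo noise, which is exactly the content of formula 7.9.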

Question: I am looking for a term for

$$

\text{Bias}^2 + \text{Variance}

$$

or, more precisely,

$$

\text{Err}(x_0) - \text{Irreducible error}.

$$

**Answer**

I propose *reducible error*. This is also the terminology adopted in Section 2.1.1 of James, Witten, Hastie & Tibshirani, *An Introduction to Statistical Learning*, a book which is basically a simplification of ESL plus some very cool R code labs (except for the fact that they use `attach`, but, hey, nobody's perfect). I'll list below the pros and cons of this terminology.

First of all, we must recall that we assume $\varepsilon$ not only to have mean 0, but also to be *independent* of $X$ (see Section 2.6.1, formula 2.29 of ESL, 2nd edition, 12th printing). Then of course $\varepsilon$ cannot be predicted from $X$, no matter which hypothesis class $\mathcal{H}$ (family of models) we choose and how large a sample we use to learn our hypothesis (estimate our model). This explains why $\sigma^2_{\varepsilon}$ is called the *irreducible error*.

By analogy, it seems natural to call the remaining part of the error, $\text{Err}(x_0)-\sigma^2_{\varepsilon}$, the *reducible error*. Now, this terminology may sound somewhat confusing: as a matter of fact, under the assumptions we made on the data generating process, one can prove that

$$ f(x)=\mathbb{E}[Y\vert X=x]. $$

Thus, the *reducible error* can be reduced to zero *if and only if* $\mathbb{E}[Y\vert X=x]\in \mathcal{H}$ (assuming, of course, that we have a consistent estimator). If $\mathbb{E}[Y\vert X=x]\notin \mathcal{H}$, we cannot drive the reducible error to 0, even in the limit of an infinite sample size. However, it's still the only part of our error which can be reduced, if not eliminated, by increasing the sample size, introducing regularization (shrinkage) in our estimator, etc. In other words, by choosing another $\hat{f}(x)$ in our family of models.

Basically, *reducible* is meant not in the sense of *zeroable* (yuck!), but in the sense of that part of the error which can be reduced, even if not necessarily made arbitrarily small. Also, note that in principle this error can be reduced to 0 by enlarging $\mathcal{H}$ until it includes $\mathbb{E}[Y\vert X=x]$. In contrast, $\sigma^2_{\varepsilon}$ cannot be reduced, no matter how large $\mathcal{H}$ is, because $\varepsilon\perp X$.
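Both effects can be seen numerically. In the hypothetical sketch below (Python/NumPy; the quadratic $f$, noise level, and polynomial hypothesis classes are my own illustrative choices), a degree-1 fit has $\mathbb{E}[Y\vert X=x]\notin\mathcal{H}$, so its reducible error stays bounded away from zero, while enlarging $\mathcal{H}$ to degree-2 polynomials drives the reducible error toward zero:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                       # true f(x) = E[Y | X = x], a quadratic
    return 1.0 + 2.0 * x - 1.5 * x**2

sigma, x0, n, n_sims = 0.3, 0.8, 200, 5000

def reducible_error(degree):
    """Monte Carlo estimate of Bias^2 + Var at x0 for a polynomial fit."""
    preds = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.uniform(0, 2, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x0)
    return (preds.mean() - f(x0)) ** 2 + preds.var()

print(f"degree 1 (f not in H): {reducible_error(1):.4f}")  # stays well above 0
print(f"degree 2 (f in H):     {reducible_error(2):.4f}")  # near 0 for large n
```

Running either hypothesis class with a larger $\sigma$ would raise $\text{Err}(x_0)$ but leave these reducible-error figures essentially unchanged, since $\sigma^2_{\varepsilon}$ enters only the irreducible term.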

**Attribution**
*Source: Link, Question Author: Richard Hardy, Answer Author: DeltaIV*