# Understanding the Bayes risk

When evaluating an estimator, the two most commonly used criteria are probably the maximum risk and the Bayes risk. My question concerns the latter:

The Bayes risk under the prior $\pi$ is defined as follows:

$$r(\pi,\hat{\theta}) = \int_\Theta R(\theta,\hat{\theta})\,\pi(\theta)\,\text{d}\theta$$

I don’t quite get what the prior $\pi$ is doing and how I should interpret it. If I have a risk function $R(\theta, \hat{\theta} )$ and plot it, intuitively I would take the area under it as a criterion to judge how “strong” the risk is over all possible values of $\theta$. Involving the prior is close to this intuition but somehow breaks it. Can someone help me interpret the prior?

[Here is an excerpt from my own textbook, The Bayesian Choice (2007), that argues in favour of a decision-theoretic approach to Bayesian analysis, hence of using the Bayes risk.]

Except for the most trivial settings, it is generally
impossible to uniformly minimize (in $d$) the loss
function $\text{L}(\theta,d)$ when $\theta$ is unknown.
In order to derive an effective comparison criterion from
the loss function, the frequentist approach proposes to consider instead
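the average loss (or frequentist risk)

$$R(\theta,\delta) = \mathbb{E}_\theta[\text{L}(\theta,\delta(x))] = \int_{\mathcal X} \text{L}(\theta,\delta(x))\,f(x|\theta)\,\text{d}x,$$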

where $\delta(x)$ is the decision rule, i.e., the allocation of
a decision to each outcome $x\sim f(x|\theta)$ from the random
experiment.

The function $\delta$, from ${\mathcal X}$ to $\mathfrak{D}$, is usually called an estimator (while the value $\delta(x)$ is called an estimate of $\theta$). When there is no risk of confusion, we also denote the set of estimators by $\mathfrak{D}$.

The frequentist paradigm relies on this criterion
to compare estimators and, if possible, to select the best estimator,
the reasoning being that estimators are evaluated on their
long-run performance for all possible values of the parameter $\theta$.
Notice, however, that there are several difficulties associated with this approach.

1. The error (loss) is averaged over the different values of $x$
proportionally to the density $f(x|\theta)$. Therefore, it seems
that the observation $x$ is not taken into account any further. The
risk criterion evaluates procedures on their long-run performance
and not directly for the given observation, $x$. Such an evaluation
may be satisfactory for the statistician, but it is not so appealing
for a client, who wants optimal results for her data $x$, not that
of another’s!
2. The frequentist analysis of the decision problem implicitly assumes
that this problem will be met again and again, for the frequency
evaluation to make sense. Indeed, $R(\theta,\delta)$ is
approximately the average loss over i.i.d. repetitions of the same
experiment, according to the Law of Large Numbers. However, on both
philosophical and practical grounds, there is a lot of controversy
over the very notion of repeatability of experiments (see Jeffreys
(1961)). For one thing, if new observations come to the
statistician, she should make use of them, and this could modify the
way the experiment is conducted, as in, for instance, medical
trials.
3. For a procedure $\delta$, the risk $R(\theta, \delta)$ is a function
of the parameter $\theta$. Therefore, the frequentist approach does
not induce a total ordering on the set of procedures. It is
generally impossible to compare decision procedures with this
criterion, since two crossing risk functions prevent comparison
between the corresponding estimators. At best, one may hope for a
procedure $\delta_0$ that uniformly minimizes $R(\theta,\delta)$,
but such cases rarely occur unless the space of decision procedures
is restricted. Best procedures can only be obtained by restricting
rather artificially the set of authorized procedures.
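
To make the third point concrete, and to connect it with the role of the prior in the Bayes risk, here is a minimal sketch. Everything in it is an illustrative assumption rather than part of the excerpt: a Binomial$(n,\theta)$ model with $n=10$, squared error loss, the two estimators `mle` and `shrunk`, and the two weight functions standing in for priors. It computes both risk functions exactly on a grid, shows that they cross, and then averages them against each prior.

```python
from math import comb

def risk(theta, delta, n):
    """Exact frequentist risk under squared error for x ~ Binomial(n, theta)."""
    return sum(
        comb(n, x) * theta**x * (1 - theta)**(n - x) * (delta(x, n) - theta)**2
        for x in range(n + 1)
    )

def mle(x, n):
    # Hypothetical estimator 1: the sample proportion.
    return x / n

def shrunk(x, n):
    # Hypothetical estimator 2: proportion shrunk towards 1/2.
    return (x + 1) / (n + 2)

def bayes_risk(risks, weights):
    """Prior-weighted average of a risk curve evaluated on the grid."""
    return sum(r * w for r, w in zip(risks, weights)) / sum(weights)

n = 10
grid = [i / 100 for i in range(101)]
risk_mle = [risk(t, mle, n) for t in grid]
risk_shrunk = [risk(t, shrunk, n) for t in grid]

# The risk functions cross: each estimator is better on part of the grid.
mle_wins = [t for t, a, b in zip(grid, risk_mle, risk_shrunk) if a < b]
shrunk_wins = [t for t, a, b in zip(grid, risk_mle, risk_shrunk) if b < a]

# Two assumed priors, expressed as weights on the grid.
flat = [1.0 for _ in grid]                             # uniform on [0, 1]
near_zero = [1.0 if t <= 0.1 else 0.0 for t in grid]   # mass only on [0, 0.1]

for name, weights in (("flat", flat), ("near_zero", near_zero)):
    print(name, bayes_risk(risk_mle, weights), bayes_risk(risk_shrunk, weights))
```

Under the flat weights the average is essentially the area under each risk curve, and `shrunk` comes out ahead; under the weights concentrated near zero the ranking flips in favour of `mle`. In this sketch the prior acts as the weighting that turns two incomparable curves into two comparable numbers.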

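Example 2.4 – Consider $x_1$ and $x_2$, two observations from

$$P_\theta(x=\theta-1) = P_\theta(x=\theta+1) = 0.5,\qquad \theta\in\mathbb{R}.$$

The parameter of interest is $\theta$ (i.e., $\mathfrak{D} = \Theta$) and
it is estimated by estimators $\delta$ under the loss

$$\text{L}(\theta,\delta) = 1 - \mathbb{I}_\theta(\delta),$$

often called $0-1$ loss, which penalizes errors of estimation,
whatever their magnitude, by $1$. Considering the particular estimator

$$\delta_0(x_1,x_2) = \frac{x_1+x_2}{2},$$

its risk function is

$$R(\theta,\delta_0) = 1 - P_\theta(\delta_0(x_1,x_2)=\theta) = 1 - P_\theta(x_1\ne x_2) = 0.5.$$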

This computation shows that the estimator $\delta_0$ is correct half
of the time. Actually, this estimator is always correct when
$x_1\ne x_2$, and always wrong otherwise. Now, the estimator
$\delta_1(x_1,x_2) = x_1+1$ also has a risk function equal to $0.5$,
as does $\delta_2(x_1,x_2) = x_2-1$. Therefore, $\delta_0$, $\delta_1$
and $\delta_2$ cannot be ranked under the $0-1$ loss. $\blacktriangleright$
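
The tie between the three estimators can be checked by brute-force enumeration. The sketch below rests on two assumptions that are consistent with the example but should be read as such: each observation equals $\theta-1$ or $\theta+1$ with probability $1/2$, independently, and $\delta_0$ is the average $(x_1+x_2)/2$. The helper name `zero_one_risk` is hypothetical.

```python
from itertools import product

def zero_one_risk(delta, theta):
    """Exact risk under 0-1 loss: probability that delta(x1, x2) != theta.

    Assumes each x_i is theta - 1 or theta + 1 with probability 1/2,
    so the four outcomes (x1, x2) are equally likely.
    """
    outcomes = list(product((theta - 1, theta + 1), repeat=2))
    errors = sum(1 for x1, x2 in outcomes if delta(x1, x2) != theta)
    return errors / len(outcomes)

def delta0(x1, x2):
    return (x1 + x2) / 2   # assumed form of delta_0: the midpoint

def delta1(x1, x2):
    return x1 + 1

def delta2(x1, x2):
    return x2 - 1

# Integer values of theta keep the equality checks exact.
for theta in (-2, 0, 3):
    print(theta, [zero_one_risk(d, theta) for d in (delta0, delta1, delta2)])
```

For every value of $\theta$ tried, the three risks coincide, which is why the frequentist risk alone cannot rank these estimators.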

On the contrary, the Bayesian approach to Decision Theory integrates
on the space $\Theta$ since $\theta$ is unknown, instead of
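integrating on the space ${\mathcal X}$ as $x$ is known. It relies on
the posterior expected loss

$$\rho(\pi,d|x) = \mathbb{E}^\pi[\text{L}(\theta,d)|x] = \int_\Theta \text{L}(\theta,d)\,\pi(\theta|x)\,\text{d}\theta,$$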

which averages the error (i.e., the loss) according to the posterior
distribution of the parameter $\theta$, conditionally on the observed
value $x$. Given $x$, the average error resulting from decision $d$ is actually $\rho(\pi,d|x)$. The posterior expected loss is thus a function of $x$, but this dependence is not troublesome, as opposed to the frequentist
dependence of the risk on the parameter, because $x$, contrary to $\theta$, is known.
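
As a rough illustration of how the posterior expected loss depends on the observed data, the sketch below revisits the setting of Example 2.4 under assumptions that are not in the excerpt: each observation is $\theta-1$ or $\theta+1$ with probability $1/2$, the loss is the $0-1$ loss, and the prior is a hypothetical uniform distribution on the integers from $-5$ to $5$. The function names are likewise illustrative.

```python
def posterior(prior, x1, x2):
    """Posterior over theta given (x1, x2), each being theta - 1 or theta + 1.

    `prior` maps candidate values of theta to prior probabilities.
    Assumes the data are compatible with at least one value in the prior.
    """
    def likelihood(theta):
        compatible = all(abs(x - theta) == 1 for x in (x1, x2))
        return 0.25 if compatible else 0.0

    weights = {t: p * likelihood(t) for t, p in prior.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items() if w > 0}

def posterior_expected_loss(post, d):
    """rho(pi, d | x) under 0-1 loss: posterior probability that theta != d."""
    return sum(p for t, p in post.items() if t != d)

# Hypothetical prior: uniform on the integers -5, ..., 5.
prior = {t: 1 / 11 for t in range(-5, 6)}

post_diff = posterior(prior, 1, 3)   # x1 != x2: theta is pinned down
post_same = posterior(prior, 2, 2)   # x1 == x2: two candidates remain

print(post_diff, posterior_expected_loss(post_diff, 2))
print(post_same, [posterior_expected_loss(post_same, d) for d in (1, 2, 3)])
```

When $x_1\ne x_2$ the posterior concentrates on the midpoint and the decision $d=2$ carries no expected loss, whereas for $x_1=x_2$ the posterior splits between two values and no decision does better than $0.5$. The evaluation is attached to the data actually observed rather than to an average over hypothetical repetitions.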