Understanding the Bayes risk

When evaluating an estimator, the two most commonly used criteria are probably the maximum risk and the Bayes risk. My question refers to the latter:

The Bayes risk under the prior π is defined as follows:

$$B_\pi(\hat\theta)=\int R(\theta,\hat\theta)\,\pi(\theta)\,d\theta$$

I don’t quite get what the prior π is doing, or how I should interpret it. If I have a risk function $R(\theta,\hat\theta)$ and plot it, intuitively I would take the area under it as a criterion for how “strong” the risk is over all possible values of θ. Involving the prior somehow destroys this intuition, although the two ideas are close. Can someone help me interpret the prior?
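
To make my “area” intuition concrete, here is a small numerical sketch (the numbers and the estimator are made up by me): a shrinkage estimator whose risk curve under squared-error loss grows like θ², so the plain, unweighted area under the curve is infinite, while the prior-weighted area, i.e. the Bayes risk, is finite:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Toy setup (my own numbers): theta_hat = c * xbar from n draws of N(theta, 1).
# Under squared-error loss its risk is
#   R(theta, theta_hat) = c^2/n + (1 - c)^2 * theta^2,
# which grows with theta^2, so the unweighted area under the risk curve
# diverges; the prior is what makes the average well defined.
n, c, tau = 10, 0.8, 1.0
risk = lambda theta: c**2 / n + (1 - c)**2 * theta**2

# Bayes risk = prior-weighted area under the risk curve, with prior N(0, tau^2)
bayes_risk, _ = quad(lambda t: risk(t) * norm.pdf(t, scale=tau), -np.inf, np.inf)

print(bayes_risk)                      # ≈ 0.104
print(c**2 / n + (1 - c)**2 * tau**2)  # closed form agrees
```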

Answer

[Here is an excerpt from my own textbook, The Bayesian Choice (2007), that argues in favour of a decision-theoretic approach to Bayesian analysis, hence of using the Bayes risk.]

Except for the most trivial settings, it is generally
impossible to uniformly minimize (in d) the loss
function L(θ,d) when θ is unknown.
In order to derive an effective comparison criterion from
the loss function, the frequentist approach proposes to consider instead
the average loss (or frequentist risk)
$$R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(x))]=\int_{\mathcal{X}} L(\theta,\delta(x))\, f(x|\theta)\, dx,$$
where δ(x) is the decision rule, i.e., the allocation of
a decision to each outcome x ∼ f(x|θ) from the random
experiment.

The function δ, from X to D, is usually called an estimator (while the value δ(x) is called an estimate of θ). When there is no risk of confusion, we also denote the set of estimators by D.
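
To illustrate the long-run reading of R(θ,δ), here is a small Monte Carlo sketch (the concrete model is my own choice, not the excerpt’s): fix θ, repeat the experiment many times, and average the losses. With δ the sample mean of n = 10 draws from N(θ,1) and squared-error loss, the exact risk is 1/n for every θ:

```python
import numpy as np

rng = np.random.default_rng(0)

def frequentist_risk(theta, n_obs=10, n_reps=100_000):
    """Approximate R(theta, delta) for delta = sample mean under
    squared-error loss by averaging over repeated experiments."""
    x = rng.normal(loc=theta, scale=1.0, size=(n_reps, n_obs))
    delta = x.mean(axis=1)                 # the decision rule, applied per repetition
    return np.mean((delta - theta) ** 2)   # long-run average loss

for theta in (-2.0, 0.0, 3.0):
    print(theta, frequentist_risk(theta))  # each ≈ 1/10, the exact risk
```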

The frequentist paradigm relies on this criterion
to compare estimators and, if possible, to select the best estimator,
the reasoning being that estimators are evaluated on their
long-run performance for all possible values of the parameter θ.
Notice, however, that there are several difficulties associated with this approach.

  1. The error (loss) is averaged over the different values of x
    proportionally to the density f(x|θ). Therefore, it seems
    that the observation x is not taken into account any further. The
    risk criterion evaluates procedures on their long-run performance
    and not directly for the given observation, x. Such an evaluation
    may be satisfactory for the statistician, but it is not so appealing
    for a client, who wants optimal results for her data x, not
    another’s!
  2. The frequentist analysis of the decision problem implicitly assumes
    that this problem will be met again and again, for the frequency
    evaluation to make sense. Indeed, R(θ,δ) is
    approximately the average loss over i.i.d. repetitions of the same
    experiment, according to the Law of Large Numbers. However, on both
    philosophical and practical grounds, there is a lot of controversy
    over the very notion of repeatability of experiments (see Jeffreys
    (1961)). For one thing, if new observations come to the
    statistician, she should make use of them, and this could modify the
    way the experiment is conducted, as in, for instance, medical
    trials.
  3. For a procedure δ, the risk R(θ,δ) is a function
    of the parameter θ. Therefore, the frequentist approach does
    not induce a total ordering on the set of procedures. It is
    generally impossible to compare decision procedures with this
    criterion, since two crossing risk functions prevent comparison
    between the corresponding estimators; the sketch after this list
    shows such a crossing. At best, one may hope for a
    procedure δ₀ that uniformly minimizes R(θ,δ),
    but such cases rarely occur unless the space of decision procedures
    is restricted. Best procedures can only be obtained by
    restricting, rather artificially, the set of authorized procedures.
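
To see point 3 in action, here is a minimal numerical sketch (the example is my own choosing, not the book’s): with a single observation x ∼ N(θ,1) and squared-error loss, δ₁(x) = x has constant risk 1 while δ₂(x) = 0 has risk θ². The two risk functions cross at |θ| = 1, so neither estimator uniformly dominates the other:

```python
import numpy as np

# Assumed toy comparison: one observation x ~ N(theta, 1), squared-error loss.
#   delta_1(x) = x  has risk R(theta, delta_1) = 1        (constant in theta)
#   delta_2(x) = 0  has risk R(theta, delta_2) = theta^2  (unbounded in theta)
thetas = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
risk_1 = np.ones_like(thetas)
risk_2 = thetas ** 2

for t, r1, r2 in zip(thetas, risk_1, risk_2):
    better = "delta_1" if r1 < r2 else "delta_2"
    print(f"theta={t:+.1f}  R(delta_1)={r1:.2f}  R(delta_2)={r2:.2f}  better: {better}")
# delta_2 wins for |theta| < 1, delta_1 wins for |theta| > 1: the risk
# functions cross, so the frequentist criterion cannot rank them.
```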

Example 2.4 – Consider x₁ and x₂, two observations from
$$P_\theta(x=\theta-1)=P_\theta(x=\theta+1)=0.5,\qquad \theta\in\mathbb{R}.$$
The parameter of interest is θ (i.e., D=Θ) and
it is estimated by estimators δ under the loss
$$L(\theta,\delta)=1-\mathbb{I}_{\theta}(\delta),$$
often called 0-1 loss, which penalizes errors of estimation,
whatever their magnitude, by 1. Considering the particular estimator
$$\delta_0(x_1,x_2)=\frac{x_1+x_2}{2},$$
its risk function is
$$R(\theta,\delta_0)=1-P_\theta(\delta_0(x_1,x_2)=\theta)=1-P_\theta(x_1\neq x_2)=0.5.$$
This computation shows that the estimator δ₀ is correct half
of the time. Actually, this estimator is always correct when
x₁ ≠ x₂, and always wrong otherwise. Now, the estimator
δ₁(x₁,x₂) = x₁ + 1 also has a risk function equal to 0.5,
as does δ₂(x₁,x₂) = x₂ − 1. Therefore, δ₀, δ₁,
and δ₂ cannot be ranked under the 0-1 loss.
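
A quick simulation (my own check, not part of the excerpt) confirms that the three estimators of Example 2.4 all have risk 0.5 under the 0-1 loss; θ is set to 4.0 here only so that the floating-point equality tests below are exact:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n_reps = 4.0, 200_000  # arbitrary true value (exact in floating point)

# x1, x2 i.i.d., each equal to theta - 1 or theta + 1 with probability 1/2
x1 = theta + rng.choice([-1.0, 1.0], size=n_reps)
x2 = theta + rng.choice([-1.0, 1.0], size=n_reps)

delta_0 = (x1 + x2) / 2   # correct exactly when x1 != x2
delta_1 = x1 + 1          # correct exactly when x1 = theta - 1
delta_2 = x2 - 1          # correct exactly when x2 = theta + 1

# 0-1 loss: the risk is the probability of missing theta; all three ≈ 0.5
for name, d in (("delta_0", delta_0), ("delta_1", delta_1), ("delta_2", delta_2)):
    print(name, np.mean(d != theta))
```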

On the contrary, the Bayesian approach to Decision Theory integrates
over the space Θ, since θ is unknown, instead of
integrating over the space X, as x is known. It relies on
the posterior expected loss
\begin{eqnarray*}
\rho(\pi,d|x) & = & \mathbb{E}^\pi[L(\theta,d)|x] \\
& = & \int_{\Theta} L(\theta,d)\, \pi(\theta|x)\, d\theta,
\end{eqnarray*}
which averages the error (i.e., the loss) according to the posterior
distribution of the parameter θ, conditionally on the observed
value x. Given x, the average error resulting from decision d is
actually ρ(π,d|x). The posterior expected loss is thus a function of
x, but this dependence is not troublesome, as opposed to the
frequentist dependence of the risk on the parameter, because x,
contrary to θ, is known.
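
To make the contrast concrete, here is a minimal sketch under an assumed conjugate setup (a normal prior and normal likelihood of my own choosing, not part of the excerpt): it computes ρ(π,d|x) by integrating the loss against the posterior, and shows that ρ depends on the decision d and the known x, with the posterior mean minimizing it under squared-error loss:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Assumed conjugate setup (my numbers):
# prior  theta ~ N(0, tau^2),  observation  x | theta ~ N(theta, sigma^2).
tau2, sigma2 = 4.0, 1.0

def posterior(x):
    """Posterior of theta given x is N(m, v) (standard conjugate result)."""
    v = 1.0 / (1.0 / tau2 + 1.0 / sigma2)
    m = v * (x / sigma2)          # prior mean is 0, so its term drops out
    return m, v

def rho(d, x):
    """Posterior expected loss rho(pi, d | x) under squared-error loss,
    computed by numerical integration over Theta."""
    m, v = posterior(x)
    integrand = lambda t: (t - d) ** 2 * norm.pdf(t, loc=m, scale=np.sqrt(v))
    return quad(integrand, -np.inf, np.inf)[0]

x = 2.5                            # rho depends on the observed x, not on theta
m, v = posterior(x)
print(rho(m, x), v)                # the posterior mean minimizes rho; minimum = v
print(rho(0.0, x), m**2 + v)       # any other decision d pays (d - m)^2 extra
```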

Attribution
Source: Link, Question Author: Peter Series, Answer Author: Xi’an
