When evaluating an estimator, the two most commonly used criteria are probably the maximum risk and the Bayes risk. My question refers to the latter:

The Bayes risk under the prior $\pi$ is defined as follows:

$$B_\pi(\hat\theta) = \int R(\theta, \hat\theta)\, \pi(\theta)\, d\theta$$

I don’t quite get what the prior $\pi$ is doing or how I should interpret it. If I have a risk function $R(\theta, \hat\theta)$ and plot it, intuitively I would take the area under it as a criterion for how “severe” the risk is over all possible values of $\theta$. But involving the prior somehow destroys this intuition, even though it comes close. Can someone help me interpret the prior?
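One way to make the “area” intuition concrete: the Bayes risk is still an area under the risk curve, but an area weighted by $\pi$, so values of $\theta$ the prior deems plausible count more. Here is a minimal numeric sketch (my own illustration, not part of the original question or answer), assuming a single observation $x \sim N(\theta, 1)$, the shrinkage estimator $\delta_c(x) = c\,x$ under squared-error loss, and a $N(0, \tau^2)$ prior; all names and parameter choices are hypothetical.

```python
import math

def risk(theta, c, sigma2=1.0):
    # Frequentist risk of delta_c(x) = c*x under squared-error loss:
    # R(theta, delta_c) = c^2 * sigma^2 + (1 - c)^2 * theta^2  (variance + bias^2)
    return c**2 * sigma2 + (1 - c)**2 * theta**2

def normal_pdf(theta, tau2):
    # Density of the N(0, tau^2) prior
    return math.exp(-theta**2 / (2 * tau2)) / math.sqrt(2 * math.pi * tau2)

def bayes_risk(c, tau2=1.0, lo=-10.0, hi=10.0, n=20001):
    # Numerically integrate R(theta, delta_c) * pi(theta) d(theta)
    # with the composite trapezoidal rule: the prior-weighted "area"
    # under the risk curve.
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        theta = lo + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * risk(theta, c) * normal_pdf(theta, tau2)
    return total * h

# Closed form for comparison: c^2 * sigma^2 + (1 - c)^2 * tau^2
print(bayes_risk(0.5))             # numeric, approx. 0.5
print(0.5**2 * 1 + 0.5**2 * 1)     # exact: 0.5
```

Without the prior weight, the unweighted area $\int R(\theta, \delta_c)\, d\theta$ would be infinite here (the risk grows like $\theta^2$); the prior is what makes “average performance over $\theta$” a finite, well-defined number.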

**Answer**

*[Here is an excerpt from my own textbook, The Bayesian Choice (2007), that argues in favour of a decision-theoretic approach to Bayesian analysis, hence of using the Bayes risk.]*

Except for the most trivial settings, it is generally impossible to uniformly minimize (in $d$) the loss function $L(\theta, d)$ when $\theta$ is unknown. In order to derive an effective comparison criterion from the loss function, the *frequentist* approach proposes to consider instead the average loss (or *frequentist risk*)

$$R(\theta, \delta) = \mathbb{E}_\theta[L(\theta, \delta(x))] = \int_{\mathcal{X}} L(\theta, \delta(x))\, f(x|\theta)\, dx,$$

where $\delta(x)$ is the decision rule, i.e., the allocation of a decision to each outcome $x \sim f(x|\theta)$ from the random experiment.

The function $\delta$, from $\mathcal{X}$ to $\mathcal{D}$, is usually called an *estimator* (while the value $\delta(x)$ is called an *estimate* of $\theta$). When there is no risk of confusion, we also denote the set of estimators by $\mathcal{D}$.

The *frequentist paradigm* relies on this criterion to compare estimators and, if possible, to select the best estimator, the reasoning being that estimators are evaluated on their long-run performance for all possible values of the parameter $\theta$.

Notice, however, that there are several difficulties associated with this approach.

- The error (loss) is averaged over the different values of $x$ proportionally to the density $f(x|\theta)$. Therefore, it seems that the observation $x$ is not taken into account any further. The risk criterion evaluates procedures on their long-run performance and not directly for the given observation, $x$. Such an evaluation may be satisfactory for the statistician, but it is not so appealing for a client, who wants optimal results for her data $x$, not those of another!
- The frequentist analysis of the decision problem implicitly assumes that this problem will be met again and again, for the frequency evaluation to make sense. Indeed, $R(\theta, \delta)$ is approximately the average loss over i.i.d. repetitions of the same experiment, according to the Law of Large Numbers. However, on both philosophical and practical grounds, there is a lot of controversy over the very notion of repeatability of experiments (see Jeffreys (1961)). For one thing, if new observations come to the statistician, she should make use of them, and this could modify the way the experiment is conducted, as in, for instance, medical trials.
- For a procedure $\delta$, the risk $R(\theta, \delta)$ is a function of the parameter $\theta$. Therefore, the frequentist approach does not induce a total ordering on the set of procedures. It is generally impossible to compare decision procedures with this criterion, since two crossing risk functions prevent comparison between the corresponding estimators. At best, one may hope for a procedure $\delta_0$ that uniformly minimizes $R(\theta, \delta)$, but such cases rarely occur unless the space of decision procedures is restricted. Best procedures can only be obtained by restricting rather artificially the set of authorized procedures.

**Example 2.4 –** Consider $x_1$ and $x_2$, two observations from

$$P_\theta(x = \theta - 1) = P_\theta(x = \theta + 1) = 0.5, \qquad \theta \in \mathbb{R}.$$

The parameter of interest is $\theta$ (i.e., $\mathcal{D} = \Theta$) and it is estimated by estimators $\delta$ under the loss

$$L(\theta, \delta) = 1 - \mathbb{I}_{\theta}(\delta),$$

often called the *0−1 loss*, which penalizes errors of estimation, whatever their magnitude, by 1. Considering the particular estimator

$$\delta_0(x_1, x_2) = \frac{x_1 + x_2}{2},$$

its risk function is

$$R(\theta, \delta_0) = 1 - P_\theta(\delta_0(x_1, x_2) = \theta) = 1 - P_\theta(x_1 \neq x_2) = 0.5.$$

This computation shows that the estimator $\delta_0$ is correct half of the time. Actually, this estimator is always correct when $x_1 \neq x_2$, and always wrong otherwise. Now, the estimator $\delta_1(x_1, x_2) = x_1 + 1$ also has a risk function equal to $0.5$, as does $\delta_2(x_1, x_2) = x_2 - 1$. Therefore, $\delta_0$, $\delta_1$ and $\delta_2$ cannot be ranked under the 0−1 loss. ▸
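Example 2.4 can be checked by simulation. The following sketch (my own illustration, not from the textbook) estimates the frequentist risk of the three estimators by Monte Carlo, at an arbitrary value $\theta = 3$:

```python
import random

def sample(theta):
    # One draw: x = theta - 1 or theta + 1, each with probability 0.5
    return theta + random.choice((-1, 1))

def risk_01(delta, theta, n=100_000):
    # Monte Carlo estimate of R(theta, delta) = P_theta(delta(x1, x2) != theta),
    # the frequentist risk under 0-1 loss
    errors = 0
    for _ in range(n):
        x1, x2 = sample(theta), sample(theta)
        if delta(x1, x2) != theta:
            errors += 1
    return errors / n

def delta0(x1, x2): return (x1 + x2) / 2
def delta1(x1, x2): return x1 + 1
def delta2(x1, x2): return x2 - 1

random.seed(0)
for d in (delta0, delta1, delta2):
    print(risk_01(d, theta=3.0))   # each estimate close to 0.5
```

All three estimates come out near 0.5, so the risk criterion indeed cannot separate $\delta_0$, $\delta_1$ and $\delta_2$, even though $\delta_0$ is clearly the sensible choice once the data are in hand.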

On the contrary, the Bayesian approach to Decision Theory integrates on the space $\Theta$, since $\theta$ is unknown, instead of integrating on the space $\mathcal{X}$, as $x$ is known. It relies on the *posterior expected loss*

$$\rho(\pi, d|x) = \mathbb{E}^\pi[L(\theta, d)|x] = \int_{\Theta} L(\theta, d)\, \pi(\theta|x)\, d\theta,$$

which averages the error (i.e., the loss) according to the posterior distribution of the parameter $\theta$, conditionally on the observed value $x$. Given $x$, the average error resulting from decision $d$ is actually $\rho(\pi, d|x)$. The posterior expected loss is thus a function of $x$, but this dependence is not troublesome, as opposed to the frequentist dependence of the risk on the parameter, because $x$, contrary to $\theta$, is known.

**Attribution**
*Source: Link, Question Author: Peter Series, Answer Author: Xi’an*