From a Bayesian probability perspective, why doesn’t a 95% confidence interval contain the true parameter with 95% probability?

From the Wikipedia page on confidence intervals:

…if confidence intervals are constructed across many separate data analyses of repeated (and possibly different) experiments, the proportion of such intervals that contain the true value of the parameter will match the confidence level…

And from the same page:

A confidence interval does not predict that the true value of the parameter has a particular probability of being in the confidence interval given the data actually obtained.

If I understood it right, this last statement is made with the frequentist interpretation of probability in mind. However, from a Bayesian probability perspective, why doesn’t a 95% confidence interval contain the true parameter with 95% probability? And if it doesn’t, what is wrong with the following reasoning?

If I have a process that I know produces a correct answer 95% of the time then the probability of the next answer being correct is 0.95 (given that I don’t have any extra information regarding the process). Similarly if someone shows me a confidence interval that is created by a process that will contain the true parameter 95% of the time, should I not be right in saying that it contains the true parameter with 0.95 probability, given what I know?

This question is similar to, but not the same as, Why does a 95% CI not imply a 95% chance of containing the mean? The answers to that question have been focusing on why a 95% CI does not imply a 95% chance of containing the mean from a frequentist perspective. My question is the same, but from a Bayesian probability perspective.


Update: With the benefit of a few years’ hindsight, I’ve penned a more concise treatment of essentially the same material in response to a similar question.

How to Construct a Confidence Region

Let us begin with a general method for constructing confidence regions. It can be applied to a single parameter, to yield a confidence interval or set of intervals; and it can be applied to two or more parameters, to yield higher dimensional confidence regions.

We assert that the observed statistics $D$ originate from a distribution with parameters $\theta$, namely the sampling distribution $s(d | \theta)$ over possible statistics $d$, and seek a confidence region for $\theta$ in the set of possible values $\Theta$. Define a Highest Density Region (HDR): the $h$-HDR of a PDF is the smallest subset of its domain that supports probability $h$. Denote the $h$-HDR of $s(d | \psi)$ as $H_\psi$, for any $\psi \in \Theta$. Then, the $h$ confidence region for $\theta$, given data $D$, is the set $C_D = \{ \phi : D \in H_\phi \}$. A typical value of $h$ would be 0.95.
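The construction can be sketched numerically. The snippet below is only an illustration under assumptions that are not part of the general method: the sampling distribution is taken to be normal with known standard deviation, so that the $h$-HDR of $s(d | \psi)$ is a symmetric interval around $\psi$, and the candidate values of the parameter are restricted to a finite grid. The function names and the numbers are hypothetical.

```python
from statistics import NormalDist

def in_hdr(d, psi, sigma, n, h=0.95):
    """True if statistic d lies in the h-HDR of s(d | psi).

    Assumption: s(d | psi) is Normal(psi, sigma / sqrt(n)), whose h-HDR is
    the symmetric interval psi +/- z * sigma / sqrt(n).
    """
    se = sigma / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + h / 2)  # HDR half-width in standard errors
    return abs(d - psi) <= z * se

def confidence_region(D, sigma, n, grid, h=0.95):
    """C_D = {phi : D in H_phi}, approximated on a finite grid of candidates."""
    return [phi for phi in grid if in_hdr(D, phi, sigma, n, h)]

# Hypothetical data: sample mean 10.0 from n = 25 measurements, sigma = 2.
grid = [i / 1000 for i in range(8000, 12001)]
region = confidence_region(D=10.0, sigma=2.0, n=25, grid=grid)
lo, hi = min(region), max(region)
print(f"approximate 95% confidence interval: [{lo:.3f}, {hi:.3f}]")
```

For this symmetric, unimodal case the region collapses to the familiar interval around the sample mean; for a skewed or multimodal sampling distribution the same inversion could yield an asymmetric interval or a union of intervals.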

A Frequentist Interpretation

From the preceding definition of a confidence region follows
$$
d \in H_\psi \longleftrightarrow \psi \in C_d
$$
with $C_d = \{ \phi : d \in H_\phi \}$. Now imagine a large set of (imaginary) observations $\{D_i\}$, taken under similar circumstances to $D$; i.e. they are samples from $s(d | \theta)$. Since $H_\theta$ supports probability mass $h$ of the PDF $s(d | \theta)$, $P(D_i \in H_\theta) = h$ for all $i$. Therefore, the fraction of $\{D_i\}$ for which $D_i \in H_\theta$ is $h$. And so, using the equivalence above, the fraction of $\{D_i\}$ for which $\theta \in C_{D_i}$ is also $h$.

This, then, is what the frequentist claim for the h confidence region for θ amounts to:

Take a large number of imaginary observations $\{D_i\}$ from the sampling distribution $s(d | \theta)$ that gave rise to the observed statistics $D$. Then, $\theta$ lies within a fraction $h$ of the analogous but imaginary confidence regions $\{C_{D_i}\}$.
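This frequentist claim can be checked by simulation. The sketch below assumes, purely for illustration, a normal sampling distribution with known standard deviation and a fixed true parameter value; the function name, the seed, and the numbers are hypothetical.

```python
import random
from statistics import NormalDist

def coverage(theta, sigma, n, h=0.95, trials=20000, seed=0):
    """Fraction of imaginary confidence regions C_{D_i} that contain theta.

    Assumption: each statistic D_i is drawn from Normal(theta, sigma / sqrt(n)).
    """
    rng = random.Random(seed)
    se = sigma / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + h / 2)
    hits = 0
    for _ in range(trials):
        D_i = rng.gauss(theta, se)      # one imaginary statistic from s(d | theta)
        if abs(D_i - theta) <= z * se:  # D_i in H_theta  <=>  theta in C_{D_i}
            hits += 1
    return hits / trials

frac = coverage(theta=3.0, sigma=2.0, n=25)
print(f"fraction of regions containing theta: {frac:.3f}")
```

The fraction should land close to $h$ whatever value of `theta` is chosen, which is all the frequentist statement asserts: it is a property of the procedure over repeated draws, not of any single realised region.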

The confidence region $C_D$ therefore does not make any claim about the probability that $\theta$ lies somewhere! The reason is simply that there is nothing in the formulation that allows us to speak of a probability distribution over $\theta$. The interpretation is just elaborate superstructure, which does not improve the base. The base is only $s(d | \theta)$ and $D$, where $\theta$ does not appear as a distributed quantity, and there is no information we can use to address that. There are basically two ways to get a distribution over $\theta$:

  1. Assign a distribution directly from the information at hand: p(θ|I).
  2. Relate $\theta$ to another distributed quantity: $p(\theta | I) = \int p(\theta x | I) \, dx = \int p(\theta | x I) \, p(x | I) \, dx$.

In both cases, θ must appear on the left somewhere. Frequentists cannot use either method, because they both require a heretical prior.

A Bayesian View

The most a Bayesian can make of the $h$ confidence region $C_D$, given without qualification, is simply the direct interpretation: that it is the set of $\phi$ for which $D$ falls in the $h$-HDR $H_\phi$ of the sampling distribution $s(d | \phi)$. It does not necessarily tell us much about $\theta$, and here’s why.

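The probability that $\theta \in C_D$, given $D$ and the background information $I$, is:
$$
P(\theta \in C_D | DI) = \int_{C_D} p(\theta | DI) \, d\theta = \int_{C_D} \frac{p(D | \theta I) \, p(\theta | I)}{p(D | I)} \, d\theta
$$
Notice that, unlike the frequentist interpretation, we have immediately demanded a distribution over $\theta$. The background information $I$ tells us, as before, that the sampling distribution is $s(d | \theta)$:
$$
P(\theta \in C_D | DI) = \frac{\int_{C_D} s(D | \theta) \, p(\theta | I) \, d\theta}{\int s(D | \theta) \, p(\theta | I) \, d\theta}
$$
Now this expression does not in general evaluate to $h$, which is to say, the $h$ confidence region $C_D$ does not always contain $\theta$ with probability $h$. In fact it can be starkly different from $h$. There are, however, many common situations in which it does evaluate to $h$, which is why confidence regions are often consistent with our probabilistic intuitions.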
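To see how far the posterior probability can drift from $h$, here is a small sketch under assumed conditions that are not taken from the text: a normal sampling distribution with known standard error `se`, and a normal prior on the parameter, so that the posterior is available in closed form by the standard conjugate update. All names and numbers are hypothetical.

```python
from statistics import NormalDist

def posterior_prob_of_ci(D, se, prior_mean, prior_sd, h=0.95):
    """P(theta in C_D | D, I) for a normal likelihood and a normal prior.

    Assumptions: D | theta ~ Normal(theta, se), theta ~ Normal(prior_mean, prior_sd).
    """
    z = NormalDist().inv_cdf(0.5 + h / 2)
    lo, hi = D - z * se, D + z * se  # the usual h confidence interval C_D
    # Conjugate normal-normal update: precisions add, means are precision-weighted.
    post_var = 1.0 / (1.0 / prior_sd ** 2 + 1.0 / se ** 2)
    post_mean = post_var * (prior_mean / prior_sd ** 2 + D / se ** 2)
    post = NormalDist(post_mean, post_var ** 0.5)
    return post.cdf(hi) - post.cdf(lo)

# An informative prior centred away from the observed statistic.
p_conflict = posterior_prob_of_ci(D=3.0, se=1.0, prior_mean=0.0, prior_sd=1.0)
# A very diffuse prior, approximating "no prior knowledge".
p_vague = posterior_prob_of_ci(D=3.0, se=1.0, prior_mean=0.0, prior_sd=1000.0)
print(f"informative prior: {p_conflict:.3f}   diffuse prior: {p_vague:.3f}")
```

With these made-up numbers the informative prior gives a posterior probability of about 0.74 for the nominal 95% interval, while the diffuse prior recovers a value very close to 0.95.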

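For example, suppose that the prior joint PDF of $d$ and $\theta$ is symmetric in that $p_{d,\theta}(d, \theta | I) = p_{d,\theta}(\theta, d | I)$. (Clearly this involves an assumption that the PDF ranges over the same domain in $d$ and $\theta$.) Then, if the prior is $p(\theta | I) = f(\theta)$, we have $s(D | \theta) p(\theta | I) = s(D | \theta) f(\theta) = s(\theta | D) f(D)$. Hence
$$
P(\theta \in C_D | DI) = \frac{\int_{C_D} s(\theta | D) \, d\theta}{\int s(\theta | D) \, d\theta} = \int_{C_D} s(\theta | D) \, d\theta
$$
From the definition of an HDR we know that for any $\psi \in \Theta$
$$
\int_{H_\psi} s(d | \psi) \, dd = h \quad \text{and therefore that} \quad \int_{H_D} s(d | D) \, dd = h \quad \text{or equivalently} \quad \int_{H_D} s(\theta | D) \, d\theta = h
$$
Therefore, given that $s(d | \theta) f(\theta) = s(\theta | d) f(d)$, $C_D = H_D$ implies $P(\theta \in C_D | DI) = h$. The antecedent satisfies
$$
C_D = H_D \longleftrightarrow \forall \psi \; [ \psi \in C_D \leftrightarrow \psi \in H_D ]
$$
Applying the equivalence near the top:
$$
C_D = H_D \longleftrightarrow \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]
$$
Thus, the confidence region $C_D$ contains $\theta$ with probability $h$ if for all possible values $\psi$ of $\theta$, the $h$-HDR of $s(d | \psi)$ contains $D$ if and only if the $h$-HDR of $s(d | D)$ contains $\psi$.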

Now the symmetric relation $D \in H_\psi \leftrightarrow \psi \in H_D$ is satisfied for all $\psi$ when $s(\psi + \delta | \psi) = s(D - \delta | D)$ for all $\delta$ that span the support of $s(d | D)$ and $s(d | \psi)$. We can therefore form the following argument:

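  1. $s(d | \theta) f(\theta) = s(\theta | d) f(d)$ (premise)
  2. $\forall \psi \; \forall \delta \; [ s(\psi + \delta | \psi) = s(D - \delta | D) ]$ (premise)
  3. $\forall \psi \; \forall \delta \; [ s(\psi + \delta | \psi) = s(D - \delta | D) ] \longrightarrow \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]$
  4. $\therefore \quad \forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ]$
  5. $\forall \psi \; [ D \in H_\psi \leftrightarrow \psi \in H_D ] \longrightarrow C_D = H_D$
  6. $\therefore \quad C_D = H_D$
  7. $[s(d | \theta) f(\theta) = s(\theta | d) f(d) \wedge C_D = H_D] \longrightarrow P(\theta \in C_D | DI) = h$
  8. $\therefore \quad P(\theta \in C_D | DI) = h$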

Let’s apply the argument to a confidence interval on the mean of a 1-D normal distribution $(\mu, \sigma)$, given a sample mean $\bar{x}$ from $n$ measurements. We have $\theta = \mu$ and $d = \bar{x}$, so that the sampling distribution is

$$
s(d | \theta) = \frac{\sqrt{n}}{\sigma \sqrt{2 \pi}} e^{-\frac{n}{2 \sigma^2} { \left( d - \theta \right) }^2 }
$$

Suppose also that we know nothing about $\theta$ before taking the data (except that it’s a location parameter) and therefore assign a uniform prior: $f(\theta) = k$. Clearly we now have $s(d | \theta) f(\theta) = s(\theta | d) f(d)$, so the first premise is satisfied. Let $s(d | \theta) = g\left( (d - \theta)^2 \right)$ (i.e. it can be written in that form). Then
$$
\begin{aligned}
s(\psi + \delta | \psi) &= g \left( (\psi + \delta - \psi)^2 \right) = g(\delta^2) \\
\text{and} \quad s(D - \delta | D) &= g \left( (D - \delta - D)^2 \right) = g(\delta^2) \\
\text{so that} \quad \forall \psi \; \forall \delta \; &[s(\psi + \delta | \psi) = s(D - \delta | D)]
\end{aligned}
$$

whereupon the second premise is satisfied. Both premises being true, the eight-point argument leads us to conclude that the probability that $\theta$ lies in the confidence interval $C_D$ is $h$!
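The conclusion for the normal mean can also be checked numerically, without relying on the symmetry argument. The sketch below is an illustration only: it evaluates the flat-prior posterior on a grid and sums the mass that falls inside the usual interval. The grid width, step count, function name, and data values are arbitrary hypothetical choices.

```python
from statistics import NormalDist

def posterior_mass_in_ci(D, sigma, n, h=0.95, steps=20001, width=12.0):
    """Posterior probability of the h confidence interval under a flat prior.

    Assumptions: s(d | theta) is Normal(theta, sigma / sqrt(n)); the flat prior
    f(theta) = k is truncated to D +/- width standard errors for the grid sum.
    """
    se = sigma / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + h / 2)
    lo, hi = D - z * se, D + z * se  # the h confidence interval C_D
    start = D - width * se
    step = 2 * width * se / (steps - 1)
    total = inside = 0.0
    for i in range(steps):
        theta = start + i * step
        w = NormalDist(theta, se).pdf(D)  # s(D | theta); the constant k cancels
        total += w
        if lo <= theta <= hi:
            inside += w
    return inside / total

p_flat = posterior_mass_in_ci(D=10.0, sigma=2.0, n=25)
print(f"posterior probability of the 95% interval: {p_flat:.4f}")
```

Up to discretisation error, the result should agree with $h$, in line with the argument above.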

We therefore have an amusing irony:

  1. The frequentist who assigns the $h$ confidence interval cannot say that $P(\theta \in C_D) = h$, no matter how innocently uniform $\theta$ looks before incorporating the data.
  2. The Bayesian who would not assign an $h$ confidence interval in that way knows anyhow that $P(\theta \in C_D | DI) = h$.

Final Remarks

We have identified conditions (i.e. the two premises) under which the $h$ confidence region does indeed yield probability $h$ that $\theta \in C_D$. A frequentist will baulk at the first premise, because it involves a prior on $\theta$, and this sort of deal-breaker is inescapable on the route to a probability. But for a Bayesian, it is acceptable; nay, essential. These conditions are sufficient but not necessary, so there are many other circumstances under which the Bayesian $P(\theta \in C_D | DI)$ equals $h$. Equally though, there are many circumstances in which $P(\theta \in C_D | DI) \ne h$, especially when the prior information is significant.

We have applied a Bayesian analysis just as a consistent Bayesian would, given the information at hand, including statistics $D$. But a Bayesian, if he possibly can, will apply his methods to the raw measurements instead: to the $\{x_i\}$, rather than $\bar{x}$. Oftentimes, collapsing the raw data into summary statistics $D$ destroys information in the data; and then the summary statistics are incapable of speaking as eloquently as the original data about the parameters $\theta$.

Source : Link , Question Author : Rasmus Bååth , Answer Author : CarbonFlambe
