# When does a confidence interval “make sense” but the corresponding credible interval does not?

It is often the case that a confidence interval with 95% coverage is very similar to a credible interval that contains 95% of the posterior density. This happens when the prior is uniform or near uniform in the latter case. Thus a confidence interval can often be used to approximate a credible interval and vice versa. Importantly, we can conclude from this that the much maligned misinterpretation of a confidence interval as a credible interval has little to no practical importance for many simple use cases.

There are a number of examples out there of cases where this does not happen, however they all seem to be cherrypicked by proponents of Bayesian stats in an attempt to prove there is something wrong with the frequentist approach. In these examples, we see the confidence interval contains impossible values, etc which is supposed to show that they are nonsense.

I don’t want to go back over those examples, or a philosophical discussion of Bayesian vs Frequentist.

I am just looking for examples of the opposite. Are there any cases where the confidence and credible intervals are substantially different, and the interval provided by the confidence procedure is clearly superior?

To clarify: This is about the situation when the credible interval is usually expected to coincide with the corresponding confidence interval, ie when using flat, uniform, etc priors. I am not interested in the case where someone chooses an arbitrarily bad prior.

EDIT:
In response to @JaeHyeok Shin’s answer below, I must disagree that his example uses the correct likelihood. I used approximate bayesian computation to estimate the correct posterior distribution for theta below in R:

### Methods ###
# Packages
require(HDInterval)

# Define the likelihood
like <- function(k = 1.2, theta = 0, n_print = 1e5){
x    = NULL
rule = FALSE
while(!rule){
x     = c(x, rnorm(1, theta, 1))
n     = length(x)
x_bar = mean(x)

rule = sqrt(n)*abs(x_bar) > k

if(n %% n_print == 0){ print(c(n, sqrt(n)*abs(x_bar))) }
}
return(x)
}

# Plot results
plot_res <- function(chain, i){
par(mfrow = c(2, 1))
plot(chain[1:i, 1], type = "l", ylab = "Theta", panel.first = grid())
hist(chain[1:i, 1], breaks = 20, col = "Grey", main = "", xlab = "Theta")
}

### Generate target data ###
set.seed(0123)
X = like(theta = 0)
m = mean(X)

### Get posterior estimate of theta via ABC ###
tol   = list(m = 1)
nBurn = 1e3
nStep = 1e4

# Initialize MCMC chain
chain           = as.data.frame(matrix(nrow = nStep, ncol = 2))
colnames(chain) = c("theta", "mean")
chain$theta = rnorm(1, 0, 10) # Run ABC for(i in 2:nStep){ theta = rnorm(1, chain[i - 1, 1], 10) prop = like(theta = theta) m_prop = mean(prop) if(abs(m_prop - m) < tol$m){
chain[i,] = c(theta, m_prop)
}else{
chain[i, ] = chain[i - 1, ]
}
if(i %% 100 == 0){
print(paste0(i, "/", nStep))
plot_res(chain, i)
}
}

# Remove burn-in
chain = chain[-(1:nBurn), ]

# Results
plot_res(chain, nrow(chain))
as.numeric(hdi(chain[, 1], credMass = 0.95))


This is the 95% credible interval:

> as.numeric(hdi(chain[, 1], credMass = 0.95))
 -1.400304  1.527371 EDIT #2:

Here is an update after @JaeHyeok Shin’s comments. I’m trying to keep it as simple as possible but the script got a bit more complicated. Main changes:

1. Now using a tolerance of 0.001 for the mean (it was 1)
2. Increased number of steps to 500k to account for smaller tolerance
3. Decreased the sd of the proposal distribution to 1 to account for smaller tolerance (it was 10)
4. Added the simple rnorm likelihood with n = 2k for comparison
5. Added the sample size (n) as a summary statistic, set tolerance to 0.5*n_target

Here is the code:

### Methods ###
# Packages
require(HDInterval)

# Define the likelihood
like <- function(k = 1.3, theta = 0, n_print = 1e5, n_max = Inf){
x    = NULL
rule = FALSE
while(!rule){
x     = c(x, rnorm(1, theta, 1))
n     = length(x)
x_bar = mean(x)
rule  = sqrt(n)*abs(x_bar) > k
if(!rule){
rule = ifelse(n > n_max, TRUE, FALSE)
}

if(n %% n_print == 0){ print(c(n, sqrt(n)*abs(x_bar))) }
}
return(x)
}

# Define the likelihood 2
like2 <- function(theta = 0, n){
x = rnorm(n, theta, 1)
return(x)
}

# Plot results
plot_res <- function(chain, chain2, i, main = ""){
par(mfrow = c(2, 2))
plot(chain[1:i, 1],  type = "l", ylab = "Theta", main = "Chain 1", panel.first = grid())
hist(chain[1:i, 1],  breaks = 20, col = "Grey", main = main, xlab = "Theta")
plot(chain2[1:i, 1], type = "l", ylab = "Theta", main = "Chain 2", panel.first = grid())
hist(chain2[1:i, 1], breaks = 20, col = "Grey", main = main, xlab = "Theta")
}

### Generate target data ###
set.seed(01234)
X    = like(theta = 0, n_print = 1e5, n_max = 1e15)
m    = mean(X)
n    = length(X)
main = c(paste0("target mean = ", round(m, 3)), paste0("target n = ", n))

### Get posterior estimate of theta via ABC ###
tol   = list(m = .001, n = .5*n)
nBurn = 1e3
nStep = 5e5

# Initialize MCMC chain
chain           = chain2 = as.data.frame(matrix(nrow = nStep, ncol = 2))
colnames(chain) = colnames(chain2) = c("theta", "mean")
chain$$theta = chain2$$theta  = rnorm(1, 0, 1)

# Run ABC
for(i in 2:nStep){
# Chain 1
theta1 = rnorm(1, chain[i - 1, 1], 1)
prop   = like(theta = theta1, n_max = n*(1 + tol$$n)) m_prop = mean(prop) n_prop = length(prop) if(abs(m_prop - m) < tol$$m &&
abs(n_prop - n) < tol$n){ chain[i,] = c(theta1, m_prop) }else{ chain[i, ] = chain[i - 1, ] } # Chain 2 theta2 = rnorm(1, chain2[i - 1, 1], 1) prop2 = like2(theta = theta2, n = 2000) m_prop2 = mean(prop2) if(abs(m_prop2 - m) < tol$m){
chain2[i,] = c(theta2, m_prop2)
}else{
chain2[i, ] = chain2[i - 1, ]
}

if(i %% 1e3 == 0){
print(paste0(i, "/", nStep))
plot_res(chain, chain2, i, main = main)
}
}

# Remove burn-in
nBurn  = max(which(is.na(chain$$mean) | is.na(chain2$$mean)))
chain  = chain[ -(1:nBurn), ]
chain2 = chain2[-(1:nBurn), ]

# Results
plot_res(chain, chain2, nrow(chain), main = main)
hdi1 = as.numeric(hdi(chain[, 1],  credMass = 0.95))
hdi2 = as.numeric(hdi(chain2[, 1], credMass = 0.95))

2*1.96/sqrt(2e3)
diff(hdi1)
diff(hdi2)


The results, where hdi1 is my “likelihood” and hdi2 is the simple rnorm(n, theta, 1):

> 2*1.96/sqrt(2e3)
 0.08765386
> diff(hdi1)
 1.087125
> diff(hdi2)
 0.07499163


So after lowering the tolerance sufficiently, and at the expense of many more MCMC steps, we can see the expected CrI width for the rnorm model. In a sequential experimental design, the credible interval can be misleading.

(Disclaimer: I am not arguing it is not reasonable – it is perfectly reasonable in Bayesian reasoning and is not misleading in the perspective of Bayesian point of view.)

For a simple example, let say we have a machine giving us a random sample $$XX$$ from $$N(θ,1)N(\theta,1)$$ with unknown $$θ\theta$$. Instead of drawing $$nn$$ i.i.d. samples, we draw samples until $$√nˉXn>k\sqrt{n} \bar{X}_n > k$$ for a fixed $$kk$$. That is, the number of samples is a stopping time $$NN$$ defined by
$$N=inf N = \inf\left\{n \geq 1 : \sqrt{n}\bar{X}_n > k \right\}.$$

From the law of iterated logarithm, we know $$P_{\theta}(N < \infty) = 1P_{\theta}(N < \infty) = 1$$ for any $$\theta \in \mathbb{R}\theta \in \mathbb{R}$$. This type of stopping rule is commonly used in sequential testing/estimations to reduce the number of samples to do inference.

Likelihood principle shows that posterior of $$\theta\theta$$ is not affected by stopping rule and thus for any reasonable smooth prior $$\pi(\theta)\pi(\theta)$$ (e.g., $$\theta \sim N(0, 10))\theta \sim N(0, 10))$$, if we set a large enough $$kk$$, the posterior of $$\theta\theta$$ is approximately $$N(\bar{X}_N, 1/N)N(\bar{X}_N, 1/N)$$ and thus the credible interval is approximately given as
$$CI_{bayes} :=\left[\bar{X}_N - \frac{1.96}{\sqrt{N}}, \bar{X}_N + \frac{1.96}{\sqrt{N}}\right]. CI_{bayes} :=\left[\bar{X}_N - \frac{1.96}{\sqrt{N}}, \bar{X}_N + \frac{1.96}{\sqrt{N}}\right].$$
However, from the definition of $$NN$$, we know that this credible interval does not contain $$00$$ if $$kk$$ is large since
$$0 < \bar{X}_N - \frac{k}{\sqrt{N}} \ll \bar{X}_N - \frac{1.96}{\sqrt{N}} 0 < \bar{X}_N - \frac{k}{\sqrt{N}} \ll \bar{X}_N - \frac{1.96}{\sqrt{N}}$$
for $$k \gg 0k \gg 0$$. Therefore, the frequentist coverage of $$CI_{bayes}CI_{bayes}$$ is zero since
$$\inf_{\theta} P_\theta( \theta \in CI_{bayes} ) = 0, \inf_{\theta} P_\theta( \theta \in CI_{bayes} ) = 0,$$
and $$00$$ is attained when $$\theta\theta$$ is $$00$$. In contrast, the Bayesian coverage is always approximately equal to $$0.950.95$$ since
$$\mathbb{P}( \theta \in CI_{bayes} | X_1, \dots, X_N) \approx 0.95. \mathbb{P}( \theta \in CI_{bayes} | X_1, \dots, X_N) \approx 0.95.$$

Take home message: If you are interested in having a frequentist guarantee, you should be careful about using Bayesian inference tools which is always valid for Bayesian guarantees but not always for frequentist ones.

(I learned this example from Larry's awesome lecture. This note contains many interesting discussion about the subtle difference between frequentist and Bayesian frameworks. http://www.stat.cmu.edu/~larry/=stat705/Lecture14.pdf)

EDIT In Livid's ABC, the tolerance value is too large, so even for the standard setting where we sample a fixed number of observations, it does not gives a correct CR. I am not familiar with ABC but if I only changed the tol value to 0.05, we can have a very skewed CR as following

> X = like(theta = 0)
> m = mean(X)
> print(m)
 0.02779672 > as.numeric(hdi(chain[, 1], credMass = 0.95))  -0.01711265 0.14253673

Of course, the chain is not well-stabilized but even if we increase the chain length, we can get similar CR - skewed to positive part.

In fact, I think the rejection rule based on mean difference is not well-suited in this setting since with high probability $$\sqrt{N}\bar{X}_N\sqrt{N}\bar{X}_N$$ is close to $$kk$$ if $$0<\theta \ll k0<\theta \ll k$$ and close to $$-k-k$$ if $$-k \ll \theta <0-k \ll \theta <0$$.