# Biased bootstrap: is it okay to center the CI around the observed statistic?

This is similar to Bootstrap: estimate is outside of confidence interval

I have some data that represents counts of genotypes in a population. I
want to estimate genetic diversity using Shannon’s index and also
generate a confidence interval using bootstrapping. I’ve noticed,
however, that the estimate via bootstrapping tends to be extremely
biased and results in a confidence interval that lies outside of my
observed statistic.

Below is an example.

# Shannon's index
H <- function(x){
x <- x/sum(x)
x <- -x * log(x, exp(1))
return(sum(x, na.rm = TRUE))
}
# The version for bootstrapping
H.boot <- function(x, i){
H(tabulate(x[i]))
}


Data generation

set.seed(5000)
X <- rmultinom(1, 100, prob = rep(1, 50))[, 1]


Calculation

H(X)

## [1] 3.67948

xi <- rep(1:length(X), X)
H.boot(xi)

## [1] 3.67948

library("boot")
types <- c("norm", "perc", "basic")
(boot.out <- boot::boot(xi, statistic = H.boot, R = 1000L))

##
## CASE RESAMPLING BOOTSTRAP FOR CENSORED DATA
##
##
## Call:
## boot::boot(data = xi, statistic = H.boot, R = 1000)
##
##
## Bootstrap Statistics :
##     original     bias    std. error
## t1*  3.67948 -0.2456241  0.06363903


Generating the CIs with bias-correction

boot.ci(boot.out, type = types)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.out, type = types)
##
## Intervals :
## Level      Normal              Basic              Percentile
## 95%   ( 3.800,  4.050 )   ( 3.810,  4.051 )   ( 3.308,  3.549 )
## Calculations and Intervals on Original Scale


Assuming that the variance of t can be used for the variance of t0.

norm.ci(t0 = boot.out$t0, var.t0 = var(boot.out$t[, 1]))[-1]

## [1] 3.55475 3.80421


Would it be correct to report the CI centered around t0? Is there a
better way to generate the bootstrap?

In the setup given by the OP the parameter of interest is the Shannon entropy
$$\theta(\mathbf{p}) = – \sum_{i = 1}^{50} p_i \log p_i,$$
which is a function of the probability vector $\mathbf{p} \in \mathbb{R}^{50}$.
The estimator based on $n$ samples ($n = 100$ in the simulation) is the plug-in estimator
$$\hat{\theta}_n = \theta(\hat{\mathbf{p}}_n) = – \sum_{i=1}^{50} \hat{p}_{n,i} \log \hat{p}_{n,i}.$$
The samples were generated using the uniform distribution for which the Shannon entropy is $\log(50) = 3.912.$ Since the Shannon entropy is maximized in the uniform distribution, the plug-in estimator must be downward biased. A simulation shows that $\mathrm{bias}(\hat{\theta}_{100}) \simeq -0.28$ whereas
$\mathrm{bias}(\hat{\theta}_{500}) \simeq -0.05$. The plug-in estimator is consistent, but the $\Delta$-method does not apply for $\mathbf{p}$ being the uniform distribution, because the derivative of the Shannon entropy is 0. Thus for this particular choice of $\mathbf{p}$, confidence intervals based on asymptotic arguments are not obvious.

The percentile interval is based on the distribution of $\theta(\mathbf{p}_n^*)$ where $\mathbf{p}_n^*$ is the estimator obtained from sampling $n$ observations from $\hat{\mathbf{p}}_n$. Specifically, it is the interval from the 2.5% quantile to the 97.5% quantile for the distribution of $\theta(\mathbf{p}_n^*)$. As the OP’s bootstrap simulation shows, $\theta(\mathbf{p}_n^*)$ is clearly also downward biased as an estimator of $\theta(\hat{\mathbf{p}}_n)$, which results in the percentile interval being completely wrong.

For the basic (and normal) interval, the roles of the quantiles are interchanged. This implies that the interval does seem to be reasonable (it covers 3.912), though intervals extending beyond 3.912 are not logically meaningful. Moreover, I don’t know if the basic interval will have the correct coverage. Its justification is based on the following approximate distributional identity:

$$\theta(\mathbf{p}_n^*) – \theta(\hat{\mathbf{p}}_n) \overset{\mathcal{D}}{\simeq} \theta(\hat{\mathbf{p}}_n) – \theta(\mathbf{p}),$$
which might be questionable for (relatively) small $n$ like $n = 100$.

The OP’s last suggestion of a standard error based interval $\theta(\hat{\mathbf{p}}_n) \pm 1.96\hat{\mathrm{se}}_n$ will not work either because of the large bias. It might work for a bias-corrected estimator, but then you first of all need correct standard errors for the bias-corrected estimator.

I would consider a likelihood interval based of the profile log-likelihood for $\theta(\mathbf{p})$. I’m afraid that I don’t know any simple way to compute the profile log-likelihood for this example except that you need to maximize the log-likelihood over $\mathbf{p}$ for different fixed values of $\theta(\mathbf{p})$.