This is similar to Bootstrap: estimate is outside of confidence interval

I have some data that represents counts of genotypes in a population. I

want to estimate genetic diversity using Shannon’s index and also

generate a confidence interval using bootstrapping. I’ve noticed,

however, that the estimate via bootstrapping tends to be extremely

biased and results in a confidence interval that lies outside of my

observed statistic.Below is an example.

`# Shannon's index H <- function(x){ x <- x/sum(x) x <- -x * log(x, exp(1)) return(sum(x, na.rm = TRUE)) } # The version for bootstrapping H.boot <- function(x, i){ H(tabulate(x[i])) }`

Data generation

`set.seed(5000) X <- rmultinom(1, 100, prob = rep(1, 50))[, 1]`

Calculation

`H(X) ## [1] 3.67948 xi <- rep(1:length(X), X) H.boot(xi) ## [1] 3.67948 library("boot") types <- c("norm", "perc", "basic") (boot.out <- boot::boot(xi, statistic = H.boot, R = 1000L)) ## ## CASE RESAMPLING BOOTSTRAP FOR CENSORED DATA ## ## ## Call: ## boot::boot(data = xi, statistic = H.boot, R = 1000) ## ## ## Bootstrap Statistics : ## original bias std. error ## t1* 3.67948 -0.2456241 0.06363903`

Generating the CIs with bias-correction

`boot.ci(boot.out, type = types) ## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS ## Based on 1000 bootstrap replicates ## ## CALL : ## boot.ci(boot.out = boot.out, type = types) ## ## Intervals : ## Level Normal Basic Percentile ## 95% ( 3.800, 4.050 ) ( 3.810, 4.051 ) ( 3.308, 3.549 ) ## Calculations and Intervals on Original Scale`

Assuming that the variance of

tcan be used for the variance oft0.`norm.ci(t0 = boot.out$t0, var.t0 = var(boot.out$t[, 1]))[-1] ## [1] 3.55475 3.80421`

Would it be correct to report the CI centered around

t0? Is there a

better way to generate the bootstrap?

**Answer**

In the setup given by the OP the parameter of interest is the Shannon entropy

$$\theta(\mathbf{p}) = – \sum_{i = 1}^{50} p_i \log p_i,$$

which is a function of the probability vector $\mathbf{p} \in \mathbb{R}^{50}$.

The estimator based on $n$ samples ($n = 100$ in the simulation) is the plug-in estimator

$$\hat{\theta}_n = \theta(\hat{\mathbf{p}}_n) = – \sum_{i=1}^{50} \hat{p}_{n,i} \log \hat{p}_{n,i}.$$

The samples were generated using the uniform distribution for which the Shannon entropy is $\log(50) = 3.912.$ Since the Shannon entropy is maximized in the uniform distribution, the plug-in estimator *must be downward biased*. A simulation shows that $\mathrm{bias}(\hat{\theta}_{100}) \simeq -0.28$ whereas

$\mathrm{bias}(\hat{\theta}_{500}) \simeq -0.05$. The plug-in estimator is consistent, but the $\Delta$-method does not apply for $\mathbf{p}$ being the uniform distribution, because the derivative of the Shannon entropy is 0. Thus for this particular choice of $\mathbf{p}$, confidence intervals based on asymptotic arguments are not obvious.

The percentile interval is based on the distribution of $\theta(\mathbf{p}_n^*)$ where $\mathbf{p}_n^*$ is the estimator obtained from sampling $n$ observations from $\hat{\mathbf{p}}_n$. Specifically, it is the interval from the 2.5% quantile to the 97.5% quantile for the distribution of $\theta(\mathbf{p}_n^*)$. As the OP’s bootstrap simulation shows, $\theta(\mathbf{p}_n^*)$ is clearly also downward biased as an estimator of $\theta(\hat{\mathbf{p}}_n)$, which results in the percentile interval being completely wrong.

For the basic (and normal) interval, the roles of the quantiles are interchanged. This implies that the interval does seem to be reasonable (it covers 3.912), though intervals extending beyond 3.912 are not logically meaningful. Moreover, I don’t know if the basic interval will have the correct coverage. Its justification is based on the following approximate distributional identity:

$$\theta(\mathbf{p}_n^*) – \theta(\hat{\mathbf{p}}_n) \overset{\mathcal{D}}{\simeq} \theta(\hat{\mathbf{p}}_n) – \theta(\mathbf{p}),$$

which might be questionable for (relatively) small $n$ like $n = 100$.

The OP’s last suggestion of a standard error based interval $\theta(\hat{\mathbf{p}}_n) \pm 1.96\hat{\mathrm{se}}_n$ will *not* work either because of the large bias. It *might* work for a bias-corrected estimator, but then you first of all need correct standard errors for the bias-corrected estimator.

I would consider a likelihood interval based of the profile log-likelihood for $\theta(\mathbf{p})$. I’m afraid that I don’t know any simple way to compute the profile log-likelihood for this example except that you need to maximize the log-likelihood over $\mathbf{p}$ for different fixed values of $\theta(\mathbf{p})$.

**Attribution***Source : Link , Question Author : ZNK , Answer Author : NRH*