# Bootstrap: the issue of overfitting

Suppose one performs the so-called non-parametric bootstrap by drawing $B$ samples of size $n$ each from the original $n$ observations with replacement. I believe this procedure is equivalent to estimating the cumulative distribution function by the empirical cdf:

http://en.wikipedia.org/wiki/Empirical_distribution_function

and then obtaining the bootstrap samples by simulating $n$ observations from the estimated cdf $B$ times in a row.
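The equivalence described above can be checked numerically. Here is a minimal Python sketch, assuming NumPy; the function names `resample_with_replacement` and `sample_from_ecdf` and the toy data are made up for illustration. The first function picks indices uniformly at random. The second draws uniforms and pushes them through the generalized inverse of the empirical cdf, $\hat F_n^{-1}(u) = x_{(\lceil nu \rceil)}$. Both put mass $1/n$ on each observation.

```python
import numpy as np


def resample_with_replacement(x, B, rng):
    """Draw B bootstrap samples of size n by picking indices uniformly."""
    x = np.asarray(x)
    n = x.size
    idx = rng.integers(0, n, size=(B, n))  # indices in {0, ..., n-1}
    return x[idx]


def sample_from_ecdf(x, B, rng):
    """Draw B samples of size n by inverse-transform sampling from the ECDF."""
    x_sorted = np.sort(np.asarray(x))
    n = x_sorted.size
    u = rng.random(size=(B, n))  # uniforms on [0, 1)
    # Generalized inverse of the ECDF: the k-th order statistic, k = ceil(n * u).
    k = np.clip(np.ceil(n * u).astype(int), 1, n)  # clip guards the u == 0 edge case
    return x_sorted[k - 1]


rng = np.random.default_rng(0)
x = rng.normal(size=20)  # toy data, purely illustrative

a = resample_with_replacement(x, B=2000, rng=rng)
b = sample_from_ecdf(x, B=2000, rng=rng)

# Both schemes only ever return original observations, and the bootstrap
# means from either scheme can be compared with the sample mean.
print(np.isin(a, x).all(), np.isin(b, x).all())
print(x.mean(), a.mean(axis=1).mean(), b.mean(axis=1).mean())
```

The two functions consume the random stream differently, so they do not return identical arrays for the same seed; the point is only that they sample from the same distribution.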

If I am right in this, then one has to address the issue of overfitting, because the empirical cdf has about $n$ parameters. Of course, asymptotically it converges to the population cdf, but what about finite samples? E.g., if I were to tell you that I have 100 observations and I am going to estimate the cdf as $N(\mu, \sigma^2)$ with two parameters, you wouldn’t be alarmed. However, if the number of parameters were to go up to 100, it wouldn’t seem reasonable at all.
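One way to make the two-parameter versus ECDF comparison concrete is to bootstrap the same statistic both ways and look at the resulting standard errors. The sketch below assumes NumPy; the function names and the toy data are hypothetical, and it does not settle the overfitting question. It only shows how the two schemes could be put side by side for a given statistic and sample size. For the sample mean the two should agree closely, since both are close to $s/\sqrt{n}$.

```python
import numpy as np


def nonparametric_bootstrap_se(x, B, rng, statistic=np.mean):
    """SE of `statistic` from resampling the data with replacement."""
    x = np.asarray(x)
    n = x.size
    samples = x[rng.integers(0, n, size=(B, n))]
    return statistic(samples, axis=1).std(ddof=1)


def normal_parametric_bootstrap_se(x, B, rng, statistic=np.mean):
    """SE of `statistic` from simulating out of a fitted N(mu, sigma^2)."""
    x = np.asarray(x)
    n = x.size
    mu_hat, sigma_hat = x.mean(), x.std(ddof=1)  # the two fitted parameters
    samples = rng.normal(mu_hat, sigma_hat, size=(B, n))
    return statistic(samples, axis=1).std(ddof=1)


rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100)  # toy data: 100 observations

print("nonparametric:", nonparametric_bootstrap_se(x, 2000, rng))
print("parametric:   ", normal_parametric_bootstrap_se(x, 2000, rng))
print("s / sqrt(n):  ", x.std(ddof=1) / np.sqrt(x.size))
```

Passing another function that accepts an `axis` argument, e.g. `statistic=np.median`, lets one repeat the comparison for statistics where the answer is less obvious.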

Likewise, when one employs a standard multiple linear regression, the distribution of the error term is estimated as $N(0, \sigma^2)$. If one decides to switch to bootstrapping the residuals, one has to realize that now there are about $n$ parameters used just to handle the error term distribution.
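For concreteness, bootstrapping the residuals could look roughly like the following. This is a minimal sketch assuming NumPy and ordinary least squares; `residual_bootstrap` is a hypothetical helper and the data are simulated. The error distribution is represented by the empirical distribution of the $n$ centred residuals rather than by a fitted $N(0, \sigma^2)$.

```python
import numpy as np


def residual_bootstrap(X, y, B, rng):
    """Bootstrap OLS coefficients by resampling centred residuals."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta_hat
    resid = y - fitted
    resid = resid - resid.mean()  # centre; numerically a no-op if X has an intercept
    betas = np.empty((B, p))
    for b in range(B):
        # Draw n errors from the empirical distribution of the residuals.
        e_star = rng.choice(resid, size=n, replace=True)
        y_star = fitted + e_star
        betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return beta_hat, betas


rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])               # intercept + one regressor
y = 1.0 + 2.0 * x + rng.standard_t(df=5, size=n)   # toy data, non-normal errors

beta_hat, betas = residual_bootstrap(X, y, B=1000, rng=rng)
print("OLS estimate:      ", beta_hat)
print("bootstrap std. err:", betas.std(axis=0, ddof=1))
```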

Could you please direct me to some sources that address this issue explicitly, or tell me why it’s not an issue if you think I got it wrong?

I am not completely sure I understand your question correctly… I am assuming you are interested in the order of convergence?

> because the empirical cdf has about $n$ parameters. Of course, asymptotically it converges to the population cdf, but what about finite samples?

Have you read any of the basics on bootstrap theory?
The problem is that it gets pretty wild (mathematically) pretty quickly.

Anyway, I recommend having a look at

van der Vaart “Asymptotic Statistics” chapter 23.

Hall “The Bootstrap and Edgeworth Expansion” (lengthy but concise, and less hand-waving than van der Vaart, I’d say)

for the basics.

Chernick “Bootstrap Methods” is aimed more at users than at mathematicians, but it has a section on “where bootstrap fails”.

The classical Efron/Tibshirani has little on why the bootstrap actually works…