I am reading the chapter on Frequentist Statistics from Kevin Murphy’s book “Machine Learning: A Probabilistic Perspective”. The section on the bootstrap reads:

The bootstrap is a simple Monte Carlo technique to approximate the sampling distribution. This is particularly useful in cases where the estimator is a complex function of the true parameters.

The idea is simple. If we knew the true parameters \theta^*, we could generate many (say S) fake datasets, each of size N, from the true distribution, x_i^s \sim p(\cdot \mid \theta^*), for s = 1:S, i = 1:N. We could then compute our estimator from each sample, \hat{\theta}^s = f(x^s_{1:N}) and use the empirical distribution of the resulting samples as our estimate of the sampling distribution. Since \theta is unknown, the idea of the parametric bootstrap is to generate the samples using \hat{\theta}(D) instead. An alternative, called the non-parametric bootstrap, is to sample the x^s_i (with replacement) from the original data D, and then compute the induced distribution as before. Some methods for speeding up the bootstrap when applied to massive data sets are discussed in (Kleiner et al. 2011).

1. The text says: “If we knew the true parameters \theta^* … we could compute our estimator from each sample, \hat{\theta}^s …”, but why would I use the estimator of each sample if I already know the true parameters \theta^*?

2. Also, what is the difference here between the empirical distribution and the sampling distribution?

3. Finally, I don’t quite understand the difference between parametric and non-parametric bootstrap from this text. They both infer \theta from the set of observations D, but what exactly is the difference?

**Answer**

The answer given by miura is not entirely accurate so I am answering this old question for posterity:

(2). These are very different things. The empirical cdf is an estimate of the CDF (distribution) which generated the data. Precisely, it is the discrete CDF which assigns probability 1/n to each observed data point, \hat{F}(x) = \frac{1}{n}\sum_{i=1}^n I(X_i\leq x), for each x. This estimator converges to the true cdf: \hat{F}(x) \to F(x) = P(X_i\leq x) almost surely for each x (in fact uniformly).
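
As a rough illustration of the definition above, here is a minimal Python sketch of the empirical cdf. The names `ecdf` and `F_hat`, the sample size, and the choice of standard normal data are all assumptions made for the example, not anything from the book.

```python
import bisect
import random


def ecdf(sample):
    """Return the empirical cdf of `sample` as a function of x."""
    data = sorted(sample)
    n = len(data)

    def F_hat(x):
        # bisect_right counts the observations that are <= x,
        # so this is (1/n) * sum of I(X_i <= x).
        return bisect.bisect_right(data, x) / n

    return F_hat


random.seed(0)  # fixed seed so the sketch is reproducible
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]  # assumed data: standard normal
F_hat = ecdf(sample)

# The true standard normal cdf at 0 is 0.5; F_hat(0) should be near it.
print(F_hat(0.0))
```

With more observations, `F_hat(x)` should settle closer to the true F(x) at every x.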

The sampling distribution of a statistic T is instead the distribution of the statistic you would expect to see under repeated experimentation. That is, you perform your experiment once and collect data \{X_1,\ldots,X_n\}. T is a function of your data: T = T(X_1,\ldots,X_n). Now, suppose you repeat the experiment, and collect data \{X'_1,\ldots,X'_n\}. Recalculating T on the new sample gives T' = T(X'_1,\ldots,X'_n). If we collected 100 samples we would have 100 estimates of T. These observations of T are draws from the sampling distribution of T, which is a true distribution. As the number of experiments goes to infinity, the mean of the observed values converges to E(T) and their variance to Var(T).
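
The idea of repeated experimentation can be simulated when the data-generating distribution is known. The sketch below assumes, purely for illustration, normal data with hypothetical parameters `MU` and `SIGMA`, takes T to be the sample mean, and repeats the experiment `REPS` times.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

MU, SIGMA = 5.0, 2.0  # hypothetical "true" parameters (assumed for the example)
N = 25                # observations per experiment
REPS = 2000           # number of repeated experiments


def run_experiment():
    """Collect one fresh data set and return T (here, the sample mean)."""
    data = [random.gauss(MU, SIGMA) for _ in range(N)]
    return statistics.mean(data)


# Each entry is one observation of T from an independent experiment.
t_values = [run_experiment() for _ in range(REPS)]

print(statistics.mean(t_values))      # should be close to E(T) = MU
print(statistics.variance(t_values))  # should be close to Var(T) = SIGMA**2 / N
```

The list `t_values` is only an approximation to the sampling distribution; it improves as `REPS` grows.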

In general, of course, we don’t repeat experiments like this; we only ever see one instance of T. Figuring out what the variance of T is from a single observation is very difficult if you don’t know the underlying probability function of T a priori. Bootstrapping is a way to estimate that sampling distribution of T by artificially running “new experiments” on which to calculate new instances of T. In the nonparametric bootstrap, each new sample is actually just a resample from the original data. That this tells you about the variability of T using only the original data is mysterious and totally awesome.

(1). You are correct: you would not do this. The author is trying to motivate the bootstrap by describing it as doing “what you would do if you knew the distribution”, but substituting an estimate of that distribution: the fitted model p(\cdot \mid \hat{\theta}(D)) in the parametric case, or the empirical cdf in the nonparametric case.

For example, suppose you know that your test statistic T is normally distributed with mean zero, variance one. How would you estimate the sampling distribution of T? Well, since you know the distribution, a silly and redundant way to estimate the sampling distribution is to use R to generate 10,000 or so standard normal random variables, then take their sample mean and variance, and use these as your estimates of the mean and variance of the sampling distribution of T.
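
The redundant simulation described above looks roughly like this. The answer mentions R; this sketch uses Python's standard library instead, and the variable names are illustrative.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# T is assumed known to be N(0, 1), so we can simulate it directly.
draws = [random.gauss(0.0, 1.0) for _ in range(10_000)]

mean_T = statistics.mean(draws)     # estimate of the mean of T (true value 0)
var_T = statistics.variance(draws)  # estimate of the variance of T (true value 1)
print(mean_T, var_T)
```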

If we *don’t* know a priori the parameters of T, but we do know that it’s normally distributed, what we can do instead is generate 10,000 or so samples from the empirical cdf, calculate T on each of them, then take the sample mean and variance of these 10,000 Ts, and use them as our estimates of the expected value and variance of T. Since the empirical cdf is a good estimator of the true cdf, the sample parameters should converge to the true parameters. This is one form of parametric bootstrap: you posit a model on the statistic you want to estimate. The model is indexed by a parameter, e.g. (\mu, \sigma), which you estimate from repeated sampling from the ecdf. In the version quoted in the question, the parametric assumption is placed on the data instead: the fake data sets are drawn from the fitted model p(\cdot \mid \hat{\theta}(D)) rather than from the ecdf.
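
The parametric bootstrap as quoted in the question (fake data sets generated from the fitted model using \hat{\theta}(D)) can be sketched as follows. The observed data here are simulated stand-ins, the normal model is an assumption, and the sample median is chosen only as an example of a statistic whose sampling distribution is awkward to derive by hand.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical observed data D; in practice this is your one real sample.
data = [random.gauss(10.0, 3.0) for _ in range(40)]
n = len(data)

# Step 1: fit the assumed parametric model (normal) to D.
mu_hat = statistics.mean(data)
sigma_hat = statistics.stdev(data)


def statistic(xs):
    """The statistic T of interest (here, the sample median)."""
    return statistics.median(xs)


# Step 2: draw S fake data sets of size n from the fitted model
# and recompute T on each one.
S = 2000
boot = []
for _ in range(S):
    fake = [random.gauss(mu_hat, sigma_hat) for _ in range(n)]
    boot.append(statistic(fake))

# Step 3: summarize the bootstrap replicates of T.
se_median = statistics.stdev(boot)  # bootstrap standard error of the median
print(statistic(data), se_median)
```

Note that the resampling step never touches `data` directly; it only uses the fitted `mu_hat` and `sigma_hat`, which is what makes this version parametric.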

(3). The nonparametric bootstrap doesn’t even require you to know a priori that T is normally distributed. Instead, you simply draw repeated samples from the ecdf, and calculate T on each one. After you’ve drawn 10,000 or so samples and calculated 10,000 Ts, you can plot a histogram of your estimates. This is a visualization of the sampling distribution of T. The nonparametric bootstrap won’t tell you that the sampling distribution is normal, or gamma, or so on, but it allows you to estimate the sampling distribution (usually) as precisely as needed. It makes fewer assumptions and provides less information than the parametric bootstrap. It is less precise when the parametric assumption is true but more accurate when it is false. Which one you use in each situation you encounter depends entirely on context. Admittedly more people are familiar with the nonparametric bootstrap but frequently a weak parametric assumption makes a completely intractable model amenable to estimation, which is lovely.
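
A minimal sketch of the nonparametric bootstrap, under the same caveats: the data are simulated stand-ins, the statistic (the sample median) and the number of resamples `B` are arbitrary choices, and the percentile interval at the end is just one common way to summarize the replicates.

```python
import random
import statistics

random.seed(4)  # fixed seed so the sketch is reproducible

# Hypothetical observed data D. It happens to be exponential,
# but the procedure below never uses that fact.
data = [random.expovariate(1.0) for _ in range(50)]
n = len(data)


def statistic(xs):
    """The statistic T of interest (here, the sample median)."""
    return statistics.median(xs)


B = 2000
boot = []
for _ in range(B):
    # n draws with replacement from the data = one sample from the ecdf.
    resample = random.choices(data, k=n)
    boot.append(statistic(resample))

# Approximate 95% percentile interval: with n=40, quantiles() returns
# 39 cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(boot, n=40)
lo, hi = cuts[0], cuts[-1]
print(statistic(data), (lo, hi))
```

Plotting a histogram of `boot` would give the visualization of the sampling distribution described above.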

**Attribution**

*Source: Link, Question Author: Amelio Vazquez-Reina, Answer Author: guest47*