“forgetfulness” of the prior in the Bayesian setting?

It is well known that as you accumulate more evidence (say, a larger n in a sample of n i.i.d. examples), the Bayesian prior gets “forgotten” and the inference is increasingly dominated by the evidence (the likelihood).

It is easy to see this for various specific cases (such as a Bernoulli likelihood with a Beta prior, or other conjugate examples) – but is there a way to see it in the general case, with x_1,\ldots,x_n \sim p(x|\mu) and some prior p(\mu)?

EDIT: I am guessing it cannot be shown in the general case for any prior (for example, a point-mass prior would keep the posterior a point-mass). But perhaps there are certain conditions under which a prior is forgotten.
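
For concreteness, here is a minimal numerical sketch of the Beta–Bernoulli case mentioned above, assuming an arbitrary true \theta = 0.3 and two arbitrarily chosen priors, Beta(1, 1) and Beta(20, 2); the two posteriors move closer and closer together as n grows:

```python
# Minimal sketch: two different Beta priors on the same growing
# Bernoulli(0.3) sample (the true theta, the priors and the seed are
# arbitrary choices). Conjugacy keeps both posteriors Beta, so they can
# be compared directly.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
theta_true = 0.3
x = rng.binomial(1, theta_true, size=10_000)

# Fine grid on (0, 1) used only to approximate the distance between densities.
grid = np.linspace(1e-4, 1 - 1e-4, 9_999)
dtheta = grid[1] - grid[0]

for n in (10, 100, 10_000):
    k = x[:n].sum()                      # successes among the first n observations
    post_p = beta(1 + k, 1 + n - k)      # posterior under a Beta(1, 1) prior
    post_q = beta(20 + k, 2 + n - k)     # posterior under a Beta(20, 2) prior
    # Approximate total variation distance between the two posteriors.
    tv = 0.5 * np.sum(np.abs(post_p.pdf(grid) - post_q.pdf(grid))) * dtheta
    print(f"n = {n:6d}: posterior means {post_p.mean():.3f} vs {post_q.mean():.3f}, "
          f"approx TV distance {tv:.3f}")
```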

Here is the kind of “path” I am thinking of for showing something like this:

Assume the parameter space is \Theta, and let p(\theta) and q(\theta) be two priors which place non-zero probability mass on all of \Theta. The two posteriors, one for each prior, are then:

p(\theta | x_1,\ldots,x_n) = \frac{\prod_i p(x_i | \theta)\, p(\theta)}{\int_{\Theta} \prod_i p(x_i | \theta')\, p(\theta')\, d\theta'}

and
q(\theta | x_1,\ldots,x_n) = \frac{\prod_i p(x_i | \theta)\, q(\theta)}{\int_{\Theta} \prod_i p(x_i | \theta')\, q(\theta')\, d\theta'}

If you divide the posterior p by the posterior q, the common likelihood factor \prod_i p(x_i|\theta) cancels and you get:

p(\theta | x_1,\ldots,x_n) / q(\theta | x_1,\ldots,x_n) = \frac{p(\theta)\,\int_{\Theta} \prod_i p(x_i | \theta')\, q(\theta')\, d\theta'}{q(\theta)\,\int_{\Theta} \prod_i p(x_i | \theta')\, p(\theta')\, d\theta'}

Now I would like to study this ratio as n goes to \infty. Ideally it would go to 1 for a certain \theta that “makes sense” (or show some other nice behavior), but I can’t figure out how to show anything there.
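
For intuition, here is a rough numerical sketch of this ratio, assuming a Bernoulli likelihood, a grid discretization of \Theta = (0, 1), and two arbitrary full-support priors (a flat one and a Beta(20, 2)-shaped one); the ratio is evaluated at the grid point closest to the assumed true \theta:

```python
# Rough sketch of p(theta | x_1..x_n) / q(theta | x_1..x_n) on a grid.
# The Bernoulli likelihood, the true theta = 0.3, the two priors and the
# seed are all arbitrary choices made for illustration.
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.3
x = rng.binomial(1, theta_true, size=5_000)

grid = np.linspace(0.01, 0.99, 99)        # discretized parameter space
p_prior = np.ones_like(grid)              # flat prior
q_prior = grid**19 * (1 - grid)           # Beta(20, 2)-shaped prior
p_prior /= p_prior.sum()
q_prior /= q_prior.sum()

i = np.argmin(np.abs(grid - theta_true))  # grid point closest to the true theta

for n in (10, 100, 1_000, 5_000):
    k = x[:n].sum()
    log_lik = k * np.log(grid) + (n - k) * np.log1p(-grid)
    lik = np.exp(log_lik - log_lik.max())   # rescale for numerical stability
    p_post = lik * p_prior / np.sum(lik * p_prior)
    q_post = lik * q_prior / np.sum(lik * q_prior)
    print(f"n = {n:5d}: p/q posterior ratio at theta = {grid[i]:.2f} "
          f"is {p_post[i] / q_post[i]:.3f}")
```

In this sketch the ratio starts far from 1 (the two priors disagree strongly near \theta = 0.3) and drifts toward 1 as the likelihood takes over; roughly speaking, at a fixed \theta far from the truth the ratio need not approach 1, which fits the intuition that forgetting happens where the posterior concentrates.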

Answer

Just a rough, but hopefully intuitive answer.

  1. Look at it from the log-space point of view:

    -\log P(\theta|x_1, \ldots, x_n) = -\log P(\theta) - \sum_{i=1}^n \log P(x_i|\theta) - C_n

    where C_n > 0 is a constant that depends on the data but not on the parameter (by Bayes’ theorem, C_n = -\log P(x_1, \ldots, x_n)), and where the sum over i uses the i.i.d. assumption on the observations. Hence, just concentrate on the part that determines the shape of your posterior, namely

    S_n = -\log P(\theta) -\sum_{i=1}^n \log P(x_i|\theta)

  2. Assume that there is a D>0 such that -\log P(\theta) \leq D for all \theta. This is reasonable, for example, for a discrete prior with full support on a finite parameter space.

  3. Since the terms -\log P(x_i|\theta) are all non-negative (for discrete distributions the probabilities are at most 1), S_n “will” grow (I’m skipping the technicalities here). But the contribution of the prior is bounded by D. Hence, the fraction contributed by the prior, which is at most D/S_n, shrinks toward zero as observations accumulate.

Rigorous proofs of course have to face the technicalities (and they can be very difficult), but the sketch above is, IMHO, the basic idea.
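
As a quick numerical illustration of point 3, here is a sketch assuming a finite parameter grid, a Bernoulli likelihood, and an arbitrary non-uniform prior with full support; it tracks the prior’s share of S_n at the grid point where S_n is smallest (the MAP estimate):

```python
# Sketch of point 3: on a finite parameter grid, the prior's contribution
# to S_n = -log P(theta) + sum_i -log P(x_i | theta) is bounded by D,
# while the likelihood part keeps growing. The grid, prior weights, true
# theta = 0.3 and the seed are arbitrary choices.
import numpy as np

rng = np.random.default_rng(2)
theta_grid = np.linspace(0.1, 0.9, 9)                   # finite parameter space
prior = np.array([1, 1, 1, 4, 8, 4, 1, 1, 1], float)    # non-uniform, full support
prior /= prior.sum()

D = -np.log(prior.min())               # D bounds -log P(theta) over the grid
x = rng.binomial(1, 0.3, size=1_000)   # data generated with theta = 0.3

neg_log_lik = np.zeros_like(theta_grid)
for n, xi in enumerate(x, start=1):
    neg_log_lik += -(xi * np.log(theta_grid) + (1 - xi) * np.log(1 - theta_grid))
    if n in (1, 10, 100, 1_000):
        S_n = -np.log(prior) + neg_log_lik     # shape of the negative log posterior
        j = S_n.argmin()                       # MAP estimate on the grid
        share = -np.log(prior[j]) / S_n[j]     # prior's fraction of S_n at the MAP
        print(f"n = {n:4d}: MAP theta = {theta_grid[j]:.1f}, "
              f"prior share = {share:.3f} (bound D/S_n = {D / S_n[j]:.3f})")
```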

Attribution
Source: Link, Question Author: bayesianOrFrequentist, Answer Author: Pedro A. Ortega
