Beta distribution on flipping a coin

Kruschke’s Bayesian book says, regarding the use of a beta distribution for flipping a coin,

For example, if we have no prior knowledge other than the knowledge
that the coin has a head side and a tail side, that’s tantamount to
having previously observed one head and one tail, which corresponds to
a = 1 and b = 1.

Why would no information be tantamount to having seen one head and one tail? Zero heads and zero tails seems more natural to me.

Answer

The quotation is a “logical sleight-of-hand” (great expression!), as noted by @whuber in comments to the OP. The only thing we can really say after seeing that the coin has a head and a tail is that both the events “head” and “tail” are not impossible. Thus we could discard a discrete prior which puts all of the probability mass on “head” or on “tail”. But this doesn’t lead, by itself, to the uniform prior: the question is much more subtle. Let’s first summarize a bit of background. We’re considering the Beta-Binomial conjugate model for Bayesian inference of the probability θ of heads of a coin, given n independent and identically distributed (conditionally on θ) coin tosses. From the expression of the posterior p(θ|x) when we observe x heads in n tosses:

p(θ|x) = Beta(x + α, n − x + β)

we can say that α and β play the roles of a “prior number of heads” and “prior number of tails” (pseudotrials), and α+β can be interpreted as an effective sample size. We could also arrive at this interpretation using the well-known expression for the posterior mean as a weighted average of the prior mean α/(α+β) and the sample mean x/n.
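
As a concrete illustration of this update, here is a minimal sketch, assuming Python with SciPy; the hyperparameters and the data are made-up values chosen purely for illustration:

```python
# Beta-Binomial conjugate update: prior Beta(alpha, beta), data = x heads in n tosses.
# All numbers below are illustrative, not taken from the question.
from scipy import stats

alpha, beta = 2.0, 2.0    # pseudotrials: "prior heads" and "prior tails"
x, n = 7, 10              # observed heads and total number of tosses

# Posterior is Beta(x + alpha, n - x + beta)
posterior = stats.beta(alpha + x, beta + (n - x))

# Posterior mean as a weighted average of prior mean and sample frequency
prior_mean = alpha / (alpha + beta)
sample_mean = x / n
w = (alpha + beta) / (alpha + beta + n)        # weight given to the prior
print(posterior.mean())                        # (alpha + x) / (alpha + beta + n)
print(w * prior_mean + (1 - w) * sample_mean)  # same value, up to rounding
```

The two printed numbers coincide, which is exactly the weighted-average reading of the posterior mean.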

Looking at p(θ|x), we can make two observations:

  1. since we have no prior knowledge about θ (maximum ignorance),
    we intuitively expect the effective sample size α+β to be
    “small”. If it were large, then the prior would be incorporating
    quite a lot of knowledge. Another way of seeing this is noting that
    if α and β are “small” with respect to x and n−x,
    the posterior won’t depend much on our prior, because
    x+α ≈ x and n−x+β ≈ n−x. We would expect that
    a prior which doesn’t incorporate a lot of knowledge must quickly
    become irrelevant in light of some data (see the numerical sketch
    after this list).
  2. Also, since μ_prior = α/(α+β) is the prior
    mean, and we have no prior knowledge about the distribution of
    θ, we would expect μ_prior = 0.5. This is an argument of
    symmetry – if we don’t know any better, we wouldn’t expect a
    priori
    that the distribution is skewed towards 0 or towards 1. The
    Beta distribution is

    f(θ|α,β) = [Γ(α+β) / (Γ(α)Γ(β))] θ^(α−1) (1−θ)^(β−1)

    This expression is only symmetric around θ=0.5 if
    α=β.
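
To put rough numbers on these two points, here is a small arithmetic sketch with made-up data (7 heads in 10 tosses): a symmetric prior with a small effective sample size barely moves the posterior mean away from the sample frequency, while one with a large effective sample size pulls it strongly towards 0.5.

```python
# Posterior mean (alpha + x) / (alpha + beta + n) for symmetric priors with
# small vs. large effective sample size, given hypothetical data: 7 heads in 10 tosses.
x, n = 7, 10
for a in (1.0, 50.0):                   # alpha = beta = a, so the prior mean is 0.5
    post_mean = (a + x) / (2 * a + n)   # mean of the posterior Beta(a + x, a + n - x)
    print(f"alpha = beta = {a:>4}: effective sample size {2 * a:>5}, "
          f"posterior mean {post_mean:.3f}")
# alpha = beta = 1  -> posterior mean 0.667, close to x/n = 0.7
# alpha = beta = 50 -> posterior mean 0.518, pulled towards the prior mean 0.5
```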

For these two reasons, whatever prior (belonging to the Beta family – remember, conjugate model!) we choose to use, we intuitively expect that α = β = c with c “small”. We can see that all three commonly used non-informative priors for the Beta-Binomial model share these traits, but other than that, they are quite different. And this is to be expected: “no prior knowledge”, or “maximum ignorance”, is not a precise scientific definition, so what kind of prior expresses “maximum ignorance”, i.e., what counts as a non-informative prior, depends on what you actually mean by “maximum ignorance”.

  1. We could choose a prior which says that all values of θ are
    equiprobable, since we don’t know any better. Again, a symmetry
    argument. This corresponds to α=β=1:

    f(θ|1,1) = [Γ(2) / (Γ(1)Γ(1))] θ^0 (1−θ)^0 = 1

    for θ ∈ [0,1], i.e., the uniform prior used by Kruschke. More
    formally, by writing out the expression for the differential entropy
    of the Beta distribution, you can see that it is maximized when
    α=β=1. Now, entropy is often interpreted as a measure of “the
    amount of information” carried by a distribution: the higher the
    entropy, the less information the distribution carries. Thus, you could
    use this maximum entropy principle to say that, inside the Beta family,
    the prior which contains the least information (maximum ignorance) is
    the uniform prior.

  2. You could choose another point of view, the one used by the OP, and
    say that no information corresponds to having seen no heads and no
    tails, i.e.,

    α = β = 0  ⇒  π(θ) ∝ θ^(−1) (1−θ)^(−1)

    The prior we obtain this way is called the Haldane prior. The
    function θ^(−1)(1−θ)^(−1) has a little problem – the
    integral over I = [0,1] is infinite, i.e., no choice of normalizing
    constant can turn it into a proper pdf.
    Actually, the Haldane prior is a proper pmf, which puts
    probability 0.5 on θ=0, 0.5 on θ=1 and 0
    probability on all other values for θ. However, let’s not get
    carried away – for a continuous parameter θ, priors which
    don’t correspond to a proper pdf are called improper priors.
    Since, as noted before, all that matters for Bayesian inference is
    the posterior distribution, improper priors are admissible, as long
    as the posterior distribution is proper. In the case of the Haldane
    prior, we can prove that the posterior pdf is proper if our sample
    contains at least one success and one failure. Thus we can only use
    the Haldane prior when we observe at least one head and one tail.

    There’s another sense in which the Haldane prior can be considered
    non-informative: the mean of the posterior distribution is now
    (α+x)/(α+β+n) = x/n, i.e., the
    sample frequency of heads, which is the frequentist MLE of
    θ for the Binomial model of the coin flip problem. Also, the
    credible intervals for θ correspond to the Wald confidence
    intervals. Since frequentist methods don’t specify a prior, one
    could say that the Haldane prior is noninformative, or corresponds
    to zero prior knowledge, because it leads to the “same” inference a
    frequentist would make.

  3. Finally, you could use a prior which doesn’t depend on the
    parametrization of the problem, i.e., the Jeffreys prior, which for
    the Beta-Binomial model corresponds to

    α = β = 1/2  ⇒  π(θ) ∝ θ^(−1/2) (1−θ)^(−1/2)

    thus with an effective sample size of 1. The Jeffreys prior has the
    advantage that it’s invariant under reparametrization of the
    parameter space. For example, the uniform prior assigns equal
    probability to all values of θ, the probability of the event
    “head”. However, you could decide to parametrize this model in terms
    of the log-odds λ = log(θ/(1−θ)) of the event “head”,
    rather than θ. What’s the prior which expresses “maximum
    ignorance” in terms of log-odds, i.e., which says that all possible
    log-odds for event “head” are equiprobable? It’s the Haldane prior,
    as shown in this (slightly cryptic) answer. The Jeffreys prior, by
    contrast, is invariant under all changes of metric. Jeffreys stated
    that a prior which doesn’t have this property is in some way
    informative, because it contains information on the metric you used
    to parametrize the problem. His prior doesn’t. (The sketch below
    compares the three priors numerically.)
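
To make the comparison concrete, here is a minimal sketch, again assuming Python with SciPy and made-up data of 7 heads in 10 tosses, of the posteriors obtained under the three priors discussed above:

```python
# Posteriors for hypothetical data (7 heads in 10 tosses) under the three priors.
# The Haldane prior (alpha = beta = 0) is improper, but its posterior Beta(x, n - x)
# is proper here because the data contain at least one head and one tail.
from scipy import stats

x, n = 7, 10
priors = {"uniform  (α = β = 1)  ": 1.0,
          "Jeffreys (α = β = 1/2)": 0.5,
          "Haldane  (α = β = 0)  ": 0.0}

for name, a in priors.items():
    post = stats.beta(a + x, a + (n - x))
    lo, hi = post.interval(0.95)        # central 95% credible interval
    print(f"{name}: posterior mean {float(post.mean()):.3f}, "
          f"95% interval ({float(lo):.3f}, {float(hi):.3f})")

# Within the Beta family, the uniform prior maximizes the differential entropy:
for a in (0.5, 1.0, 2.0):
    print(f"Beta({a}, {a}) differential entropy: {float(stats.beta(a, a).entropy()):.3f}")
```

Under the Haldane prior the posterior mean is exactly x/n = 0.7, matching the frequentist MLE, while the uniform and Jeffreys posteriors shrink it slightly towards 0.5; and among the symmetric Beta densities shown, Beta(1,1) attains the largest differential entropy (zero, against negative values for the others).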

To summarize, there’s not just one unequivocal choice for a noninformative prior in the Beta-Binomial model. What you choose depends on what you mean by zero prior knowledge, and on the goals of your analysis.

Attribution
Source: Link, Question Author: Hatshepsut, Answer Author: Community
