# Beta distribution on flipping a coin

Kruschke’s Bayesian book says, regarding the use of a beta distribution for flipping a coin:

> For example, if we have no prior knowledge other than the knowledge
> that the coin has a head side and a tail side, that’s tantamount to
> having previously observed one head and one tail, which corresponds to
> a = 1 and b = 1.

Why would no information be tantamount to having seen one head and one tail? Seeing 0 heads and 0 tails seems more natural to me.

The quotation is a “logical sleight-of-hand” (great expression!), as noted by @whuber in comments to the OP. The only thing we can really say, after seeing that the coin has a head and a tail, is that both the events “head” and “tail” are not impossible. Thus we could discard a discrete prior which puts all of the probability mass on “head” or on “tail”. But this doesn’t lead, by itself, to the uniform prior: the question is much more subtle. Let’s first of all summarize a bit of background. We’re considering the Beta-Binomial conjugate model for Bayesian inference of the probability $\theta$ of heads of a coin, given $n$ independent and identically distributed (conditionally on $\theta$) coin tosses. From the expression of $p(\theta|x)$ when we observe $x$ heads in $n$ tosses,

$$p(\theta|x)=\frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+x)\Gamma(\beta+n-x)}\,\theta^{\alpha+x-1}(1-\theta)^{\beta+n-x-1},$$
we can say that $\alpha$ and $\beta$ play the roles of a “prior number of heads” and “prior number of tails” (pseudotrials), and $\alpha+\beta$ can be interpreted as an effective sample size. We could also arrive at this interpretation using the well-known expression for the posterior mean as a weighted average of the prior mean $\frac{\alpha}{\alpha+\beta}$ and the sample mean $\frac{x}{n}$.
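Written out, this weighted-average form of the posterior mean is

$$E[\theta|x]=\frac{\alpha+x}{\alpha+\beta+n}=\frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta}+\frac{n}{\alpha+\beta+n}\cdot\frac{x}{n},$$

where the prior mean and the sample mean are weighted by the prior effective sample size $\alpha+\beta$ and the actual sample size $n$, respectively.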

Looking at $p(\theta|x)$, we can make two observations:

1. Since we have no prior knowledge about $\theta$ (maximum ignorance),
we intuitively expect the effective sample size $\alpha+\beta$ to be
“small”. If it were large, then the prior would be incorporating
quite a lot of knowledge. Another way of seeing this is noting that
if $\alpha$ and $\beta$ are “small” with respect to $x$ and $n-x$,
the posterior probability won’t depend a lot on our prior, because
$x+\alpha\approx x$ and $n-x+\beta\approx n-x$. We would expect that
a prior which doesn’t incorporate a lot of knowledge must quickly
become irrelevant in light of some data.
2. Also, since $\mu_{prior}=\frac{\alpha}{\alpha+\beta}$ is the prior
mean, and we have no prior knowledge about the distribution of
$\theta$, we would expect $\mu_{prior}=0.5$. This is a symmetry
argument: if we don’t know any better, we wouldn’t expect a priori
that the distribution is skewed towards 0 or towards 1. The pdf of the
Beta distribution is

$$p(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}.$$

This expression is symmetric around $\theta=0.5$ only if
$\alpha=\beta$: swapping $\theta$ and $1-\theta$ exchanges the two
exponents, so the density is invariant under $\theta\mapsto 1-\theta$
exactly when $\alpha=\beta$.

For these two reasons, whatever prior (belonging to the Beta family – remember, conjugate model!) we choose to use, we intuitively expect that $\alpha=\beta=c$ with $c$ “small”. We can see that all three commonly used non-informative priors for the Beta-Binomial model share these traits, but otherwise they are quite different. And this is to be expected: “no prior knowledge”, or “maximum ignorance”, is not a precise scientific definition, so which prior expresses “maximum ignorance”, i.e., which prior is non-informative, depends on what you actually mean by “maximum ignorance”.
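As a quick numerical illustration, here is a sketch (assuming `scipy` is available; the sample of $x=13$ heads in $n=20$ tosses is made up) comparing the posteriors obtained from the three priors described in the list below:

```python
from scipy import stats

# Hypothetical data: x heads observed in n tosses (made-up numbers).
n, x = 20, 13

# Three common "non-informative" Beta(alpha, beta) priors; the Haldane
# prior (alpha = beta = 0) is improper, so it is approximated here with
# a tiny epsilon.
priors = {
    "uniform, Beta(1, 1)":    (1.0, 1.0),
    "Jeffreys, Beta(.5, .5)": (0.5, 0.5),
    "Haldane (approx.)":      (1e-6, 1e-6),
}

for name, (a, b) in priors.items():
    posterior = stats.beta(a + x, b + n - x)  # conjugate update
    lo, hi = posterior.interval(0.95)         # central 95% credible interval
    print(f"{name:24s} mean = {posterior.mean():.4f}, "
          f"95% CI = ({lo:.4f}, {hi:.4f})")

print(f"MLE x/n = {x / n:.4f}")
```

With only 20 tosses the three posterior means (about 0.636, 0.643 and 0.650) already differ by little more than 0.01, illustrating the point above: priors with a small effective sample size are quickly dominated by the data.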

1. We could choose a prior which says that all values of $\theta$ are
equiprobable, since we don’t know any better. Again, a symmetry
argument. This corresponds to $\alpha=\beta=1$:

$$p(\theta)=1, \quad \theta\in[0,1],$$

i.e., the uniform prior used by Kruschke. More
formally, by writing out the expression for the differential entropy
of the Beta distribution, you can see that it is maximized when
$\alpha=\beta=1$. Now, in this context the entropy of a distribution is
interpreted as a measure of how little information it carries about
$\theta$: the higher the entropy, the less informative the prior. Thus,
you could use this maximum entropy principle to say that, inside the Beta
family, the prior which carries the least information (maximum ignorance)
is the uniform prior.

2. You could choose another point of view, the one used by the OP, and
say that no information corresponds to having seen no heads and no
tails, i.e., $\alpha=\beta=0$, which gives

$$p(\theta)\propto\theta^{-1}(1-\theta)^{-1}.$$
The prior we obtain this way is called the Haldane prior. The
function $\theta^{-1}(1-\theta)^{-1}$ has a little problem – the
integral over $I=[0, 1]$ is infinite, i.e., no matter what the
normalizing constant, it cannot be transformed into a proper pdf.
Actually, the Haldane prior is a proper pmf, which puts
probability 0.5 on $\theta=0$, 0.5 on $\theta=1$ and 0
probability on all other values for $\theta$. However, let’s not get
carried away – for a continuous parameter $\theta$, priors which
don’t correspond to a proper pdf are called improper priors.
Since, as noted before, all that matters for Bayesian inference is
the posterior distribution, improper priors are admissible, as long
as the posterior distribution is proper. In the case of the Haldane
prior, we can prove that the posterior pdf is proper if our sample
contains at least one success and one failure. Thus we can only use
the Haldane prior when we observe at least one head and one tail.

There’s another sense in which the Haldane prior can be considered
non-informative: the mean of the posterior distribution is now
$\frac{\alpha + x}{\alpha + \beta + n}=\frac{x}{n}$, i.e., the
sample frequency of heads, which is the frequentist maximum likelihood
estimate of $\theta$ for the Binomial model of the coin-flip problem. Also, the
credible intervals for $\theta$ correspond to the Wald confidence
intervals. Since frequentist methods don’t specify a prior, one
could say that the Haldane prior is noninformative, or corresponds
to zero prior knowledge, because it leads to the “same” inference a
frequentist would make.

3. Finally, you could use a prior which doesn’t depend on the
parametrization of the problem, i.e., the Jeffreys prior, which for
the Beta-Binomial model corresponds to $\alpha=\beta=1/2$, i.e.,

$$p(\theta)\propto\theta^{-1/2}(1-\theta)^{-1/2},$$

thus with an effective sample size of $\alpha+\beta=1$. The Jeffreys prior has the
advantage that it’s invariant under reparametrization of the
parameter space. For example, the uniform prior assigns equal
probability to all values of $\theta$, the probability of the event
“head”. However, you could decide to parametrize this model in terms
of the log-odds $\lambda=\log\left(\frac{\theta}{1-\theta}\right)$ of the event “head”,
instead of $\theta$. What’s the prior which expresses “maximum
ignorance” in terms of log-odds, i.e., which says that all possible
log-odds for event “head” are equiprobable? It’s the Haldane prior,