# How does a Bayesian update his belief when something with probability 0 happened?

Define $X:=$ “coin has probability 1 to land heads”. Assume the prior belief $P(X)= 1$. However, after tossing the coin once, it lands tails ($E:=$ “coin landed tails”).
How should a Bayesian update his beliefs in order to stay coherent?
$P(X|E)$ is undefined, as $P(E) = 0$. However, it seems to me that as his prior beliefs are quite implausible (of course probability 0 does not mean impossible) he should somehow be able to update his belief according to some rule.

Is this just a pathological case in which Bayesian updating does not work or am I unaware of a solution to this problem?

## Any posterior probability is valid in this case

This is an interesting question, which gets into the territory of the foundations of probability. There are a few possible approaches here, but for reasons that I will elaborate on soon, the approach I favour is to give a broader definition of conditional probability that is analogous to its definition when dealing with continuous random variables. (Details of this method are shown below.) In this particular case, this leads to the conclusion that the Bayesian can hold any posterior belief about $X$, and this yields a coherent set of beliefs (notwithstanding that they have observed an event that they believe to have probability zero).

The advantage of this approach is that it gives a well-defined posterior distribution, and allows the Bayesian to update their beliefs conditional on observing an event that was stipulated to occur with probability zero. The posterior is updated essentially arbitrarily (any posterior probability is equally coherent), but that flexibility is unsurprising given what has occurred. In this case, different Bayesians with the same prior beliefs could legitimately come to different posterior conclusions, owing to the fact that they have all observed an event with zero probability a priori.

Conditional probability for continuous random variables: When we are dealing with continuous random variables, the conditional probability function is defined through the Radon-Nikodym derivative, and essentially just requires the function to satisfy the law of joint probability. If $X$ and $E$ were continuous random variables (rather than discrete events) in a probability space $(\Omega, \mathscr{G}, P)$ then we would define the conditional probability function $p(x|e)$ as any non-negative measurable function that satisfies the integral equation:

$$p(x) = \int \limits_\mathscr{E} p(x|e) \ dP(e) \quad \quad \quad \text{for all } x \in \mathscr{X} \in \mathscr{G}.$$

Since $p(x)$ is also defined via the Radon-Nikodym derivative, this implicitly means that $p(x|e)$ can be any non-negative measurable function that satisfies the integral equation:

$$\mathbb{P}(X \in \mathcal{A}) = \int \limits_\mathcal{A} \int \limits_\mathscr{E} p(x|e) \ dP(e) \ dx \quad \quad \quad \text{for all } \mathcal{A} \in \mathscr{G}.$$

This gives a non-unique solution for the conditional probability function, though in practice, every solution is “almost surely” equivalent (i.e., they differ only on a set of outcomes with probability zero) so there is no problem with the non-uniqueness.
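To make the “almost surely equivalent” point concrete, here is a minimal numerical sketch (the densities `p1` and `p2` and all variable names are invented for this illustration): two candidate densities for a uniform random variable that differ only at a single point assign the same probability to every event, up to a discretisation error that vanishes as the grid is refined.

```python
# Two candidate densities for X ~ Uniform(0, 1) that differ only at the
# single point x = 0.5 -- a set with probability zero.
def p1(x):
    return 1.0

def p2(x):
    return 99.0 if x == 0.5 else 1.0

# Riemann-sum approximation of P(X in [0, 1]) under each density.  The
# grid contains x = 0.5 exactly, yet that single differing term only
# contributes (99 - 1)/n, which vanishes as the grid is refined.
n = 1_000_000
xs = [i / n for i in range(n + 1)]   # xs[500000] == 0.5 exactly
dx = 1.0 / n
prob1 = sum(p1(x) for x in xs) * dx
prob2 = sum(p2(x) for x in xs) * dx
print(prob1, prob2)  # both within about 1e-4 of 1.0
```

Both sums agree to within $98/n$, illustrating why the non-uniqueness of the Radon-Nikodym derivative causes no practical problem.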

Defining conditional probability for discrete events: The standard definition for conditional probability for discrete events is the well-known ratio formula, where the denominator is the probability of the conditioning event. Obviously, in the case where the conditioning event has zero probability, this object is undefined. The obvious solution here is to broaden the definition in a manner that is analogous to the method used in the continuous case. That is, we define the conditional probability pair $\mathbb{P}(X|E)$ and $\mathbb{P}(X|\bar{E})$ as any pair of values between zero and one that satisfy the equation:

$$\mathbb{P}(X) = \mathbb{P}(X|E) \times \mathbb{P}(E) + \mathbb{P}(X|\bar{E}) \times (1-\mathbb{P}(E)).$$

In the case stipulated in the question we have the prior belief $\mathbb{P}(X) = 1$ and the sampling distribution $\mathbb{P}(E|X) = 0$, which leads to $\mathbb{P}(E) = 0$. Substituting these values into the above equation gives:

$$1 = \mathbb{P}(X|E) \times 0 + \mathbb{P}(X|\bar{E}) \times 1.$$

We can see that this equation is satisfied by taking $\mathbb{P}(X|\bar{E}) = 1$ and any $0 \leqslant \mathbb{P}(X|E) \leqslant 1$. Thus, the (posterior) conditional probability $\mathbb{P}(X|E)$ may coherently be any value between zero and one. When we say that this is “coherent” we simply mean that the posterior probability is not inconsistent with the other stipulated probabilities in the problem (i.e., the prior and sampling probabilities).
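As a quick sanity check, the following sketch (variable names are invented here) verifies numerically that the law of total probability is satisfied no matter which value in $[0,1]$ we pick for $\mathbb{P}(X|E)$, precisely because that term is multiplied by $\mathbb{P}(E) = 0$:

```python
import random

# Stipulated probabilities from the question.
prior_X = 1.0                  # P(X) = 1
prob_E = 0.0                   # P(E) = 0
post_X_given_not_E = 1.0       # P(X | not-E) = 1

# Law of total probability:
#   P(X) = P(X|E) * P(E) + P(X|not-E) * (1 - P(E))
# Any choice of P(X|E) in [0, 1] satisfies it, since its coefficient is zero.
for _ in range(1000):
    post_X_given_E = random.random()    # arbitrary candidate posterior
    total = post_X_given_E * prob_E + post_X_given_not_E * (1 - prob_E)
    assert total == prior_X             # holds exactly, every time
```

Every randomly chosen posterior passes the check, which is just the algebraic point above restated in code.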

Why this approach makes the most sense: It is entirely possible that a Bayesian analysis could involve observation of a discrete event that has zero probability stipulated in the prior distribution. For example, in a standard model of coin-flipping, we stipulate a Bernoulli distribution for the heads/tails outcome, but it is possible that the coin could come to rest on its edge (thus being neither heads nor tails). Brains should not explode in this case, and so it is incumbent on Bayesian reasoning to have a well-defined way of proceeding when this occurs.

The major advantage of the approach I have outlined is that it always leads to at least one allowable value for the posterior probability (i.e., the posterior probability is well-defined). The posterior probability is not uniquely defined, but that is a natural offshoot of the fact that there are several values that are equally coherent with the zero-probability sampling observation. This approach means that the Bayesian is free to stipulate any posterior probability, and this is as coherent as any other. (Bear in mind that when we say “coherent” here, we are talking about coherence with a prior belief that stipulated zero probability for a discrete event that actually happened, so coherence with that is not a high bar!)

There is another major benefit to this approach, which is that it allows the Bayesian to update his or her beliefs in response to observing an event that had zero sampling probability under the prior, and in particular, the Bayesian can now revise his or her beliefs so that they no longer ascribe zero probability to this event. In the example you give, the Bayesian had a prior belief that $X$ is true almost surely, but then observed an event with zero sampling probability conditional on $X$. Now the Bayesian is free to update his or her belief to a posterior probability for $X$ that is not one (and so a corresponding posterior probability for $\bar{X}$ that is not zero). So, in essence, the Bayesian can now say “Oh shit! That was a silly prior! Let me update my belief in that event so that it no longer occurs almost surely!” Moreover, this is not some ad hoc change, but a legitimate “coherent” updating done under Bayes’ theorem.