In Part 1, I explored how Bayesian updating operates when there are two discrete possibilities. I now investigate how Bayesian updating operates with one continuous parameter.
This example is from Chapter 2 of ‘Statistical Rethinking’ by Richard McElreath. The premise can be paraphrased as follows:
Suppose you have a globe representing the Earth. It is small enough to hold. You are curious how much of the globe is covered in water. You will toss the globe in the air. When you catch it, you will record whether your right index finger is on land or water. Repeat and record your results.
The data story we use assumes that each toss is independent of the last and that our prior is a uniform distribution, according to the principle of indifference. In other words, all fractions of water, p, are considered equally likely. In reality, a globe that represents our earth must have some water and some land, based on our experience, so p can’t exactly equal 1 or 0. But for now, let us assume uniformity:
The chart here shows a probability density, unlike the previous case, which showed probability mass. Still, the total probability, represented by the area under the blue line, equals one.
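As a quick numeric sketch (my own illustration in plain Python, not code from the book), the uniform prior can be represented on a grid of candidate values of p:

```python
N = 1000
dp = 1.0 / N
p = [(i + 0.5) * dp for i in range(N)]   # midpoints of N equal slices of [0, 1]
prior = [1.0 for _ in p]                 # uniform density: height 1 everywhere

# It is the *area* under the density (here via the midpoint rule),
# not the sum of its values, that equals 1.
area = sum(d * dp for d in prior)
```

The grid is just a convenience for checking areas; the analysis below stays analytical.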
We make our first observation, and it is water, or W. This allows us to calculate a posterior distribution, but how?
Bayesian updating uses the following:
Posterior = Likelihood * Prior / Average Likelihood
In this case, the Posterior, Likelihood and Prior represent continuously varying quantities rather than discrete categories as in Part 1. I’ll touch on Average Likelihood later.
The Prior here is the uniform distribution, valued at 1. The Likelihood is the probability of observing W given a particular value of p. Since p is defined as the fraction of water, the Likelihood is simply p. If the observation were L, for land, the Likelihood would be (1-p), since there are only two possibilities.
Average Likelihood is a single number used to rescale the posterior so that its total probability equals one. It is the integral of Likelihood * Prior over all possible values of p. In this case, it would be:
Integral(from p=0, to p=1)[p * 1]dp = 0.5
Thus, our posterior distribution after observing W is:
Posterior = p * 1 / 0.5 = 2p
This looks like the following:
Notice that the probability density exceeds 1. This is allowed, since it is the integral, not the value of the curve that must equal 1.
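A quick numeric check of this point (my own sketch): the density 2p on a grid still integrates to one even though its peak reaches 2:

```python
N = 1000
dp = 1.0 / N
p = [(i + 0.5) * dp for i in range(N)]   # midpoints of a grid on [0, 1]
posterior = [2 * x for x in p]           # posterior density after one W: 2p

area = sum(d * dp for d in posterior)    # ≈ 1: total probability is preserved
peak = max(posterior)                    # ≈ 2: the density itself may exceed 1
```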
We’re not done yet! Let’s say we toss the globe again but this time observe land, L. What do we do? With Bayesian updating, the posterior becomes the prior for subsequent observations.
Now, the Prior is 2p, the Likelihood is (1-p) and the Average Likelihood is:
Integral(from p=0, to p=1)[(1-p) * 2p]dp = 1/3
Thus, our updated Posterior after observing W L is:
Posterior = (1-p) * 2p / (1/3) = 6p*(1-p)
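This second update can be sketched numerically as well (again my own plain-Python illustration):

```python
N = 1000
dp = 1.0 / N
p = [(i + 0.5) * dp for i in range(N)]

prior = [2 * x for x in p]               # the posterior after W becomes the prior
likelihood = [1 - x for x in p]          # probability of observing L given each p

# Average Likelihood: the integral of Likelihood * Prior over p, which is 1/3
avg_like = sum(l * q * dp for l, q in zip(likelihood, prior))
posterior = [l * q / avg_like for l, q in zip(likelihood, prior)]  # ≈ 6p(1-p)
```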
This analysis can continue until, say, 9 observations are made:
W L W W W L W L W
In all, 6 W’s and 3 L’s are observed. Following the pattern above, each W contributes a factor of p and each L a factor of (1-p). The integral of p^6 * (1-p)^3 over all p works out to 1/840, so the normalization factor is 840.
Posterior = 840 * p^6 * (1-p)^3
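The 840 can be double-checked (my own sketch): the integral of p^6 * (1-p)^3 is a Beta-function integral, equal to 6! * 3! / 10! = 1/840, and a numerical integral agrees:

```python
from math import factorial

N = 1000
dp = 1.0 / N
p = [(i + 0.5) * dp for i in range(N)]

# Six W's contribute a factor of p each, three L's a factor of (1-p) each.
unnormalized = [x**6 * (1 - x)**3 for x in p]
numeric = sum(u * dp for u in unnormalized)          # numerical integral

exact = factorial(6) * factorial(3) / factorial(10)  # Beta integral = 1/840
```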
The distribution has become much more peaked and now looks very similar to a Gaussian.
In principle, this method scales up to multiple parameters and much more complicated models. In practice, however, analytical solutions like the ones here quickly become intractable, and numerical methods are used instead.
Grid approximation is one such method: each parameter is represented by a finite number of grid points (100-200 is usually enough) rather than treated as strictly continuous. With multiple parameters, however, the total size of the grid scales exponentially, so other techniques become important.
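For this one-parameter example, grid approximation looks like the following (my own Python sketch, not code from the book): update toss by toss on a grid and compare with the analytical answer.

```python
N = 200                                      # 200 grid points is plenty here
dp = 1.0 / N
p = [(i + 0.5) * dp for i in range(N)]       # grid of candidate water fractions
posterior = [1.0] * N                        # start from the uniform prior

for obs in "WLWWWLWLW":                      # the nine tosses from the example
    like = [x if obs == "W" else 1 - x for x in p]
    posterior = [l * q for l, q in zip(like, posterior)]
    area = sum(q * dp for q in posterior)
    posterior = [q / area for q in posterior]    # renormalize after every toss

# Agrees closely with the analytical result 840 * p^6 * (1-p)^3
analytic = [840 * x**6 * (1 - x)**3 for x in p]
err = max(abs(a - b) for a, b in zip(posterior, analytic))
```

Note that the order of the renormalizations doesn't matter: normalizing once at the end gives the same curve as normalizing after every toss.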