Hypotheses, signals, and Bayesian learning
At the heart of the Bayesian picture of how we learn about the world, there are the hypotheses that interest us—for the purposes of illustration, let’s say there are just two, Hypothesis 1 and Hypothesis 2—and there are signals we receive or elicit from the world that give us some information about which of Hypothesis 1 or Hypothesis 2 is true—let’s say there is Signal 1 and Signal 2. So, for instance, perhaps Hypothesis 1 says your friend Jane went to Aruba on her recent holiday and Hypothesis 2 says she went to Bermuda, while Signal 1 is your friend Kayla telling you Jane went to Aruba and Signal 2 is Kayla telling you Jane went to Bermuda.
To find out how confident you should be in Hypothesis 1 or 2 after you receive one of these signals, you need to have two things: priors and likelihoods. The priors specify how confident you are in each hypothesis prior to receiving either signal: how confident are you that Jane went to Aruba before Kayla says anything, and how confident that she went to Bermuda? The likelihoods specify how likely it is you’ll receive a particular signal under the assumption that a particular hypothesis is true: how likely is Kayla to say Jane went to Aruba (Signal 1) if Jane did in fact go there (Hypothesis 1)? How likely is Kayla to say Jane went to Bermuda (Signal 2) if Jane in fact went to Aruba (Hypothesis 1)? And so on. Throughout, I write P for priors and σ for likelihoods. With P and σ in hand, we can calculate your posteriors, which specify how confident you should become in a hypothesis after receiving a particular signal. We calculate them using Bayes’ Theorem. For example:1
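Abbreviating the two hypotheses as H_1 and H_2 and the first signal as S_1:

$$P(H_1 \mid S_1) \;=\; \frac{P(H_1)\,\sigma(S_1 \mid H_1)}{P(H_1)\,\sigma(S_1 \mid H_1) + P(H_2)\,\sigma(S_1 \mid H_2)}$$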
Now, sometimes, we get to choose from which set of signals we’ll receive our information. For instance, perhaps there’s an alternative set of signals, Signal 3 and Signal 4. Perhaps Signal 3 is your other friend Loretta saying Jane went to Aruba and Signal 4 is Loretta saying Jane went to Bermuda. If you ask Kayla, you’ll receive either Signal 1 or 2; if you ask Loretta, you’ll receive Signal 3 or 4. How might we choose between these two ways of inquiring or experimenting or eliciting signals from the world? How might we choose how to investigate the world to find out about hypotheses that interest us?
This is the central question in the literature on the value of information—also called the value of evidence or the value of knowledge. I’ve written about it quite a lot in the past, and I posted this rather long set of notes to PhilPapers, which goes over much of the standard material. But I didn’t include anything in those notes on one of the earliest results: David Blackwell’s Informativeness Theorem, from his papers ‘Comparison of Experiments’ (1951) and ‘Equivalent Comparisons of Experiments’ (1953). It’s that result I wish to describe and prove here.2 Fair warning: this post is going to get long—the actual proof of Blackwell’s result is reasonably brief, but I want to motivate the various parts of the theorem so that it’s clear what it says.
The value of learning
The first question to ask is this: when an experiment will result in receiving a signal from a given set, what value does that experiment have? There are two sorts of value to which we might appeal: epistemic and pragmatic.
The epistemic value of learning
On the epistemic side, we evaluate our credences using measures of epistemic value that we take to be strictly proper. So, suppose W is a finite set of possible states of the world and Δ(W) is the set of probability functions over W—that is, each P in Δ(W) assigns a credence between 0 and 1 inclusive to each w in W, and those credences add up to 1. Then an epistemic utility function EU (on W) takes P from Δ(W) and w from W and returns EU(P, w), which is a real number, ∞, or -∞ that measures the epistemic value of P when the world is in state w. We might think of EU(P, w) as measuring the accuracy of P at w—how close it lies to the probability function that assigns credence 1 to w and 0 to all other states in W—but we don’t need to; we can take it to measure something else. I’ll remain agnostic here—I simply assume that each probability function P expects itself to have strictly greater epistemic value than any other probability function P’ in Δ(W)—that is what it means to say that EU is strictly proper.
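In symbols: EU is strictly proper just in case, for all P and P’ in Δ(W) with P’ ≠ P,

$$\sum_{w \in W} P(w)\, EU(P, w) \;>\; \sum_{w \in W} P(w)\, EU(P', w).$$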
To illustrate, here is the so-called Brier score for a probability function P defined on a set of worlds W = {w_1, …, w_n} that assigns credence p_k to world w_k:
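In one standard normalization, and the one that fits the numbers in the example below:

$$EU_{\mathrm{Brier}}(P, w_k) \;=\; 1 - \tfrac{1}{2}\sum_{j=1}^{n} (p_j - \delta_{jk})^2,$$

where δ_{jk} is 1 if j = k and 0 otherwise.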
The Brier score is strictly proper.
We then evaluate an experiment by looking at the epistemic value of the probabilities to which it might give rise. We look at each possible state of the world w and each possible signal s we might receive if we were to conduct the experiment in question; we look at the posterior probability function P(- | s) we’d have if we were to receive signal s and update on it in line with Bayes’ Theorem; we take the epistemic value of P(- | s) at the state of the world w, that is, EU(P(- | s), w); we weight it by our probability in w and s, that is, P(w)σ(s | w); and we sum up these probability-weighted epistemic values. That gives the expected epistemic utility of an experiment that gives us signals from the set S. Here it is in symbols:
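$$\sum_{w \in W} \sum_{s \in S} P(w)\,\sigma(s \mid w)\, EU(P(\cdot \mid s), w)$$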
The pragmatic value of learning
On the pragmatic side, we evaluate our credences by fixing a decision problem we’ll face with them, and then taking their pragmatic utility to be the utility those credences will obtain for us if we use them to make that decision. So, for instance, if I face a choice between act a and act b, and the expected utility of a by the lights of my probability function P is greater than the expected utility of b by those lights, then the pragmatic utility of P at w is the utility of a at w.
In the context of Blackwell’s theorem, we always assume that a decision problem is specified by a set D of pure options, each of which specifies a utility for each state of the world, and then we take the available options to be all mixed acts over D. This is Δ(D), the set of probability functions over D. We might think of each α in Δ(D) as a randomizing procedure that gives the pure act a from D with probability α(a). The utility of α at w is the expected utility at w of the act picked by this randomizing procedure. That is,
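$$u(\alpha, w) \;=\; \sum_{a \in D} \alpha(a)\, u(a, w)$$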
Then, as in the epistemic case, we evaluate an experiment by looking at the pragmatic utility of the probabilities to which it might give rise. We look at each possible state of the world w and each possible signal s we might receive if we were to conduct the experiment in question; we look at the posterior probability function P(- | s) we’d have if we were to receive signal s and update on it in line with Bayes’ Theorem; we take the pragmatic value of P(- | s) at the state of the world w, that is, the utility of the mixed act α over D we’d choose if we were to maximize expected utility from the point of view of P(- | s); we weight it by our prior probability in w and s, that is P(w)σ(s | w); and we sum up these prior-probability-weighted pragmatic utilities. That gives the expected pragmatic utility of an experiment that gives us signals from the set S.
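In symbols:

$$\sum_{w \in W} \sum_{s \in S} P(w)\,\sigma(s \mid w)\, u\big(\alpha^{P(\cdot \mid s)}, w\big)$$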
where α^P(- | s) is the mixed act from Δ(D) that P(- | s) expects to be best.3
Our example
Let’s see how all this works in the case of Jane’s holiday destination and Kayla’s and Loretta’s testimony about it—you can easily skip this section, if you wish. It’s simply an illustration of the definitions that went before.
Write A for Jane going to Aruba and B for her going to Bermuda; write KA for Kayla saying she went to Aruba and LA for Loretta saying that; write KB for Kayla saying she went to Bermuda and LB for Loretta saying that. Then suppose your priors and likelihoods are as follows:
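Take the priors to be uniform, so that P(A) = P(B) = 1/2 (that is what the posteriors and expected values quoted below require), and the likelihoods to be:

$$\sigma(KA \mid A) = \tfrac{2}{3},\quad \sigma(KB \mid A) = \tfrac{1}{3},\quad \sigma(LA \mid A) = \tfrac{5}{9},\quad \sigma(LB \mid A) = \tfrac{4}{9}$$

$$\sigma(KA \mid B) = \tfrac{1}{3},\quad \sigma(KB \mid B) = \tfrac{2}{3},\quad \sigma(LA \mid B) = \tfrac{4}{9},\quad \sigma(LB \mid B) = \tfrac{5}{9}$$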
So, for instance, if Jane went to Aruba, it’s 2/3 likely Kayla will say she went to Aruba and 1/3 likely Kayla will say she went to Bermuda, while it’s 5/9 likely Loretta will say she went to Aruba and 4/9 likely Loretta will say she went to Bermuda. And something similar with the numbers switched if Jane went to Bermuda.
So, by Bayes’ Theorem:
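With the uniform priors assumed above:

$$P(A \mid KA) = P(B \mid KB) = \tfrac{2}{3}, \qquad P(B \mid KA) = P(A \mid KB) = \tfrac{1}{3}$$

$$P(A \mid LA) = P(B \mid LB) = \tfrac{5}{9}, \qquad P(B \mid LA) = P(A \mid LB) = \tfrac{4}{9}$$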
The epistemic value in our example
Supposing we use the Brier score:
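$$EU(P(\cdot \mid KA), A) = EU(P(\cdot \mid KB), B) = 1 - \tfrac{1}{2}\Big[\big(\tfrac{1}{3}\big)^2 + \big(\tfrac{1}{3}\big)^2\Big] = \tfrac{8}{9}, \qquad EU(P(\cdot \mid KA), B) = EU(P(\cdot \mid KB), A) = \tfrac{5}{9}$$

$$EU(P(\cdot \mid LA), A) = EU(P(\cdot \mid LB), B) = \tfrac{65}{81}, \qquad EU(P(\cdot \mid LA), B) = EU(P(\cdot \mid LB), A) = \tfrac{56}{81}$$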
And so the expected utility of asking Kayla is:
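$$\tfrac{1}{2}\cdot\tfrac{2}{3}\cdot\tfrac{8}{9} \;+\; \tfrac{1}{2}\cdot\tfrac{1}{3}\cdot\tfrac{5}{9} \;+\; \tfrac{1}{2}\cdot\tfrac{1}{3}\cdot\tfrac{5}{9} \;+\; \tfrac{1}{2}\cdot\tfrac{2}{3}\cdot\tfrac{8}{9}$$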
which is:
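$$\tfrac{21}{27} \;=\; \tfrac{7}{9}$$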
which is around 0.777.
While the expected utility of asking Loretta is:
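$$\tfrac{1}{2}\cdot\tfrac{5}{9}\cdot\tfrac{65}{81} \;+\; \tfrac{1}{2}\cdot\tfrac{4}{9}\cdot\tfrac{56}{81} \;+\; \tfrac{1}{2}\cdot\tfrac{4}{9}\cdot\tfrac{56}{81} \;+\; \tfrac{1}{2}\cdot\tfrac{5}{9}\cdot\tfrac{65}{81}$$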
which is:
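$$\tfrac{549}{729} \;=\; \tfrac{61}{81}$$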
which is around 0.753.
So, perhaps unsurprisingly, asking Kayla is better than asking Loretta.
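If you want to check the arithmetic, here is a quick way to do so in Python, under the assumptions flagged above (uniform priors and that normalization of the Brier score); the variable names are just mine:

```python
# Check the two expected epistemic values, assuming uniform priors P(A) = P(B) = 1/2
# and the Brier accuracy measure EU(P, w) = 1 - (1/2) * sum_j (p_j - v_w(j))^2,
# where v_w assigns 1 to w and 0 to every other world.

from fractions import Fraction as F

priors = {"A": F(1, 2), "B": F(1, 2)}

likelihoods = {  # sigma(signal | world) for each informant
    "Kayla":   {("KA", "A"): F(2, 3), ("KB", "A"): F(1, 3),
                ("KA", "B"): F(1, 3), ("KB", "B"): F(2, 3)},
    "Loretta": {("LA", "A"): F(5, 9), ("LB", "A"): F(4, 9),
                ("LA", "B"): F(4, 9), ("LB", "B"): F(5, 9)},
}

def brier(credences, world):
    # Accuracy version of the Brier score of the credences at the given world.
    return 1 - F(1, 2) * sum((credences[w] - (1 if w == world else 0)) ** 2
                             for w in priors)

def expected_epistemic_value(sigma):
    # Sum over worlds w and signals s of P(w) * sigma(s | w) * EU(P(. | s), w).
    signals = {s for (s, _) in sigma}
    total = F(0)
    for s in signals:
        prob_s = sum(priors[w] * sigma[(s, w)] for w in priors)              # P(s)
        posterior = {w: priors[w] * sigma[(s, w)] / prob_s for w in priors}  # P(w | s)
        total += sum(priors[w] * sigma[(s, w)] * brier(posterior, w) for w in priors)
    return total

for informant, sigma in likelihoods.items():
    value = expected_epistemic_value(sigma)
    print(informant, value, float(value))
# Kayla: 7/9 = 0.777...; Loretta: 61/81 = 0.753...
```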
The pragmatic value in our example
Now suppose you face the following decision problem:
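For concreteness (the particular payoffs here are just one illustrative choice that delivers the verdicts described below), suppose: Bet 1 pays 2 if Jane went to Aruba and -3 if she went to Bermuda; Bet 2 pays -3 if she went to Aruba and 2 if she went to Bermuda; and Decline, which means taking neither bet, pays 0 either way.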
Then P(- | KA) leads you to choose Bet 1, P(- | KB) leads you to choose Bet 2, P(- | LA) and P(- | LB) both lead you to decline both bets (i.e., Decline). So the expected pragmatic utility of asking Kayla is:
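$$\tfrac{1}{3}\cdot 2 \;+\; \tfrac{1}{6}\cdot(-3) \;+\; \tfrac{1}{6}\cdot(-3) \;+\; \tfrac{1}{3}\cdot 2 \;=\; \tfrac{1}{3}$$

(with the illustrative payoffs above; the weights here are P(w)σ(s | w) for the four world–signal pairs, and the utilities are those of the bet chosen after each signal).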
And the expected pragmatic utility of asking Loretta is 0. So, again, it is better to ask Kayla when faced with that decision problem. As we’ll see below, for any decision problem you might face, asking Kayla is at least as good as asking Loretta.
Blackwell’s Theorem
There is something unsurprising about the result that asking Kayla is better than asking Loretta, if you want to improve your credences concerning where Jane went for her holiday. If Jane went to Aruba, Kayla is 2/3 (or 6/9) likely to say so, while Loretta is only 5/9 likely to do so; and if Jane went to Bermuda, Kayla is 2/3 (or 6/9) likely to say so, while Loretta is only 5/9 likely to do so. Kayla just seems a more reliable indicator of the truth. Blackwell’s Theorem identifies a way in which Loretta’s testimony is related to Kayla’s, and shows that this guarantees that Kayla’s testimony is indeed at least as valuable as Loretta’s.
Garbling signals
Let us suppose that Loretta’s testimony is based on Kayla’s testimony in the following way: Loretta has no direct information about Jane’s destination; she only knows Kayla’s testimony; she takes Kayla’s testimony and relays it on, but imperfectly, just as Kayla takes Jane’s destination and relays it on, but imperfectly. Recall that, if Jane went to Aruba, it’s 2/3 likely that Kayla says she did; and let us suppose that, if Kayla says Jane went to Aruba, then it’s 2/3 likely that Loretta says she did. Similarly, we know that, if Jane went to Bermuda, it’s 2/3 likely that Kayla says she did; and let us suppose that, if Kayla says Jane went to Bermuda, then it’s 2/3 likely that Loretta says she did. And so, if Jane went to Aruba, the probability Loretta says she did is the probability Kayla says she went to Aruba, given she did, multiplied by the probability Loretta says she went to Aruba given Kayla says she did, plus the probability Kayla says she went to Bermuda, given she went to Aruba, multiplied by the probability Loretta says she went to Aruba given Kayla says she went to Bermuda. That is, if γ gives the likelihood of Loretta saying something given Kayla said something, and if we assume that Loretta has no information about Jane’s destination other than through Kayla, then
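$$\sigma(LA \mid A) \;=\; \gamma(LA \mid KA)\,\sigma(KA \mid A) + \gamma(LA \mid KB)\,\sigma(KB \mid A) \;=\; \tfrac{2}{3}\cdot\tfrac{2}{3} + \tfrac{1}{3}\cdot\tfrac{1}{3} \;=\; \tfrac{5}{9}$$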
And similarly for σ(LB | A), σ(LA | B), and σ(LB | B).
If we assume all of this, then we do indeed recover the likelihoods of Loretta’s testimony that we specified above. Now, of course, this isn’t to say we’ve thereby discovered that Loretta’s testimony is actually based on Kayla’s in this way. It’s just to say that the likelihoods for Loretta’s testimony are exactly as they would be were they based on Kayla’s testimony in this way. They are as if this were all so. But the true story might be something quite different: perhaps Loretta knew of Jane’s destination directly, but is just a bit less reliable at remembering and reporting it.
In this sort of case, where there is γ for which those identities hold, we say that Loretta’s testimony is a garbling of Kayla’s. That is, it is as if it is a garbled version of Kayla’s testimony.
In general, a set of signals S’ is a garbling of a set of signals S if there is γ such that, for each s’ in S’ and w in W,
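$$\sigma(s' \mid w) \;=\; \sum_{s \in S} \gamma(s' \mid s)\,\sigma(s \mid w),$$

where, for each s in S, γ(- | s) is a probability function over S’.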
Accessible acts
Blackwell’s Theorem also includes a further concept, which is interesting in itself, but largely used to ease the proof. To understand it, we need to think about why new information is useful to us when we know we’re going to act after we receive it. Here’s one thing you can do if you are going to receive new information: you can make conditional plans for how you’ll act, where the conditions are the different pieces of information you might receive. So, for instance, if I ask Kayla where Jane went, I can make a conditional plan: if Kayla says Aruba, I’ll choose this way; if Kayla says Bermuda, I’ll choose this way. Recall, it’s mixed acts between which we’re choosing in Blackwell’s theorem. Given a decision problem D, a mixed act α on D conditional on a set of signals S takes each s in S and returns a probability function α(- | s) over the possible acts in the decision problem D. And a mixed act λ on D conditional on the set of possible worlds W takes each w in W and returns a probability function λ(- | w) over the possible acts in D. We say that a set of signals S makes a mixed act λ on D conditional on W accessible if there is a mixed act α on D conditional on S such that the following holds: for all a in D and w in W,
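$$\lambda(a \mid w) \;=\; \sum_{s \in S} \alpha(a \mid s)\,\sigma(s \mid w)$$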
And we write Λ^D_S for the set of such λ.
The theorem
Then we can state Blackwell’s Theorem as follows:
Blackwell’s Theorem Fix your prior P, your likelihoods σ, and your utility u. The following are equivalent:
(i) S’ is a garbling of S. That is, there is γ such that, for each s’ in S’ and w in W,
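$$\sigma(s' \mid w) \;=\; \sum_{s \in S} \gamma(s' \mid s)\,\sigma(s \mid w)$$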
(ii) For all decision problems D, the expected pragmatic utility of S when faced with D is at least as great as the expected pragmatic utility of S’ when faced with D. That is,
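$$\sum_{w \in W} \sum_{s \in S} P(w)\,\sigma(s \mid w)\, u\big(\alpha^{P(\cdot \mid s)}, w\big) \;\geq\; \sum_{w \in W} \sum_{s' \in S'} P(w)\,\sigma(s' \mid w)\, u\big(\alpha^{P(\cdot \mid s')}, w\big)$$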
And we can also add a third equivalent, which Blackwell didn’t consider, but which is related to something that Morris DeGroot investigated:
(iii) For all priors P and all strictly proper measures of epistemic value EU, the expected epistemic value of S is at least as great as the expected epistemic value of S’. That is,
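$$\sum_{w \in W} \sum_{s \in S} P(w)\,\sigma(s \mid w)\, EU(P(\cdot \mid s), w) \;\geq\; \sum_{w \in W} \sum_{s' \in S'} P(w)\,\sigma(s' \mid w)\, EU(P(\cdot \mid s'), w)$$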
Finally, we include the fourth equivalent, which is given to ease the proof:
(iv) For all decision problems D, and all mixed acts λ on D conditional on W, if S’ makes λ accessible to you, then S makes λ accessible to you. That is,
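$$\Lambda^D_{S'} \;\subseteq\; \Lambda^D_S \quad \text{for every decision problem } D.$$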
The proof of Blackwell’s Theorem
We will prove (i) iff (iv), (ii) iff (iv), (i) implies (iii), and (iii) implies (ii).
Step 1: If (i), then (iv).
Suppose S’ is a garbling of S. So there is γ such that, for all s’ in S’ and w in W,
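$$\sigma(s' \mid w) \;=\; \sum_{s \in S} \gamma(s' \mid s)\,\sigma(s \mid w)$$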
And suppose S’ makes λ accessible. That is, there is α’ such that, for all a in D and w in W,
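$$\lambda(a \mid w) \;=\; \sum_{s' \in S'} \alpha'(a \mid s')\,\sigma(s' \mid w)$$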
Then define α as follows:
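$$\alpha(a \mid s) \;=\; \sum_{s' \in S'} \gamma(s' \mid s)\,\alpha'(a \mid s')$$

(Each α(- | s) really is a probability function over D, since each γ(- | s) is a probability function over S’ and each α’(- | s’) is a probability function over D.)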
Then:
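$$\lambda(a \mid w) \;=\; \sum_{s' \in S'} \alpha'(a \mid s')\,\sigma(s' \mid w) \;=\; \sum_{s' \in S'} \alpha'(a \mid s') \sum_{s \in S} \gamma(s' \mid s)\,\sigma(s \mid w) \;=\; \sum_{s \in S} \Big( \sum_{s' \in S'} \gamma(s' \mid s)\,\alpha'(a \mid s') \Big)\,\sigma(s \mid w) \;=\; \sum_{s \in S} \alpha(a \mid s)\,\sigma(s \mid w)$$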
So S makes λ accessible, as (iv) requires.
Step 2: (iv) implies (i).
Suppose that, for all D, and all mixed acts λ on D conditional on W, if S’ makes λ accessible to you, then S makes λ accessible to you. Now let D = S’. Then S’ certainly makes accessible the mixed act λ on D conditional on W given by λ(s’ | w) = σ(s’ | w): just plan, on receiving signal s’, to choose the pure act s’ itself. And so S makes that λ accessible to you too. That is, there is α such that
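$$\sigma(s' \mid w) \;=\; \sum_{s \in S} \alpha(s' \mid s)\,\sigma(s \mid w), \quad \text{for all } s' \text{ in } S' \text{ and } w \text{ in } W.$$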
But that’s just what it means for S’ to be a garbling of S, as (i) requires.
Step 3: (iv) implies (ii).
The crucial fact is this:
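$$\sum_{w \in W} \sum_{s \in S} P(w)\,\sigma(s \mid w)\, u\big(\alpha^{P(\cdot \mid s)}, w\big) \;=\; \max_{\lambda \in \Lambda^D_S} \sum_{w \in W} P(w) \sum_{a \in D} \lambda(a \mid w)\, u(a, w)$$

The left-hand side is the expected pragmatic utility of S when faced with D; the maximum on the right is attained by the λ that arises from the conditional plan of choosing, after each signal s, whatever maximizes expected utility by the lights of P(- | s), and for that λ the right-hand side just is the left-hand side.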
And similarly for S’. But the maximum of a quantity taken over one set must be at least the maximum of a quantity taken over a subset of it. And so, if
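$$\Lambda^D_{S'} \;\subseteq\; \Lambda^D_S,$$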
then
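$$\max_{\lambda \in \Lambda^D_S} \sum_{w \in W} P(w) \sum_{a \in D} \lambda(a \mid w)\, u(a, w) \;\geq\; \max_{\lambda \in \Lambda^D_{S'}} \sum_{w \in W} P(w) \sum_{a \in D} \lambda(a \mid w)\, u(a, w),$$

that is, the expected pragmatic utility of S when faced with D is at least as great as the expected pragmatic utility of S’ when faced with D,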
as required by (ii).
Step 4: (ii) implies (iv).
We prove the contrapositive. Suppose (iv) is not true. That is, there is a decision problem D and there is λ’ in Λ^D_S’ that is not in Λ^D_S. Then note that Λ^D_S and {λ’} are both convex and compact subsets of Reals^(D x W). And they are disjoint. So, by the Separating Hyperplane Theorem, there is a function v in Reals^(D x W) such that, for all λ in Λ^D_S,
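$$\sum_{a \in D} \sum_{w \in W} \lambda(a \mid w)\, v(a, w) \;<\; \sum_{a \in D} \sum_{w \in W} \lambda'(a \mid w)\, v(a, w)$$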
Now, let
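$$v^*(a, w) \;=\; \frac{v(a, w)}{P(w)}$$

(assuming, for this step, that P(w) > 0 for each w in W).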
And, for each act a in D, define a new act a* in such a way that, for all w in W, u(a*, w) = v*(a, w). And let D* = {a* | a in D}. Then define the mixed act λ* on D* conditional on W as follows: λ*(a* | w) = λ(a | w). And similarly for λ'*. Then
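$$\sum_{w \in W} P(w) \sum_{a^* \in D^*} \lambda^*(a^* \mid w)\, u(a^*, w) \;=\; \sum_{a \in D} \sum_{w \in W} \lambda(a \mid w)\, v(a, w),$$

and likewise with λ’* and λ’ in place of λ* and λ.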
And so, for all λ* in Λ^D*_S,
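$$\sum_{w \in W} P(w) \sum_{a^* \in D^*} \lambda^*(a^* \mid w)\, u(a^*, w) \;<\; \sum_{w \in W} P(w) \sum_{a^* \in D^*} \lambda'^*(a^* \mid w)\, u(a^*, w)$$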
And so,
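$$\max_{\lambda^* \in \Lambda^{D^*}_S} \sum_{w \in W} P(w) \sum_{a^* \in D^*} \lambda^*(a^* \mid w)\, u(a^*, w) \;<\; \max_{\lambda^* \in \Lambda^{D^*}_{S'}} \sum_{w \in W} P(w) \sum_{a^* \in D^*} \lambda^*(a^* \mid w)\, u(a^*, w),$$

since λ’* lies in Λ^D*_S’.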
And so,
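the expected pragmatic utility of S when faced with D* is strictly less than the expected pragmatic utility of S’ when faced with D*, by the crucial fact from Step 3 applied to D*,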
which gives the negation of (ii), as required.
Step 5: (i) implies (iii).
Suppose S’ is a garbling of S. Then there is γ such that, for all s’ in S’ and w in W,
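$$\sigma(s' \mid w) \;=\; \sum_{s \in S} \gamma(s' \mid s)\,\sigma(s \mid w)$$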
So:
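$$\sum_{s' \in S'} \sum_{w \in W} P(w)\,\sigma(s' \mid w)\, EU(P(\cdot \mid s'), w) \;=\; \sum_{s' \in S'} \sum_{w \in W} P(w) \Big( \sum_{s \in S} \gamma(s' \mid s)\,\sigma(s \mid w) \Big)\, EU(P(\cdot \mid s'), w)$$

$$=\; \sum_{s \in S} \sum_{s' \in S'} \gamma(s' \mid s) \sum_{w \in W} P(w)\,\sigma(s \mid w)\, EU(P(\cdot \mid s'), w)$$

$$\leq\; \sum_{s \in S} \sum_{s' \in S'} \gamma(s' \mid s) \sum_{w \in W} P(w)\,\sigma(s \mid w)\, EU(P(\cdot \mid s), w) \;=\; \sum_{s \in S} \sum_{w \in W} P(w)\,\sigma(s \mid w)\, EU(P(\cdot \mid s), w)$$

The inequality holds because, for each s, P(w)σ(s | w) is proportional in w to the posterior P(w | s), and EU is strictly proper, so P(- | s) expects itself to do at least as well as P(- | s’); the final equality holds because, for each s, the γ(s’ | s) sum to 1. So the expected epistemic value of S’ is at most the expected epistemic value of S,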
as required by (iii).
Step 6: (iii) implies (ii).
At this point, we borrow an observation due to Mark Schervish and Ben Levinstein. I appealed to it to offer a different Dutch Book argument for Probabilism and Conditionalization; it has also been developed by Giacomo Molinari in his recent paper ‘Deference Principles for Imprecise Credences’, which won the Journal of Philosophy’s 2024 Isaac Levi Prize; and it has been used in this paper, too, to offer a powerful account of deferring to experts. The latter two are the best places to learn about it in more detail.
First, we note that, if we have a certain probability distribution μ over the decision problems we might face, and we score a probability function P at a world w by the expected utility (by the lights of μ) you’d get at w if you were to use P to face whichever decision you’ll face, then this is a proper scoring rule. And if, for any two probability functions P and Q, the set of decision problems in which P and Q disagree concerning what to choose has positive measure by the lights of μ, then it is a strictly proper scoring rule.
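In symbols, writing a^P_D for an option in the decision problem D that maximizes expected utility by the lights of P (ties broken however you like), the score is

$$EU_\mu(P, w) \;=\; \int u\big(a^P_D, w\big)\, d\mu(D).$$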
Now, suppose (ii) is false. That is, there is P, D, u such that
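$$\sum_{w \in W} \sum_{s' \in S'} P(w)\,\sigma(s' \mid w)\, u\big(\alpha^{P(\cdot \mid s')}, w\big) \;>\; \sum_{w \in W} \sum_{s \in S} P(w)\,\sigma(s \mid w)\, u\big(\alpha^{P(\cdot \mid s)}, w\big)$$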
Then we can find a neighbourhood of decision problems around D in which this inequality always holds. We then construct μ as follows: assign a very high probability to facing a decision problem within this neighbourhood, with a uniform distribution within it; assign the remaining very low probability to facing a decision problem outside this neighbourhood, with a uniform distribution over those. Then the resulting scoring rule EU_μ is strictly proper and
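$$\sum_{w \in W} \sum_{s' \in S'} P(w)\,\sigma(s' \mid w)\, EU_\mu(P(\cdot \mid s'), w) \;>\; \sum_{w \in W} \sum_{s \in S} P(w)\,\sigma(s \mid w)\, EU_\mu(P(\cdot \mid s), w),$$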
which gives the negation of (iii), as required.
1. Or, more generally, for any hypothesis i and signal j,
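$$P(H_i \mid S_j) \;=\; \frac{P(H_i)\,\sigma(S_j \mid H_i)}{\sum_k P(H_k)\,\sigma(S_j \mid H_k)}$$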
where k ranges over all the hypotheses, which must together form a partition.
2. My presentation of the result owes a great deal to this wonderful paper by Henrique de Oliveira.
3. Of course, it is possible that P(- | s) expects more than one mixed act to be best. In this case, we must simply pick between them. But it turns out that, whichever we pick, the expectation just defined is the same.