
27.5: Appendix: Inverse Probabilities and Bayes’ Rule


Only two cab companies operate in Belleville, KS. The Blue Company has blue cabs, and the Green Company has green cabs. Exactly 85% of the cabs are blue and 15% are green. A cab was involved in a hit-and-run accident at night. An eyewitness, Wilbur, identified the cab as a green cab. Careful tests were done to ascertain people’s ability to distinguish between blue and green cabs at night. The tests showed that people identified the color correctly 80% of the time, but they were wrong 20% of the time. What is the probability that the cab involved in the accident was indeed a green cab, as Wilbur says?

    Problems like this require us to integrate knowledge about the present case (here, what the eyewitness says) with prior information about base rates (here, what proportion of the cabs are green). In many cases, we focus too much on the present information and ignore information about base rates. The correct way to give both pieces of information their proper due is to use Bayes’ Rule.

    Bayes’ Rule

    Bayes’ Rule (named after the Reverend Thomas Bayes, who discovered it in the eighteenth century) is just another rule for calculating probabilities. It tells us how we should modify or update probabilities when we acquire new information. It gives the posterior probability of a hypothesis A given a piece of new evidence B as:

Pr(A|B) = Pr(A) × Pr(B|A) / Pr(B)

    We say that Pr(A) is the prior probability of the hypothesis, Pr(B|A) the likelihood of B given A, and Pr(B) the prior probability of evidence B. For example, A might be the hypothesis that Smith is infected by the HIV virus, and B might be the datum (piece of evidence) that his test came back positive.
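To see the arithmetic mechanically, here is a minimal Python sketch of the rule; the function is our own illustration, not part of the original text (the value .29 used in the demonstration is the Pr(SG) derived in the cab example below):

```python
def bayes(pr_a, pr_b_given_a, pr_b):
    """Posterior probability Pr(A|B) = Pr(A) * Pr(B|A) / Pr(B)."""
    return pr_a * pr_b_given_a / pr_b

# Demonstration with the cab problem worked out below:
# Pr(G) = .15, Pr(SG|G) = .80, Pr(SG) = .29
print(round(bayes(0.15, 0.80, 0.29), 3))  # 0.414
```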

By way of illustration, let’s work through example one above (we will work through example two below). This example involved the Blue Company, which has all blue cabs, and the Green Company, which has all green cabs. Exactly 85% of the cabs are blue and the other 15% are green. A cab was involved in a hit-and-run accident at night. An honest eyewitness, Wilbur, identified the cab as a green cab. Careful tests were done to ascertain witnesses’ ability to distinguish between blue and green cabs at night; these showed that people were able to identify the color correctly 80% of the time, but they were wrong 20% of the time. What is the probability that the cab involved in the accident was indeed a green cab? The fact that witnesses are reliable leads many people to suppose that Wilbur is probably right, even that the probability that he is right is .8. But is this correct? A large part of the battle is often setting up clear notation, so let’s begin with that:

    • Pr(G) = .15

    This is the base rate of green cabs in the city. It gives the prior probability that the cab in the accident is green. Similarly, Pr(B) = .85.

    • Pr(SG|G) = .80

    This is the probability the witness will be correct in saying green when the cab in fact is green, i.e., given that the cab really was green. Similarly, Pr(SB|B) = .80. These are the probabilities that witnesses are correct, so by the negation rule, the probabilities of misidentifications are:

    • Pr(SG | B) = .2 and Pr(SB | G) = .2

    What we want to know is the probability that the cab really was green, given that Wilbur said it was green, i.e., we want to know Pr(G|SG). According to Bayes’ Rule, this probability is given by:

Pr(G|SG) = Pr(G) × Pr(SG|G) / Pr(SG)

We have the values for the two expressions in the numerator, Pr(G) = .15 and Pr(SG|G) = .8, but we must do a little work to determine the value for the expression Pr(SG) in the denominator. To see what this value, the probability that a witness will say that the cab was green, must be, note that there are two conditions under which Wilbur could say that the cab in the accident was green. He might say this when the cab was green or when it was blue. This is a disjunction, so we add these two probabilities. In the first disjunct, Wilbur says green and the cab is green; in the second disjunct, Wilbur says green and the cab is blue. Putting this together we get:

    Pr(SG) = Pr(G & SG) + Pr (B & SG)

Now our rule for conjunctions tells us that Pr(G & SG) = Pr(G) × Pr(SG|G) and Pr(B & SG) = Pr(B) × Pr(SG|B). So,

Pr(SG) = Pr(G) × Pr(SG|G) + Pr(B) × Pr(SG|B)

= (.15 × .80) + (.85 × .20)

= .12 + .17

= .29

    Finally, we substitute this number, .29, into the denominator of Bayes’ Rule:

Pr(G|SG) = Pr(G) × Pr(SG|G) / Pr(SG)

= (.15 × .80) / .29

= .414

So, the probability that the witness was correct in saying the cab was green is just a bit above .4 (less than fifty/fifty), and (by the negation rule) the probability that he is wrong is nearly .6. This is so even though witnesses are reliable. How can this be? The answer is that the high base rate of blue cabs, and the low base rate of green cabs, makes it somewhat likely that the witness was wrong in this case.
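As a numerical check on the whole calculation, here is a short Python sketch (the variable names are ours):

```python
# Cab problem: base rates and witness reliability from the text.
pr_g, pr_b = 0.15, 0.85          # Pr(green cab), Pr(blue cab)
pr_sg_g, pr_sg_b = 0.80, 0.20    # Pr(says green | green), Pr(says green | blue)

# Denominator via the Rule for Total Probabilities, then Bayes' Rule.
pr_sg = pr_g * pr_sg_g + pr_b * pr_sg_b
pr_g_given_sg = pr_g * pr_sg_g / pr_sg
print(round(pr_sg, 2), round(pr_g_given_sg, 3))  # 0.29 0.414
```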

    Rule for Total Probabilities

    In calculating the value of the denominator in the above problem we made use of the Rule for Total Probabilities. We will use a simple, but widely applicable version of this rule here. If a sentence B is true, then either B and A are true or else B and ~A are true (since any sentence A is either true or false). This allows us to express the probability of B – Pr(B) – in a more complicated way. This may seem like a strange thing to do, but it turns out that it is often possible to calculate the probability of the more complicated expression when it isn’t possible to obtain the probability of B more directly.

    Pr(B) = Pr [(B & A) or (B & ~A)]

    = Pr(B & A) + Pr(B & ~A)

= [Pr(A) × Pr(B | A)] + [Pr(~A) × Pr(B | ~A)]

    In short:

    Pr(B) = [Pr(A) × Pr(B | A)] + [Pr(~A) × Pr(B | ~A)]

    This rule can be useful in cases that do not involve Bayes’ Rule, as well as in many cases that do. It is particularly useful when we deal with an outcome that can occur in either of two ways.

    Example: A factory has two machines that make widgets. Machine A makes 800 per day and 1% of them are defective. The other machine, call it ~A, makes 200 a day and 2% are defective. What is the probability that a widget produced by the factory will be defective (D)? We know the following:

• Pr(A) = .8 – the probability that a widget is made by machine A is .8 (since this machine makes 800 of the 1,000 produced every day).
• Pr(~A) = .2 – the probability that a widget is made by the other machine, ~A, is .2.
• Pr(D|A) = .01 – the probability that a widget is defective given that it was made by machine A, which turns out 1% defective widgets, is .01.
• Pr(D|~A) = .02 – the probability that a widget is defective given that it was made by machine ~A, which turns out 2% defective widgets, is .02.

    Plugging these numbers into the theorem for total probability we have:

    Pr(D) = [Pr(A) × Pr(D | A)] + [Pr(~A) × Pr(D|~A)]

= [.8 × .01] + [.2 × .02]

= .008 + .004

= .012
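The same computation in Python, as a quick check (variable names are ours):

```python
# Widget factory: Pr(D) by the Rule for Total Probabilities.
pr_a, pr_not_a = 0.8, 0.2        # share of output from machine A and from ~A
pr_d_a, pr_d_not_a = 0.01, 0.02  # defect rates for A and ~A

pr_d = pr_a * pr_d_a + pr_not_a * pr_d_not_a
print(round(pr_d, 3))  # 0.012
```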

    Odds Version of Bayes’ Rule

Particularly when we are concerned with mutually exclusive and exhaustive hypotheses (cancer vs. no cancer), the odds version of Bayes’ Rule is often the most useful version. It also allows us to avoid the need for a value for Pr(e), which is often difficult to come by. Let H be some hypothesis under consideration, and e be some newly acquired piece of evidence (like the result of a medical test) that bears on it. Then:

Pr(H|e) / Pr(~H|e) = [Pr(H) / Pr(~H)] × [Pr(e|H) / Pr(e|~H)]

The expression on the left gives the posterior odds of the two hypotheses, Pr(H)/Pr(~H) gives their prior odds, and Pr(e|H)/Pr(e|~H) is the likelihood (or diagnostic) ratio. In integrating information, our background knowledge of base rates is reflected in the prior odds, and our knowledge about the specific case is represented by the likelihood ratio. For example, if e represents a positive result in a test for the presence of the HIV virus, then Pr(e|H) gives the hit rate (sensitivity) of the test, and Pr(e|~H) gives the false alarm rate. As this multiplicative relation indicates, when the prior odds are quite low, even a relatively high likelihood ratio won’t raise the posterior odds dramatically.
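The odds version is easy to capture in code. Here is a minimal Python sketch (the function names are ours, chosen for illustration):

```python
def posterior_odds(pr_h, pr_e_given_h, pr_e_given_not_h):
    """Odds form of Bayes' Rule: posterior odds = prior odds * likelihood ratio."""
    prior_odds = pr_h / (1 - pr_h)
    likelihood_ratio = pr_e_given_h / pr_e_given_not_h
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds in favor of H into Pr(H)."""
    return odds / (1 + odds)
```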

    The importance of integrating base rates into our inferences is vividly illustrated by example two above. Suppose that we have a test for the HIV virus. The probability that a person who really has the virus will test positive is .90, while the probability that a person who does not have it will test positive is .20. Finally, suppose that the probability that a person in the general population has the virus (the base rate of the virus) is .01. How likely is Smith, whose test came back positive, to be infected?

Because the test is tolerably accurate, many people suppose that the chances are quite high that Smith is infected, perhaps even as high as 90%. Indeed, various studies have shown that many physicians even suppose this. It has been found that people tend to confuse probabilities and their converses. The probability that we’ll get the positive test, e, if the person has the virus is .9, i.e., Pr(e|H) = .9. But we want to know the probability of the converse, the probability that the person has the virus given that they tested positive, i.e., Pr(H|e). These need not be the same, or even close to the same. In fact, when we plug the numbers into Bayes’ Rule, we find that the probability that someone in this situation is infected is quite low.

    To see this, we simply plug our numbers into the odds version of Bayes’ rule. We have been told that:

    • Pr(e|H) = .9
    • Pr(e|~H) = .2
    • Pr(H) = .01
    • Pr(~H) = .99.

    Hence,

Pr(H|e) / Pr(~H|e) = [Pr(H) / Pr(~H)] × [Pr(e|H) / Pr(e|~H)] = (.01 / .99) × (.9 / .2) = 1/22

    This gives the odds: 22 to 1 against Smith’s being infected. So, Pr(H |e) = 1/23. Although the test is relatively accurate, the low base rate means that there is only 1 chance in 23 that Smith has the virus.
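Checking this with the sketch above (or directly, as here):

```python
# HIV example: prior odds .01/.99 times likelihood ratio .9/.2.
odds = (0.01 / 0.99) * (0.9 / 0.2)
print(round(odds, 4))               # 0.0455, i.e., 1/22 in favor (22 to 1 against)
print(round(odds / (1 + odds), 4))  # Pr(H|e) = 0.0435, i.e., 1/23
```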

    Similar points apply to other medical tests and to drug tests, if the base rate of the condition being tested for is low. It is not difficult to see how policy makers, or the public to which they respond, can make very inaccurate assessments of probabilities, and hence poor decisions about risks and remedies, if they overlook the importance of base rates.

    As we noted earlier, it will help you to think about these matters intuitively if you try to rephrase probability problems in terms of frequencies or proportions whenever possible.

    Further Matters

    Bayes’ Rule makes it clear when a conditional probability will be equal to its converse. When we look at:

Pr(A|B) = Pr(A) × Pr(B|A) / Pr(B)

    we see that Pr(A|B) = Pr(B|A) exactly when Pr(A) = Pr(B), so that they cancel out (assuming, as always, that we do not divide by zero). Dividing both sides of Bayes’ Rule by Pr(B|A) gives us the following ratio:

Pr(A|B) / Pr(B|A) = Pr(A) / Pr(B)

    which is often useful.

    Updating via Conditionalization

    One way to update or revise our beliefs given the new evidence e is to set our new probability (once we learn about the evidence) as:

New Pr(H) = Old Pr(H|e)

    with Old Pr(H|e) determined by Bayes’ Rule. Such updating is said to involve Bayesian conditionalization.
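A minimal Python sketch of conditionalization (our own illustration; the repeated-test scenario assumes the test’s errors are independent from one application to the next):

```python
def conditionalize(pr_h, pr_e_given_h, pr_e_given_not_h):
    """New Pr(H) = Old Pr(H|e), with Pr(e) from the Rule for Total Probabilities."""
    pr_e = pr_h * pr_e_given_h + (1 - pr_h) * pr_e_given_not_h
    return pr_h * pr_e_given_h / pr_e

# Two positive results on the HIV-style test from earlier in this section:
p = 0.01
p = conditionalize(p, 0.9, 0.2)  # after the first positive: about .043 (1/23)
p = conditionalize(p, 0.9, 0.2)  # after a second positive: about .17
print(round(p, 2))               # 0.17
```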

    The Problem of the Reference Class

    There are genuine difficulties in selecting the right group to consider when we use base rates. This is known as the problem of the reference class. Suppose that we are considering the effects of smoking on lung cancer. Which reference class should we consider: all people, just smokers, or just heavy smokers?

    Similar problems arise when we think about regression to the mean. We know that extreme performances and outcomes tend to be followed by ones that are more average (that regress to the mean). If Wilma hits 86% of her shots in a basketball game, she is likely to hit a lower percentage the next time out. People do not seem to have an intuitive understanding of this phenomenon, and the difficulty is compounded by the fact that, as with base rates, there is a problem about reference class. Which average will the scores regress to: Wilma’s average, her recent average, or her team’s average?

    Up to a point, smaller reference classes are likely to underwrite better predictions, and we will also be interested in reference classes that seem causally relevant to the characteristic we care about. But if a reference class becomes too small, it won’t generate stable frequencies, and it will often be harder to find data on smaller classes that is tailored to our concerns. There is no one right answer about which reference class is the correct one. It often requires a judicious balancing of tradeoffs. With base rates, it may be a matter of weighing the costs of a wrong prediction against the payoffs of extreme accuracy, or the value of more precise information against the costs of acquiring it or (if you are a policy maker) the fact that the number of reference classes grows exponentially with each new predictor variable. But while there is no uniquely correct way to use base rates, it doesn’t follow that it is fine to ignore them; failing to take base-rate information into account when we have it will often lead to bad policy.

    Exercises on Bayes' Theorem and Conditional Probabilities

    1. In the section on the Rule for Total Probability we encountered the factory with two machines that make widgets. Machine A makes 800 per day and 1% of them are defective. The other machine, call it ~A, makes 200 a day and 2% are defective. What is the probability that a widget is produced by machine A, given that it is defective? We know the probability that it is defective if it is produced by A—Pr(D|A) = .01— but this asks for the opposite or converse probability; what is Pr(A|D)? Use Bayes’ Rule to calculate the answer.
    2. Earlier you were asked to draw a picture to solve the following problem; now solve it by calculation and see if your answers agree. Officials at the suicide prevention center know that 2% of all people who phone their hot line attempt suicide. A psychologist has devised a quick and simple verbal test to help identify those callers who will attempt suicide. She found that:
      1. 80% of the people who will attempt suicide have a positive score on this test.
      2. Only 5% of those who will not attempt suicide have a positive score on this test.

    If you get a positive identification from a caller on this test, what is the probability that he would attempt suicide?

3. A clinical test, designed to diagnose a specific illness, comes out positive for a certain patient. We are told that:
      1. The test is 79% accurate: the chances that you have the illness if it says you do is 79%, and the chances that you do not have the illness if it says you don’t is also 79%.
      2. This illness affects 1% of the population in the same age group as the patient. Taking these two facts into account, and assuming you know nothing about the patient’s symptoms or signs, what is the probability that this particular patient actually has the illness?
4. Suppose that you are handed two bags of poker chips, but from the outside you can’t tell which is which.
      1. Bag 1: 70 red chips and 30 blue chips
      2. Bag 2: 30 red chips and 70 blue chips

You pick one of the two bags at random and draw a chip from it. The chip is red. You replace it, draw again, and get another chip, and so on through twelve trials (i.e., twelve draws). In the twelve trials, you get 8 red chips and 4 blue chips. What is the probability that you have been drawing chips from Bag 1 (with 70 red and 30 blue) rather than from Bag 2 (with 30 red and 70 blue)? Most people answer that the probability that you have been drawing from Bag 1 is around .75. In fact, as Bayes’ Rule will show, it is about .97. But people often revise their probabilities less than Bayes’ Rule says they should. People are conservative when it comes to updating their probabilities in the light of new evidence. Use Bayes’ Rule to show that .97 is the correct value.

5. Wilbur has two children. We run into him at the mall with a teenage boy he introduces as his son. What is the probability that Wilbur’s other child is a boy? Second scenario: Wilbur introduces the boy as his oldest son. Now what is the probability that his other child is a boy? (Hint: think about the two-aces problem above.)
6. One thousand people, including you, bought one ticket in the local lottery. There were ten winners, all of whom were notified correctly that they had won. But because of a clerical error, 1% of the people who didn’t win also received notifications that they did. You received a letter saying you were a winner. What is the probability that you really did win?
7. Monty Hall Problem. In an earlier chapter, we approached the Monty Hall Problem in an intuitive way. Now we will verify our earlier answers using Bayes’ Rule. Recall that in this problem you imagine that you are a contestant in a game show and there are three doors in front of you. There is nothing worth having behind two of them, but there is $100,000 behind the third. If you pick the correct door, the money is yours. You choose A. But before the host, Monty Hall, shows you what is behind that door, he opens one of the other two doors, picking one he knows has nothing behind it. Suppose he opens door B. This takes B out of the running, so the only question now is about door A vs. door C. Monty (‘M’, for short) now allows you to reconsider your earlier choice: you can either stick with door A or switch to door C. Should you switch?
      1. What is the probability that the money is behind door A?
      2. What is the probability that the money is behind door C?

Since we are asking these questions once Monty has opened door B, they are equivalent to asking about the conditional probabilities Pr($ behind A | M opened B) and Pr($ behind C | M opened B), which are calculated in the answer to exercise 7 below.

    Answer

    1. We are given all the relevant numbers except that for Pr(D), but we calculated this in the section on total probabilities. By Bayes’ Rule:

Pr(A|D) = Pr(A) × Pr(D|A) / Pr(D)

= (.8 × .01) / .012

= 2/3
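The same arithmetic as a one-line Python check (our own):

```python
# Exercise 1: Pr(A|D) = Pr(A) * Pr(D|A) / Pr(D) = .8 * .01 / .012
print(0.8 * 0.01 / 0.012)  # 0.666..., i.e., 2/3
```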

    7. Let’s give the doors letter names so they don’t get confused with our numbers. Then, to solve the Monty Hall problem we must calculate:

    1’ Pr($ behind A|M opened B)

    2’ Pr($ behind C|M opened B)

    Bayes’ Rule tells us that Pr(Money behind A|M opened B) =

Pr($ behind A) × Pr(M opened B | $ behind A) / Pr(M opens B)

    We know the values for the two items in the numerator:

    1. Pr($ behind A): the prior probability that the money is behind door A is 1/3 (going in, it could equally well be behind any of the three doors).
    2. Pr(M opened B|$ behind A) is ½ (there is a fifty/fifty chance that he would open either door B or door C, when the money is behind door A).
    3. But Pr(M opens B), the number in the denominator, requires some work.

    To see the value of the denominator—Pr(M opens B)—note that Monty will never open door B if the money is behind B. Hence, he could open B under exactly two conditions. He could open B when the money is behind A or he could open B when the money is behind C. So,

Pr(M opens B) = [Pr($ behind A) × Pr(M opens B | $ behind A)]

+ [Pr($ behind C) × Pr(M opens B | $ behind C)]

= [1/3 × 1/2] + [1/3 × 1]

= [1/6] + [1/3]

= 1/2

    Plugging this into the denominator in Bayes’ Rule, we get:

Pr($ behind A | M opened B) = Pr($ behind A) × Pr(M opened B | $ behind A) / Pr(M opens B)

= (1/3 × 1/2) / (1/2)

= (1/6) / (1/2)

= 1/3

    So, the probability that the money is behind your door, door A, given that Monty has opened door B, is 1/3. Since the only other door left is door C, the probability that the money is there, given that Monty opened door B, should be 2/3. Use Bayes’ Rule to prove that this is correct.
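This answer is also easy to confirm by simulation. The following Monte Carlo sketch is our own code, not part of the text; it implements the fifty/fifty tie-break described above and conditions on the trials where Monty opens door B:

```python
import random

random.seed(0)
stay = switch = trials = 0
for _ in range(100_000):
    money = random.choice("ABC")      # prize placed at random; you always pick A
    if money == "A":
        opened = random.choice("BC")  # Monty flips a coin between the empty doors
    else:
        opened = "C" if money == "B" else "B"
    if opened != "B":
        continue                      # condition on Monty opening door B
    trials += 1
    stay += (money == "A")            # sticking with A wins
    switch += (money == "C")          # switching to C wins
print(round(stay / trials, 2), round(switch / trials, 2))  # about 0.33 and 0.67
```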

    Derivation of Bayes’ Rule

    Where does Bayes’ theorem come from? It is straightforward to derive it from our rules for calculating probabilities. You don’t need to worry about the derivation, but here it is for anyone who likes such things.

Pr(A|B) = Pr(A & B) / Pr(B)

= Pr(A) × Pr(B|A) / Pr(B)

    In actual applications we often don’t have direct knowledge of the value of the denominator, Pr(B), but in many cases we do have enough information to calculate it using the Rule for Total Probability. This tells us that:

Pr(B) = Pr(B & A) + Pr(B & ~A)

= [Pr(A) × Pr(B | A)] + [Pr(~A) × Pr(B | ~A)]

So, the most useful version of Bayes’ Rule is often:

Pr(A|B) = Pr(A) × Pr(B|A) / ([Pr(A) × Pr(B|A)] + [Pr(~A) × Pr(B|~A)])

    Bayes’ Rule can take more complex forms, but this is as far as we will take it in this book.
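That most useful version translates directly into code. A minimal Python sketch (the function name is ours):

```python
def bayes_full(pr_a, pr_b_given_a, pr_b_given_not_a):
    """Pr(A|B) with the denominator expanded by the Rule for Total Probabilities."""
    numerator = pr_a * pr_b_given_a
    return numerator / (numerator + (1 - pr_a) * pr_b_given_not_a)

# The cab problem again, without needing Pr(SG) in advance:
print(round(bayes_full(0.15, 0.80, 0.20), 3))  # 0.414
```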

