7. The base rate fallacy

Consider the following scenario. You go in for testing because of some health problems you've been having, and after a number of tests you test positive for colon cancer. What are the chances that you really do have colon cancer? Let's suppose that the test is not perfect, but it is 95% accurate. That is, in the case of those who really do have colon cancer, the test will detect the cancer 95% of the time (and thus miss it 5% of the time). The test will also misdiagnose 5% of those who don't actually have colon cancer. Many people would be inclined to say that, given the test and its accuracy, there is a 95% chance that you have colon cancer. However, if, like most people, you are inclined to answer this way, you are wrong. In fact, you have committed the fallacy of ignoring the base rate (i.e., the base rate fallacy).

 

The base rate in this example is the rate of those who have colon cancer in the population. Only a very small percentage of the population actually has colon cancer (let's suppose it is .005, or .5%), so the probability that you have it must take into account the very low probability that you are one of the few who do. That is, prior to the test (and not taking into account any other details about you), there was a very low probability that you have it: a half of one percent chance (.5%). Yes, the test is 95% accurate, but given the very low prior probability that you have colon cancer, we cannot simply conclude that there is a 95% chance that you have it. Rather, we must temper that figure with the very low base rate. The general point is this: when a condition is very rare, then even if a highly accurate test identifies the condition as being present, we should still suspect that it is not present. In the above scenario, the prior probability (i.e., before the test) that you have colon cancer is really, really low, and that means that even after the test we should suspect that the probability is still fairly low. That's the logic of the matter, and we can understand it conceptually without even getting into the math.

 

But since we are given numbers to work with, we can use the math to figure out the actual probability that you have colon cancer. Here is how we do it. Let's suppose that our population is 100,000 people. The base rate tells us that .5% of the population has colon cancer, which means that of the 100,000 people, only 500 have colon cancer. If we were to apply the 95% accurate test to those 500 people, the test would correctly diagnose 475 of them. That is, the test would deliver 475 correct identifications; these are called true positives. However, the test will also mistakenly tell us that some of the people who don't have colon cancer actually do have it; each such case is called a false positive. Our base rate tells us that most of our population (99,500 people) do not have colon cancer, and the 95% accurate test will misdiagnose 5% of them as having colon cancer. This comes out to 4,975 false positives!


So what are the chances that you are a true positive rather than a false positive? It is simply the number of true positives (475) divided by the total number of positive identifications the test would make. This latter number includes those the test would misidentify (4,975) as well as those it would accurately identify (475), so the total number the test would identify as having colon cancer is 5,450. The probability that you have colon cancer, given the positive test, is therefore 475/5,450 = .087, or 8.7%. Thus, contrary to our initial reasoning that there was a 95% chance that you have colon cancer, the chance is less than a tenth of that: under 10%! In thinking that the probability is closer to 95%, you would be ignoring the extremely low probability of having the disease in the first place (i.e., the low base rate). Neglecting to account for low base rates in determining the probability of some event is the signature of any base rate fallacy.
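Since all the numbers are given, the computation is easy to check with a few lines of code. Here is a minimal Python sketch of the calculation just described (the variable names are ours; the figures are the ones from the example):

```python
# Colon cancer example: 0.5% base rate, 95% accurate test.
population = 100_000
base_rate = 0.005   # 0.5% of the population has colon cancer
accuracy = 0.95     # the test is 95% accurate

have_cancer = population * base_rate          # 500 people
no_cancer = population - have_cancer          # 99,500 people

true_positives = have_cancer * accuracy       # 475 correct identifications
false_positives = no_cancer * (1 - accuracy)  # 4,975 misdiagnoses

probability = true_positives / (true_positives + false_positives)
print(f"{probability:.3f}")  # 0.087, i.e., 8.7%
```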

 

The general lesson here is that the number of false positives will be quite high (even when the identification method is fairly accurate) as long as the base rate of the phenomenon we’re looking for is very low. And if the number of false positives is high, then this will significantly lower the probability that the identification method has correctly identified the phenomenon in question.
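To see how sharply the result depends on the base rate, here is a short Python sketch (our own illustration, not from the text) that holds the method's accuracy fixed at 95% and varies the base rate:

```python
# With accuracy held at 95%, the chance that a positive
# identification is correct collapses as the base rate falls.
accuracy = 0.95
for base_rate in (0.5, 0.1, 0.01, 0.005):
    true_pos = base_rate * accuracy                # fraction correctly flagged
    false_pos = (1 - base_rate) * (1 - accuracy)   # fraction wrongly flagged
    prob = true_pos / (true_pos + false_pos)
    print(f"base rate {base_rate:.3f} -> P(correct | positive) = {prob:.3f}")
```

At a 50% base rate a positive identification is 95% reliable, but at the .5% base rate from the colon cancer example it is only about 8.7% reliable.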

 

From the above example, we can see that the general method for determining probabilities when base rates are in play is the following:

 

1.    Determine the number of false positives (i.e., the number of instances that are incorrectly identified by the method)

2.    Determine the number of true positives (i.e., the number of instances that are correctly identified by the method)

3.    Use the following equation to figure the probability:

 


probability = true positives / (true positives + false positives)
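The three steps translate directly into a small function. Here is a sketch in Python (the function name and parameters are ours, and it relies on this section's simplifying assumption that the false positive rate equals one minus the method's accuracy):

```python
def prob_given_positive(base_rate, accuracy, population=100_000):
    """Probability that a positive identification is correct."""
    has_condition = population * base_rate
    lacks_condition = population - has_condition
    false_positives = lacks_condition * (1 - accuracy)  # step 1
    true_positives = has_condition * accuracy           # step 2
    return true_positives / (true_positives + false_positives)  # step 3

# Reproduces the colon cancer example from above:
print(round(prob_given_positive(base_rate=0.005, accuracy=0.95), 3))  # 0.087
```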

 

Before closing this section, let's look at a couple more examples of the base rate fallacy. Suppose that the government has developed a machine that is able to detect terrorist intent with an accuracy of 90%. During a joint meeting of Congress, a highly trustworthy source says that there is a terrorist in the building. (Let's suppose, for the sake of simplifying this example, that there is in fact exactly one terrorist in the building.) In order to determine who the terrorist is, building security seals all the exits, rounds up all 3,000 people in the building, and uses the machine to test each person. The first 30 people pass without triggering a positive identification from the machine, but on the very next person, the machine triggers a positive identification of terrorist intent. The question is: what are the chances that the person who set off the machine really is a terrorist?8 Consider the following three possibilities: a) 90%, b) 10%, or c) .3%. If you answered 90%, then you committed the base rate fallacy again. The actual answer is (c): less than 1%! Here's why. The base rate is the likelihood that any given individual is a terrorist, and this is exceedingly low, since there is only one terrorist among the 3,000 people in the building. That means the probability that any one person is a terrorist, before any results of the test, is exceedingly low: 1/3,000. Since the test is 90% accurate, it will misidentify 10% of the 2,999 non-terrorists as terrorists: roughly 300 false positives. Assuming the machine correctly flags the one actual terrorist, it will identify a total of 301 individuals as "possessing terrorist intent." The probability that any one of them actually possesses terrorist intent is 1/301 = .3%. So the probability is drastically lower than 90%. It's not even close. This is another good illustration of how far off probabilities can be when the base rate is ignored.
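The same arithmetic in a short Python sketch (the figures are from the example, and, as in the text, we assume the machine flags the one actual terrorist):

```python
# Terrorist-detector example: 3,000 people, exactly one terrorist,
# and a machine that is 90% accurate.
people = 3_000
terrorists = 1
accuracy = 0.90

false_positives = (people - terrorists) * (1 - accuracy)  # ~300 of the 2,999
true_positives = terrorists  # assume the one actual terrorist is flagged

probability = true_positives / (true_positives + false_positives)
print(f"{probability:.4f}")  # ~0.0033, i.e., about .3%
```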

 

Last one. Suppose that Bob is a super-eyewitness: when he observes a crime, he is 99% accurate in identifying the suspect. Suppose that Bob identifies Nancy as the person who robbed the American Apparel store. In a population in which .5% (half of one percent) are robbers, what is the probability that Nancy really is a robber, given that Bob's eyewitness testimony identified her as one?

 

At this point, having been sensitized to the base rate fallacy, you should suspect that the probability is nowhere near as high as the accuracy of Bob's eyewitness skills. Here's the math. Suppose our population is 1,000 people: 995 non-robbers and 5 robbers (based on the above base rate of robbers in the population). Of the 995 non-robbers, Bob will misidentify 9.95 as robbers (false positives); of the 5 robbers, he will accurately identify 4.95 (true positives), for a total of 14.9 robber identifications. So the chances that Nancy really is a robber, given Bob's eyewitness evidence, are:

 


8 This example is taken (with certain alterations) from: http://news.bbc.co.uk/2/hi/uk_news/magazine/8153539.stm



probability = (# of robbers Bob correctly identifies) / (total # of Bob's robber identifications)

= 4.95/14.9 = 33%. Thus, it is more likely that Nancy is not the robber.
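Here, finally, is the eyewitness example as a minimal Python sketch (same figures as above):

```python
# Super-eyewitness example: a 99% accurate witness and a 0.5%
# base rate of robbers in a population of 1,000.
population = 1_000
base_rate = 0.005
accuracy = 0.99

robbers = population * base_rate                 # 5 robbers
non_robbers = population - robbers               # 995 non-robbers

true_positives = robbers * accuracy              # 4.95 correct identifications
false_positives = non_robbers * (1 - accuracy)   # 9.95 misidentifications

probability = true_positives / (true_positives + false_positives)
print(f"{probability:.2f}")  # 0.33: Nancy is probably not the robber
```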