In this section and the next, the goal is to equip ourselves to understand, analyze, and criticize arguments using statistics. Such arguments are extremely common; they’re also frequently manipulative and/or fallacious. As Mark Twain once said, “There are three kinds of lies: lies, damned lies, and statistics.” It is possible, however, with a minimal understanding of some basic statistical concepts and techniques, along with an awareness of the various ways these are commonly misused (intentionally or not), to see the “lies” for what they are: bad arguments that shouldn’t persuade us. In this section, we will provide a foundation of basic statistical knowledge. In the next, we will look at various statistical fallacies.
Averages: Mean vs. Median
The word ‘average’ is slippery: it can be used to refer either to the arithmetic mean or to the median of a set of values. The mean and median are often different, and when this is the case, use of the word ‘average’ is equivocal. A clever person can use this fact to her rhetorical advantage. We hear the word ‘average’ thrown around quite a bit in arguments: the average family has such-and-such an income, the average student carries such-and-such in student loan debt, and so on. Audiences are supposed to take this fictional average entity to be representative of all the others, and depending on the conclusion she’s trying to convince people of, the person making the argument will choose between mean and median, picking the number that best serves her rhetorical purpose. It’s important, therefore, for the critical listener to ask, every time the word ‘average’ is used, “Does this refer to the mean or the median? What’s the difference between the two? How would using the other affect the argument?”
A simple example can make this clear. (Inspiration for this example, as with much that follows, comes from Darrell Huff, 1954, How to Lie with Statistics, New York: Norton.) I run a masonry contracting business on the side—Logical Constructions (a wholly owned subsidiary of LogiCorp). Including myself, 22 people work at Logical Constructions. This is how much they’re paid per year: $350,000 for me (I’m the boss); $75,000 each for two foremen; $70,000 for my accountant; $50,000 each for five stone masons; $30,000 for the office secretary; $25,000 each for two apprentices; and $20,000 each for ten laborers. To calculate the mean salary at Logical Constructions, we add up all the individual salaries (my $350,000, $75,000 twice since there are two foremen, and so on) and divide by the number of employees. The result is $50,000. To calculate the median salary, we put all the individual salaries in numerical order (ten entries of $20,000 for the laborers, then two entries of $25,000 for the apprentices, and so on) and find the middle number—or, as is the case with our set, which has an even number of entries, the mean of the middle two numbers. The middle two numbers are both $25,000, so the median salary is $25,000.
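For readers who would like to check the arithmetic, here is a minimal sketch in Python. The payroll figures come straight from the example; the variable names are our own, chosen for illustration.

```python
from statistics import mean, median

# Payroll at Logical Constructions as (salary, headcount) pairs, from the example.
payroll = [
    (350_000, 1),  # the boss
    (75_000, 2),   # foremen
    (70_000, 1),   # accountant
    (50_000, 5),   # stone masons
    (30_000, 1),   # office secretary
    (25_000, 2),   # apprentices
    (20_000, 10),  # laborers
]

# Expand the pairs into one salary entry per employee.
salaries = [salary for salary, count in payroll for _ in range(count)]

mean_salary = mean(salaries)      # total payroll divided by number of employees
median_salary = median(salaries)  # middle value; mean of the middle two when the count is even

print(len(salaries), mean_salary, median_salary)
```

With 22 entries, the two middle values are the 11th and 12th in sorted order, and both belong to apprentices. That is why the median lands at $25,000 while the boss’s salary pulls the mean up to $50,000.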
Now, you may have noticed, a lot of my workers don’t get paid particularly well. In particular, those at the bottom—my ten laborers—are really getting the shaft: $20,000 a year for that kind of back-breaking work is a raw deal. Suppose one day, as I’m driving past our construction site (in the back of my limo, naturally), I notice some outside agitators commiserating with my laborers during their (10-minute) lunch break—you know the type, union organizers, pinko commies (in this story, I’m a greedy capitalist; play along). They’re trying to convince my employees to bargain collectively for higher wages. Now we have a debate: should the workers at Logical Constructions be paid more? I take one side of the issue; the workers and organizers take the other. In the course of making our arguments, we might both refer to the average worker at Logical Constructions. I’ll want to do so in a way that makes it appear that this mythical worker is doing pretty well, and so we don’t need to change anything; the organizers will want to do so in such a way that makes it appear that the average worker isn’t doing very well at all. We have two senses of ‘average’ to choose from: mean and median. In this case, the mean is higher, so I will use it: “The average worker at Logical Constructions makes $50,000 per year. That’s a pretty good wage!” My opponents, the union organizers, will counter, using the median: “The average worker at Logical Constructions makes a mere $25,000 per year. Try raising a family on such a pittance!”
A lot hangs on which sense of ‘average’ we pick. This is true in lots of real-life circumstances. For example, household income in the United States is distributed much as salaries are at my fictional Logical Constructions company: those at the top of the range fare much better than those at the bottom. (In 2014, the richest fifth of American households accounted for over 51% of income; the poorest fifth, 3%.) In such circumstances, the mean is higher than the median. In 2014, the mean household income in the U.S. was $72,641. That’s pretty good! The median, however, was a mere $53,657. That’s a big difference! “The average family makes about $72,000 per year” sounds a lot better than “The average family makes about $53,000 per year.”
Normal Distributions: Standard Deviation, Confidence Intervals
If you gave IQ tests to a whole bunch of people, and then graphed the results on a histogram or bar chart—so that every time you saw a particular score, the bar for that score would get higher— you’d end up with a picture like this:
This kind of distribution is called a “normal” or “Gaussian” distribution (“Gaussian” because the great German mathematician Carl Friedrich Gauss made a study of such distributions in the early 19th century, in connection with their relationship to errors in measurement); because of its shape, it’s often called a “bell curve”. Besides IQ, many phenomena in nature are (approximately) distributed normally: height, blood pressure, motions of individual molecules in a collection, lifespans of industrial products, measurement errors, and so on. (This is a consequence of a mathematical result, the Central Limit Theorem, the basic upshot of which is that if some random variable (a trait like IQ, for example, to be concrete) is the sum of many independent random variables (causes of IQ differences: lots of different genetic factors, lots of different environmental factors), then the variable (IQ) will be normally distributed. The mathematical theorem deals with abstract numbers, and the distribution is only perfectly “normal” when the number of independent variables approaches infinity. That’s why real-life distributions are only approximately normal.) And even when traits are not normally distributed, it can be useful to treat them as if they were. This is because the bell curve provides an extremely convenient starting point for making certain inferences. It’s convenient because one can know everything about such a curve by specifying two of its features: its mean (which, because the curve is symmetrical, is the same as its median) and its standard deviation.
We already understand the mean. Let’s get a grip on standard deviation. We don’t need to learn how to calculate it (though that can be done); we just want a qualitative (as opposed to quantitative) understanding of what it signifies. Roughly, it’s a measure of the spread of the data represented on the curve; it’s a way of indicating how far, on average, values tend to stray from the mean. An example can make this clear. Consider two cities: Milwaukee, WI and San Diego, CA. These two cities are different in a variety of ways, not least in the kind of weather their residents experience. Setting aside precipitation, let’s focus just on temperature. If you recorded the high temperatures every day in each town over a long period of time and made a histogram for each (with temperatures on the x-axis, number of days on the y-axis), you’d get two very different-looking curves. Maybe something like these:
The average high temperatures for the two cities—the peaks of the curves—would of course be different: San Diego is warmer on average than Milwaukee. But the range of temperatures experienced in Milwaukee is much greater than that in San Diego: some days in Milwaukee, the high temperature is below zero, while on some days in the summer it’s over 100°F. San Diego, on the other hand, is basically always perfect: right around 70° or so. (This is an exaggeration, of course, but not much of one. The average high in San Diego in January is 65°; in July, it’s 75°. Meanwhile, in Milwaukee, the average high in January is 29°, while in July it’s 80°.) The standard deviation of temperatures in Milwaukee is much greater than in San Diego. This is reflected in the shapes of the respective bell curves: Milwaukee’s is shorter and wider—with a non-trivial number of days at the temperature extremes and a wide spread for all the other days—and San Diego’s is taller and narrower—with temperatures hovering in a tight range all year, and hence more days at each temperature recorded (which explains the relative heights of the curves).
Once we know the mean and standard deviation of a normal distribution, we know everything we need to know about it. There are three very useful facts about these curves that can be stated in terms of the mean and standard deviation (SD). As a matter of mathematical fact, 68.3% of the population depicted on the curve (whether they’re people with certain IQs, days on which certain temperatures were reached, measurements with a certain amount of error) falls within a range of one standard deviation on either side of the mean. So, for example, the mean IQ is 100; the standard deviation is 15. It follows that 68.3% of people have an IQ between 85 and 115—15 points (one SD) on either side of 100 (the mean). Another fact: 95.4% of the population depicted on a bell curve will fall within a range two standard deviations from the mean. So 95.4% of people have an IQ between 70 and 130—30 points (2 SDs) on either side of 100. Finally, 99.7% of the population falls within three standard deviations of the mean; 99.7% of people have IQs between 55 and 145. These ranges are called confidence intervals. (Pick a person at random. How confident are you that they have an IQ between 70 and 130? 95.4%, that’s how confident.) They are convenient reference points commonly used in statistical inference. (As a matter of fact, in current practice, other confidence intervals are more often used: 90%, (exactly) 95%, 99%, etc. These ranges lie on either side of the mean within non-whole-number multiples of the standard deviation. For example, the exactly-95% interval is 1.96 SDs to either side of the mean. The convenience of calculators and spreadsheets to do our math for us makes these confidence intervals more practical. But we’ll stick with the 68.3/95.4/99.7 intervals for simplicity’s sake.)
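The percentages above need not be taken on faith. The following sketch recovers them from the standard normal curve using Python’s built-in statistics module. The function names coverage and iq_range are our own, and the IQ mean of 100 and SD of 15 are the figures from the text.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def coverage(k):
    """Fraction of a normal population lying within k SDs of the mean."""
    return STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k)


def iq_range(k, mean=100, sd=15):
    """IQ scores lying k SDs below and above the mean."""
    return (mean - k * sd, mean + k * sd)


for k in (1, 2, 3, 1.96):
    low, high = iq_range(k)
    print(f"{k} SD: IQ {low:g} to {high:g}, coverage {coverage(k):.1%}")
```

The last pass through the loop uses 1.96 SDs, the multiple that the text associates with the exactly-95% interval.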
Statistical Inference: Hypothesis Testing
If we start with knowledge of the properties of a given normal distribution, we can test claims about the world to which that information is relevant. Starting with a bell curve—information of a general nature—we can draw conclusions about particular hypotheses. These are conclusions of inductive arguments; they are not certain, but more or less probable. When we use knowledge of normal distributions to draw them, we can be precise about how probable they are. This is inductive logic.
The basic pattern of the kinds of inferences we’re talking about is this: one formulates a hypothesis, then runs an experiment to test it; the test involves comparing the results of that experiment to what is known (some normal distribution); depending on how well the results of the experiment comport with what would be expected given the background knowledge represented by the bell curve, we draw a conclusion about whether or not the hypothesis is true.
Though they are applicable in a very wide range of contexts, it’s perhaps easiest to explain the patterns of reasoning we’re going to examine using examples from medicine. These kinds of cases are vivid; they aid in understanding by making the consequences of potential errors more real. Also, in these cases the hypotheses being tested are relatively simple: claims about individuals’ health—whether they’re healthy or sick, whether they have some condition or don’t—as opposed to hypotheses dealing with larger populations and measurements of their properties. Examining these simpler cases will allow us to see more clearly the underlying patterns of reasoning that cover all such instances of hypothesis testing, and to gain familiarity with the vocabulary statisticians use in their work.
The knowledge we start with is how some trait relevant to the particular condition is distributed in the population generally—a bell curve. (Again, the actual distribution may not be normal, but we will assume that it is in our examples. The basic patterns of reasoning are similar when dealing with different kinds of distributions.) The experiment we run is to measure the relevant trait in the individual whose health we’re assessing. A comparison between the result of this measurement and the known distribution of the trait tells us something about whether or not the person is healthy. Suppose we start with information about how a trait is distributed among people who are healthy. Hematocrit, for example, is a measure of how much of a person’s blood is taken up by red blood cells—expressed as a percentage (of total blood volume). Lower hematocrit levels are associated with anemia; higher levels are associated with dehydration, certain kinds of tumors, and other disorders. Among healthy men, the mean hematocrit level is 47%, with a standard deviation of 3.5%. We can draw the curve, noting the boundaries of the confidence intervals:
Because of the fixed mathematical properties of the bell curve, we know that 68.3% of healthy men have hematocrit levels between 43.5% and 50.5%; 95.4% of them are between 40% and 54%; and 99.7% of them are between 36.5% and 57.5%. Let’s consider a man whose health we’re interested in evaluating. Call him Larry. We take a sample of Larry’s blood and measure the hematocrit level. We compare it to the values on the curve to see if there might be some reason to be concerned about Larry’s health. Remember, the curve tells us the levels of hematocrit for healthy men; we want to know if Larry’s one of them. The hypothesis we’re testing is that Larry’s healthy. Statisticians often refer to the hypothesis under examination in such tests as the “null hypothesis”—a default assumption, something we’re inclined to believe unless we discover evidence against it. Anyway, we’re measuring Larry’s hematocrit; what kind of result should he be hoping for? Clearly, he’d like to be as close to the middle, fat part of the curve as possible; that’s where most of the healthy people are. The further away from the average healthy person’s level of hematocrit he strays, the more he’s worried about his health. That’s how these tests work: if the result of the experiment (measuring Larry’s hematocrit) is sufficiently close to the mean, we have no reason to reject the null hypothesis (that Larry’s healthy); if the result is far away, we do have reason to reject it.
How far away from the mean is too far away? It depends. A typical cutoff is two standard deviations from the mean—the 95.4% confidence interval. (Actually, the typical level is now exactly 95%, or 1.96 standard deviations from the mean. From now on, we’re just going to pretend that the 95.4% and 95% levels are the same thing.) That is, if Larry’s hematocrit level is below 40% or above 54%, then we might say we have reason to doubt the null hypothesis that Larry is healthy. The language statisticians use for such a result—say, for example, if Larry’s hematocrit came in at 38%—is to say that it’s “statistically significant”. In addition, they specify the level at which it’s significant—an indication of the confidence-interval cutoff that was used. In this case, we’d say Larry’s result of 38% is statistically significant at the .05 level. (95% = .95; 1 - .95 = .05) Either Larry is unhealthy (anemia, most likely), or he’s among the (approximately) 5% of healthy people who fall outside of the two standard-deviation range. If he came in at a level even further from the mean—say, 36%—we would say that this result is significant at the .003 level (99.7% = .997; 1 - .997 = .003). That would give us all the more reason to doubt that Larry is healthy.
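Larry’s test can be sketched in a few lines. This is an illustration under the text’s own simplification (two SDs treated as the .05 level, three SDs as the .003 level), not a clinical procedure. The function names are hypothetical, and the mean and SD are the figures given above for healthy men.

```python
HEALTHY_MEAN = 47.0  # mean hematocrit (%) among healthy men, from the text
HEALTHY_SD = 3.5     # standard deviation (%), from the text


def sds_from_mean(level):
    """How many standard deviations a reading lies from the healthy mean."""
    return (level - HEALTHY_MEAN) / HEALTHY_SD


def significance(level):
    """Strictest of the text's two levels at which the reading is significant.

    Outside 2 SDs counts as the .05 level and outside 3 SDs as the .003 level,
    following the text's simplification. Returns None when the reading gives
    no reason to reject the null hypothesis (that the man is healthy).
    """
    distance = abs(sds_from_mean(level))
    if distance > 3:
        return 0.003
    if distance > 2:
        return 0.05
    return None


for reading in (45.0, 38.0, 36.0):
    print(reading, round(sds_from_mean(reading), 2), significance(reading))
```

A reading of 38% sits about 2.6 SDs below the mean, so it is flagged at the .05 level only. A reading of 36% sits a little more than 3 SDs below, so it is flagged at the .003 level as well.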
So, when we’re designing a medical test like this, the crucial decision to make is where to set the cutoff. Again, typically that’s the 95% confidence interval. If a result falls outside that range, the person tests “positive” for whatever condition we’re on the lookout for. (Of course, a “positive” result is hardly positive news—in the sense of being something you want to hear.) But these sorts of results are not conclusive: it may be that the null hypothesis (this person is healthy) is true, and that they’re simply one of the relatively rare 5% who fall on the outskirts of the curve. In such a case, we would say that the test has given the person a “false positive” result: the test indicates sickness when in fact there is none. Statisticians refer to this kind of mistake as “type I error”. We could reduce the number of mistaken results our test gives by changing the confidence levels at which we give a positive result. Returning to the concrete example above: suppose Larry has a hematocrit level of 38%, but that he is not in fact anemic; since 38% is outside of the two standard-deviation range, our test would give Larry a false positive result if we used the 95% confidence level. However, if we raised the threshold of statistical significance to the three standard-deviation level of 99.7%, Larry would not get flagged for anemia; there would be no false positive, no type I error.
So we should always use the wider range on these kinds of tests to avoid false positives, right? Not so fast. There’s another kind of mistake we can make: false negatives, or type II errors. Increasing our range increases our risk of this second kind of foul-up. Down there at the skinny end of the curve there are relatively few healthy people. Sick people are the ones who generally have measurements in that range; they’re the ones we’re trying to catch. When we issue a false negative, we’re missing them. A false negative occurs when the test tells you there’s no reason to doubt the null hypothesis (that you’re healthy), when as a matter of fact you are sick. If we increase our range from two to three standard deviations—from the 95% level to the 99.7% level—we will avoid giving a false positive result to Larry, who is healthy despite his low 38% hematocrit level. But we will end up giving false reassurance to some anemic people who have levels similar to Larry’s; someone who has a level of 38% and is sick will get a false negative result if we only flag those outside the 99.7% confidence interval (36.5% - 57.5%).
This is a perennial dilemma in medical screening: how best to strike a balance between the two types of errors—between needlessly alarming healthy people with false positive results and failing to detect sickness in people with false negative results. The terms clinicians use to characterize how well diagnostic tests perform along these two dimensions are sensitivity and specificity. A highly sensitive test will catch a large number of cases of sickness—it has a high rate of true positive results; of course, this comes at the cost of increasing the number of false positive results as well. A test with a high level of specificity will have a high rate of true negative results—correctly identifying healthy people as such; the cost of increased specificity, though, is an increase in the number of false negative results—sick people that the test misses. Since every false positive is a missed opportunity for a true negative, increasing sensitivity comes at the cost of decreasing specificity. And since every false negative is a missed true positive, increasing specificity comes at the cost of decreasing sensitivity. A final bit of medical jargon: a screening test is accurate to the degree that it is both sensitive and specific.
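To make the two terms concrete: sensitivity is conventionally computed as the share of sick people who test positive, and specificity as the share of healthy people who test negative. The sketch below applies those definitions to two invented sets of counts. The numbers are purely hypothetical and chosen only to show the tradeoff; they describe no real test.

```python
def sensitivity(true_pos, false_neg):
    """Share of sick people the test correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of healthy people the test correctly clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical results for 100 sick and 900 healthy people under two cutoffs.
# These counts are made up for illustration.
loose_cutoff = {"true_pos": 95, "false_neg": 5, "true_neg": 810, "false_pos": 90}
strict_cutoff = {"true_pos": 80, "false_neg": 20, "true_neg": 882, "false_pos": 18}

for name, c in (("loose", loose_cutoff), ("strict", strict_cutoff)):
    sens = sensitivity(c["true_pos"], c["false_neg"])
    spec = specificity(c["true_neg"], c["false_pos"])
    print(f"{name}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

In this made-up case, tightening the cutoff spares 72 healthy people a false alarm, but 15 more sick people go undetected.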
Given sufficiently thorough information about the distributions of traits among healthy and sick populations, clinicians can rig their diagnostic tests to be as sensitive or specific as they like. But since those two properties pull in opposite directions, there are limits to the degree of accuracy that is possible. And depending on the particular case, it may be desirable to sacrifice specificity for more sensitivity, or vice versa.
To see how a screening test might be rigged to maximize sensitivity, let’s consider an abstract hypothetical example. Suppose we knew the distribution of a certain trait among the population of people suffering from a certain disease. (Contrast this with our starting point above: knowledge of the distribution among healthy individuals.) This kind of knowledge is common in medical contexts: various so-called biomarkers—gene mutations, proteins in the blood, etc.—are known to be indicative of certain conditions; often, one can know how such markers are distributed among people with the condition. Again, keeping it abstract and hypothetical, suppose we know that among people who suffer from Disease X, the mean level of a certain biomarker β for the disease is 20, with a standard deviation of 3. We can sum up this knowledge with a curve:
Now, suppose Disease X is very serious indeed. It would be a benefit to public health if we were able to devise a screening test that could catch as many cases as possible—a test with a high sensitivity. Given the knowledge we have about the distribution of β among patients with the disease, we can make our test as sensitive as we like. We know, as a matter of mathematical fact, that 68.3% of people with the disease have β-levels between 17 and 23; 95.4% of people with the disease have levels between 14 and 26; 99.7% have levels between 11 and 29. Given these facts, we can devise a test that will catch 99.7% of cases of Disease X like so: measure the level of biomarker β in people, and if they have a value between 11 and 29, they get a positive test result; a positive result is indicative of disease. This will catch 99.7% of cases of the condition, because the range chosen is three standard deviations on either side of the mean, and that range contains 99.7% of unhealthy people; if we flag everybody in that range, we will catch 99.7% of cases. Of course, we’ll probably end up catching a whole lot of healthy people as well if we cast our net this wide; we’ll get a lot of false positives. We could correct for this by making our test less sensitive, say by narrowing the range for a positive test to the two standard-deviation range of 14 – 26. We would now only catch 95.4% of cases of sickness, but we would reduce the number of healthy people given false positives; instead, they would get true negative results, increasing the specificity of our test.
Notice that the way we used the bell curve in our hypothetical test for Disease X was different from the way we used the bell curve in our test of hematocrit levels above. In that case, we flagged people as potentially sick when they fell outside of a range around the mean; in the new case, we flagged people as potentially sick when they fell inside a certain range. This difference corresponds to the differences in the two populations the respective distributions represent: in the case of hematocrit, we started with a curve depicting the distribution of a trait among healthy people; in the second case, we started with a curve telling us about sick people. In the former case, sick people will tend to be far from the mean; in the latter, they’ll tend to cluster closer.
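The contrast between the two tests can be put side by side in a short sketch. It assumes, as the examples do, that both traits are exactly normally distributed, with the means and standard deviations given in the text. The helper fraction_inside and the variable names are our own.

```python
from statistics import NormalDist

sick_beta = NormalDist(mu=20, sigma=3)      # biomarker β among Disease X patients
healthy_hct = NormalDist(mu=47, sigma=3.5)  # hematocrit (%) among healthy men


def fraction_inside(dist, low, high):
    """Share of the population described by dist that falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Disease X: the curve describes SICK people and we flag readings INSIDE the
# range, so the share inside the range is the share of cases caught.
caught_3sd = fraction_inside(sick_beta, 11, 29)
caught_2sd = fraction_inside(sick_beta, 14, 26)

# Hematocrit: the curve describes HEALTHY people and we flag readings OUTSIDE
# the range, so the share outside is the share of healthy men falsely flagged.
false_alarm_2sd = 1 - fraction_inside(healthy_hct, 40, 54)
false_alarm_3sd = 1 - fraction_inside(healthy_hct, 36.5, 57.5)

print(f"Disease X cases caught: {caught_3sd:.1%} (3 SD), {caught_2sd:.1%} (2 SD)")
print(f"Healthy men flagged: {false_alarm_2sd:.1%} (2 SD), {false_alarm_3sd:.1%} (3 SD)")
```

Note what the sketch cannot tell us: how many healthy people the Disease X test flags, or how many anemic men the hematocrit test misses. Each of those would require the other population’s curve, which these examples do not supply.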
The tension we’ve noted between sensitivity and specificity, between increasing the number of cases our diagnostic test catches and reducing the number of false positives it produces, can be seen when we show curves for healthy populations and sick populations in the same graph. There is a biomarker called alpha-fetoprotein in the blood serum of pregnant women. Low levels of this protein are associated with Down syndrome in the fetus; high levels are associated with neural tube defects like open spina bifida (spine isn’t completely inside the body) and anencephaly (hardly any of the brain/skull develops). These are serious conditions—especially those associated with the high levels: if the baby has open spina bifida, you need to be ready for that (with specialists and special equipment) at the time of birth; in cases of anencephaly, the fetus will not be viable (at worst) or will live without sensation or awareness (at best?). Early in pregnancy, these conditions are screened for. Since they’re so serious, you’d like to catch as many cases as possible. And yet, you’d like to avoid alarming false positive results for these conditions. The following chart, with bell curves for healthy babies, those with open spina bifida, and anencephaly, illustrates the difficult tradeoffs in making these sorts of decisions (Picture from a post at www.pregnancylab.net by David Grenache, PhD: http://www.pregnancylab.net/2012/11/...e-defects.html):
The vertical line at 2.5 MoM (multiples of the median) is the typical cutoff for a “positive” result (flagged for potential problems). On the one hand, there are substantial portions of the two curves representing the unhealthy populations—to the left of that line—that won’t be flagged by the test. Those are cases of sickness that we won’t catch—false negatives. On the other hand, there are a whole lot of healthy babies whose parents are going to be unnecessarily alarmed. The area of the “Unaffected” curve to the right of the line may not look like much, but these curves aren’t drawn on a linear scale. If they were, that curve would be much (much!) higher than the two for open spina bifida and anencephaly: those conditions are really rare; there are far more healthy babies. The upshot is, that tiny-looking portion of the healthy curve represents a lot of false positives.
Again, this kind of tradeoff between sensitivity and specificity often presents clinicians with difficult choices in designing diagnostic tests. They must weigh the benefits of catching as many cases as possible against the potential costs of too many false positives. Among the costs are the psychological impacts of getting a false positive. As a parent who experienced it, I can tell you getting news of potential open spina bifida or anencephaly is quite traumatic. (False positive: the baby was perfectly healthy.) But it could be worse. For example, when a biomarker for AIDS was first identified in the mid-1980s, people at the Centers for Disease Control considered screening for the disease among the entire population. The test was sensitive, so they knew they would catch a lot of cases. But they also knew that there would be a good number of false positives. Considering the hysteria that would likely arise from so many diagnoses of the dreaded illness (in those days, people knew hardly anything about AIDS; people were dying of a mysterious illness, and fear and misinformation were widespread), they decided against universal screening. Sometimes the negative consequences of false positives include financial and medical costs. In 2015, the American Cancer Society changed its recommendations for breast-cancer screening: instead of starting yearly mammograms at age 40, women should wait until age 45. (Except for those known to be at risk, who should start earlier.) This was a controversial decision. Afterwards, many women came forward to testify that their lives were saved by early detection of breast cancer, and that under the new guidelines they may not have fared so well. But against the benefit of catching those cases, the ACS had to weigh the costs of false-positive mammograms. The follow-up to a positive mammogram is often a biopsy; that’s an invasive surgical procedure, and costly. 
Contrast that with the follow-up to a positive result for open spina bifida/anencephaly: a non-invasive, cheap ultrasound. And unlike an ultrasound, the biopsy is sometimes quite difficult to interpret; you get some diagnoses of cancer when cancer is not present. Those women may go on to receive treatment—chemotherapy, radiation—for cancer that they don’t have. The costs and physical side effects of that are severe. (Especially perverse are the cases in which the radiation treatment itself causes cancer in a patient who didn’t have to be treated to begin with.) In one study, it was determined that for every life saved by mammography screening, there were 100 women who got false positives (and learned about it after a biopsy) and five women treated for cancer they didn’t have. (PC Gøtzsche and KJ Jørgensen, 2013, Cochrane Database of Systematic Reviews (6), CD001877.pub5)
The logic of statistical hypothesis testing is relatively clear. What’s not clear is how we ought to apply those relatively straightforward techniques in actual practice. That often involves difficult financial, medical, and moral decisions.
Statistical Inference: Sampling
When we were testing hypotheses, our starting point was knowledge about how traits were distributed among a large population—e.g., hematocrit levels among healthy men. We now ask a pressing question: how do we acquire such knowledge? How do we figure out how things stand with a very large population? The difficulty is that it’s usually impossible to check every member of the population. Instead, we have to make an inference. This inference involves sampling: instead of testing every member of the population, we test a small portion of the population—a sample—and infer from its properties to the properties of the whole. It’s a simple inductive argument:
The sample has property X.
Therefore, the general population has property X.
The argument is inductive: the premise does not guarantee the truth of the conclusion; it merely makes it more probable. As was the case in hypothesis testing, we can be precise about the probabilities involved, and our probabilities come from the good-old bell curve.
Let’s take a simple example. (I am indebted for this example in particular (and for much background on the presentation of statistical reasoning in general) to John Norton, 1998, How Science Works, New York: McGraw-Hill, pp. 12.14 – 12.15.) Suppose we were trying to discover the percentage of men in the general population; we survey 100 people, and it turns out there are 55 men in our sample. So, the proportion of men in our sample is .55. We’re trying to make an inference from this premise to a conclusion about the proportion of men in the general population. What’s the probability that the proportion of men in the general population is .55? This isn’t exactly the question we want to answer in these sorts of cases, though. Rather, we ask, what’s the probability that the true proportion of men in the general population is in some range on either side of .55? We can give a precise answer to this question; the answer depends on the size of the range you’re considering in a familiar way.
Given that our sample’s proportion of men is .55, it is relatively more likely that the true proportion in the general population is close to that number, less likely that it’s far away. For example, it’s more likely, given the result of our survey, that in fact 50% of the population is men than it is that only 45% are men. And it’s still less likely that only 40% are men. The same pattern holds in the opposite direction: it’s more likely that the true percentage of men is 60% than 65%. Generally speaking, the further away from our survey results we go, the less probable it is that we have the true value for the general population. The drop off in probabilities described takes the form of a bell curve:
The standard deviation of .05 is a function of our sample size of 100 (and of the mean, our result of .55; the mathematical details of the calculation needn’t detain us). We can use the usual confidence intervals—again, with 2 standard deviations, 95.4% being standard practice—to interpret the findings of our survey: we’re pretty sure—to the tune of 95%—that the general population is between 45% and 65% male.
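For readers who do want the details, here is a minimal sketch of the calculation. It assumes the usual normal approximation for a sample proportion, under which the standard deviation is the square root of p(1 − p)/n; the text does not spell this formula out, so treat it as our assumption. The function name `proportion_interval` is invented for illustration.

```python
import math

def proportion_interval(p_hat, n, num_sd=2.0):
    """Return (standard deviation, low end, high end) for a sample proportion.

    Assumes the normal approximation: sd = sqrt(p_hat * (1 - p_hat) / n).
    """
    sd = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return sd, p_hat - num_sd * sd, p_hat + num_sd * sd

# The toy survey: 55 men in a sample of 100 people.
sd, low, high = proportion_interval(0.55, 100)
print(f"standard deviation: about {sd:.3f}")
print(f"two-standard-deviation range: {low:.2f} to {high:.2f}")
```

Under these assumptions the standard deviation comes out just under .05, and two standard deviations on either side of .55 gives the range of roughly 45% to 65%.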
That’s a pretty wide range. Our result is not that impressive (especially considering the fact that we know the actual number is very close to 50%). But that’s the best we can do given the limitations of our survey. The main limitation, of course, was the size of our sample: 100 people just isn’t very many. We could narrow the range within which we’re 95% confident if we increased our sample size; doing so would likely (though not certainly) give us a proportion in our sample closer to the true value of (approximately) .5. The relationship between the sample size and the width of the confidence intervals is a purely mathematical one. As sample size goes up, standard deviation goes down—the curve narrows:
The pattern of reasoning on display in our toy example is the same as that used in sampling generally. Perhaps the most familiar instances of sampling in everyday life are public opinion surveys. Rather than trying to determine the proportion of people in the general population who are men (not a real mystery), opinion pollsters try to determine the proportion of a given population who, say, intend to vote for a certain candidate, or approve of the job the president is doing, or believe in Bigfoot. Pollsters survey a sample of people on the question at hand, and end up with a result: 29% of Americans believe in Bigfoot, for example. (Here’s an actual survey with that result: angusreidglobal.com/wp-conten...3.04_Myths.pdf)
But the headline number, as we have seen, doesn’t tell the whole story. 29% of the sample (in this case, about 1,000 Americans) reported believing in Bigfoot; it doesn’t follow with certainty that 29% of the general population (all Americans) have that belief. Rather, the pollsters have some degree of confidence (again, 95% is standard) that the actual percentage of Americans who believe in Bigfoot is in some range around 29%. You may have heard the “margin of error” mentioned in connection with such surveys. This phrase refers to the very range we’re talking about. In the survey about Bigfoot, the margin of error is 3%. (Actually, it’s 3.1%, but never mind.) That’s the distance between the mean (the 29% found in the sample) and the ends of the two standard-deviation confidence interval—the range in which we’re 95% sure the true value lies. Again, this range is just a mathematical function of the sample size: if the sample size is around 100, the margin of error is about 10% (see the toy example above: 2 SDs = .10); if the sample size is around 400, you get that down to 5%; at 600, you’re down to 4%; at around 1,000, 3%; to get down to 2%, you need around 2,500 in the sample, and to get down to 1%, you need 10,000. (Interesting mathematical fact: these relationships hold no matter how big the general population from which you’re sampling (as long as it’s above a certain threshold). It could be the size of the population of Wisconsin or the population of China: if your sample is 600 Wisconsinites, your margin of error is 4%; if it’s 600 Chinese people, it’s still 4%. This is counterintuitive, but true—at least, in the abstract. We’re omitting the very serious difficulty that arises in actual polling (which we will discuss anon): finding the right 600 Wisconsinites or Chinese people to make your survey reliable; China will present more difficulty than Wisconsin.)
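The sample sizes and margins of error just listed can be checked with a short sketch. It rests on assumptions the text does not state: the worst-case proportion of .5 (which gives the widest margin), two standard deviations for the 95% level, and the standard finite population correction for the population-size comparison. The function name `margin_of_error` and the two population figures are ours, and the population figures are only rough.

```python
import math

def margin_of_error(n, p=0.5, num_sd=2.0, population=None):
    """Margin of error for a sample of size n.

    p=0.5 is the worst case; num_sd=2 follows the two-standard-deviation
    convention. If a population size is given, the finite population
    correction sqrt((N - n) / (N - 1)) is applied.
    """
    sd = math.sqrt(p * (1.0 - p) / n)
    if population is not None:
        sd *= math.sqrt((population - n) / (population - 1))
    return num_sd * sd

sample_sizes = [100, 400, 600, 1000, 2500, 10000]
margins = {n: margin_of_error(n) for n in sample_sizes}
for n, moe in margins.items():
    print(f"n = {n:>6}: margin of error about {moe * 100:.1f}%")

# Rough, assumed population sizes, used only to show how little they matter.
moe_smaller_population = margin_of_error(600, population=6_000_000)
moe_larger_population = margin_of_error(600, population=1_400_000_000)
print(f"{moe_smaller_population * 100:.2f}% vs {moe_larger_population * 100:.2f}%")
```

With these assumptions the margin works out to 1 divided by the square root of the sample size, which is why quadrupling the sample only halves the margin, and why the size of the population barely registers once it dwarfs the sample.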
So the real upshot of the Bigfoot survey result is something like this: somewhere between 26% and 32% of Americans believe in Bigfoot, and we’re 95% sure that’s the correct range; or, to put it another way, we used a method for determining the true proportion of Americans who believe in Bigfoot that can be expected to determine a range in which the true value actually falls 95% of the time, and the range that resulted from our application of the method on this occasion was 26% - 32%.
That last sentence, we must admit, would make for a pretty lousy newspaper headline (“29% of Americans believe in Bigfoot!” is much sexier), but it’s the most honest presentation of what the results of this kind of sampling exercise actually show. Sampling gives us a range, which will be wider or narrower depending on the size of the sample, and not even a guarantee that the actual value is within that range. That’s the best we can do; these are inductive, not deductive, arguments.
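The claim that the method captures the true value about 95% of the time, with no guarantee on any single occasion, can be illustrated with a small simulation. This is a sketch under stated assumptions: we pretend to know that the true proportion is .29 (a hypothetical value chosen to echo the example), we assume every poll draws a perfectly random sample of 1,000, and the function names are invented for illustration.

```python
import math
import random

def poll_interval(true_p, n, rng, num_sd=2.0):
    """Simulate one poll of n people and return its (low, high) range."""
    hits = sum(1 for _ in range(n) if rng.random() < true_p)
    p_hat = hits / n
    sd = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - num_sd * sd, p_hat + num_sd * sd

def coverage(true_p, n, trials, seed=0):
    """Fraction of simulated polls whose range contains true_p."""
    rng = random.Random(seed)
    captured = 0
    for _ in range(trials):
        low, high = poll_interval(true_p, n, rng)
        if low <= true_p <= high:
            captured += 1
    return captured / trials

# Hypothetical true value; in real polling this is exactly what we don't know.
rate = coverage(true_p=0.29, n=1000, trials=2000)
print(f"ranges containing the true value: {rate:.1%}")
```

Run repeatedly with different seeds, the share of ranges that contain the true value should hover near 95%; the remaining polls miss, even though nothing went wrong in conducting them.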
Finally, on the topic of sampling, we should acknowledge that in actual practice, polling is hard. The mathematical relationships between sample size and margin of error/confidence that we’ve noted all hold in the abstract, but real-life polls can have errors that go beyond these theoretical limitations on their accuracy. As the 2016 U.S. presidential election—and the so-called “Brexit” vote in the United Kingdom that same year, and many, many other examples throughout the history of public opinion polling—showed us, polls can be systematically in error. The kinds of facts we’ve been stating—that with a sample size of 600, a poll has a margin of error of 4% at the 95% confidence level—hold only on the assumption that there’s a systematic relationship between the sample and the general population it’s meant to represent; namely, that the sample is representative. A representative sample mirrors the general population; in the case of people, this means that the sample and the general population have the same demographic make-up—same percentage of old people and young people, white people and people of color, rich people and poor people, etc., etc. Polls whose samples are not representative are likely to misrepresent the feature of the population they’re trying to capture. Suppose I wanted to find out what percentage of the U.S. population thinks favorably of Donald Trump. If I asked 1,000 people in, say, rural Oklahoma, I’d get one result; if I asked 1,000 people in midtown Manhattan, I’d get a much different result. Neither of those two samples is representative of the population of the United States as a whole. To get such a sample, I’d have to be much more careful about whom I surveyed. A famous example from the history of public polling illustrates the difficulties here rather starkly: in the 1936 U.S. presidential election, the contenders were Republican Alf Landon of Kansas, and the incumbent President Franklin D. Roosevelt.
A (now-defunct) magazine, Literary Digest, conducted a poll with 2.4 million (!) participants, and predicted that Landon would win in a landslide. Instead, he lost in a landslide; FDR won the second of his four presidential elections. What went wrong? With a sample size so large, the margin of error would be tiny. The problem was that their sample was not representative of the American population. They chose participants randomly from three sources: (a) their list of subscribers; (b) car registration forms; and (c) telephone listings. The problem with this selection procedure is that all three groups tended to be wealthier than average. This was 1936, during the depths of the Great Depression. Most people didn’t have enough disposable income to subscribe to magazines, let alone have telephones or own cars. The survey therefore over-sampled Republican voters and got skewed results. Even a large and seemingly random sample can lead one astray. This is what makes polling so difficult: finding representative samples is hard. (It’s even harder than this paragraph makes it out to be. It’s usually impossible for a sample—the people you’ve talked to on the phone about the president or whatever—to mirror the demographics of the population exactly. So pollsters have to weight the responses of certain members of their sample more than others to make up for these discrepancies. This is more art than science. Different pollsters, presented with the exact same data, will make different choices about how to weight things, and will end up reporting different results. See this fascinating piece for an example: www.nytimes.com/interactive/2...about.html_r=0)
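A toy simulation can show why a huge unrepresentative sample does worse than a small representative one. Every number in it is invented for illustration and none is historical data: we assume 30% of voters are wealthy, that 70% of wealthy voters and 35% of everyone else back a hypothetical candidate A, and that the big poll reaches only wealthy voters.

```python
import random

# All figures below are invented for illustration, not historical data.
WEALTHY_SHARE = 0.30    # assumed share of voters who are wealthy
SUPPORT_WEALTHY = 0.70  # assumed support for candidate A among the wealthy
SUPPORT_OTHERS = 0.35   # assumed support for candidate A among everyone else

TRUE_SUPPORT = (WEALTHY_SHARE * SUPPORT_WEALTHY
                + (1 - WEALTHY_SHARE) * SUPPORT_OTHERS)

def poll(n, rng, wealthy_only):
    """Return the share of n sampled voters who back candidate A."""
    backers = 0
    for _ in range(n):
        # A biased poll reaches only wealthy voters; a fair one reaches
        # wealthy voters in proportion to their share of the population.
        wealthy = wealthy_only or rng.random() < WEALTHY_SHARE
        support = SUPPORT_WEALTHY if wealthy else SUPPORT_OTHERS
        if rng.random() < support:
            backers += 1
    return backers / n

rng = random.Random(1936)
biased_result = poll(200_000, rng, wealthy_only=True)
representative_result = poll(1_000, rng, wealthy_only=False)

print(f"true support:                    {TRUE_SUPPORT:.1%}")
print(f"biased poll of 200,000:          {biased_result:.1%}")
print(f"representative poll of 1,000:    {representative_result:.1%}")
```

In this sketch the large poll lands very precisely on the wrong number, because its tiny margin of error describes wealthy voters only. The small poll is noisier but centered on the true value.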
Other practical difficulties with polling are worth noting. First, the way your polling question is worded can make a big difference in the results you get. As we discussed in Chapter 2, the framing of an issue—the words used to specify a particular policy or position—can have a dramatic effect on how a relatively uninformed person will feel about it. If you wanted to know the American public’s opinion on whether or not it’s a good idea to tax the transfer of wealth to the heirs of people whose holdings are more than $5.5 million or so, you’d get one set of responses if you referred to the policy as an “estate tax”, a different set of responses if you referred to it as an “inheritance tax”, and a still different set if you called it the “death tax”. A poll of Tennessee residents found that 85% opposed “Obamacare”, while only 16% opposed “Insure Tennessee” (they’re the same thing, of course). (Source: http://www.nbcnews.com/politics/elec...-power-n301031) Even slight changes in the wording of questions can alter the results of an opinion poll. This is why the polling firm Gallup hasn’t changed the wording of its presidential-approval question since the 1930s. They always ask: “Do you approve or disapprove of the way [name of president] is handling his job as President?” A deviation from this standard wording can produce different results. The polling firm Ipsos found that its polls were more favorable than others’ for the president. They traced the discrepancy to the different way they worded their question, giving an additional option: “Do you approve, disapprove, or have mixed feelings about the way Barack Obama is handling his job as president?” (spotlight.ipsos-na.com/index....on-wording-on-levels-of-presidential-support/) A conjecture: Obama’s approval rating would go down if pollsters included his middle name (Hussein) when asking the question. Small changes can make a big difference.
Another difficulty with polling is that some questions are harder to get reliable data about than others, simply because they involve topics about which people tend to be untruthful. Asking someone whether he approves of the job the president is doing is one thing; asking him whether or not he’s ever cheated on his taxes, say, is quite another. He’s probably not shy about sharing his opinion on the former question; he’ll be much more reluctant to be truthful on the latter (assuming he’s ever fudged things on his tax returns). There are lots of things it would be difficult to discover for this reason: how often people floss, how much they drink, whether or not they exercise, their sexual habits, and so on. Sometimes this reluctance to share the truth about oneself is quite consequential: some experts think that the reason polls failed to predict the election of Donald Trump as president of the United States in 2016 was that some of his supporters were “shy”—unwilling to admit that they supported the controversial candidate. (See here, for example: https://www.washingtonpost.com/news/...=.f20212063a9c) They had no such qualms in the voting booth, however.
Finally, who’s asking the question—and the context in which it’s asked—can make a big difference. People may be more willing to answer questions in the relative anonymity of an online poll, slightly less willing in the somewhat more personal context of a telephone call, and still less forthcoming in a face-to-face interview. Pollsters use all of these methods to gather data, and the results vary accordingly. Of course, these factors become especially relevant when the question being polled is a sensitive one, or something about which people tend not to be honest or forthcoming. To take an example: the best way to discover how often people truly floss is probably with an anonymous online poll. People would probably be more likely to lie about that over the phone, and still more likely to do so in a face-to-face conversation. The absolute worst source of data on that question, perversely, would probably be the people who most frequently ask it: dentists and dental hygienists. Every time you go in for a cleaning, they ask you how often you brush and floss; and if you’re like most people, you lie, exaggerating the assiduity with which you attend to your dental-health maintenance (“I brush after every meal and floss twice a day, honest.”).
As was the case with hypothesis testing, the logic of statistical sampling is relatively clear. Things get murky, again, when straightforward abstract methods confront the confounding factors involved in real-life application.
1. I and a bunch of my friends are getting ready to play a rousing game of “army men”. Together, we have 110 of the little plastic toy soldiers—enough for quite a battle. However, some of us have more soldiers than others. Will, Brian and I each have 25; Roger and Joe have 11 each; Dan has 4; John and Herb each have 3; Mike, Jamie, and Dennis have only 1 each.
(a) What is the mean number of army men held? What’s the median?
(b) Jamie, for example, is perhaps understandably disgruntled about the distribution; I, on the other hand, am satisfied with the arrangement. In defending our positions, each of us might refer to the “average person” and the number of army men he has. Which sense of ‘average’—mean or median—should Jamie use to gain a rhetorical advantage? Which sense should I use?
2. Consider cats and dogs—the domesticated kind, pets (tigers don’t count). Suppose I produced a histogram for a very large number of pet cats based on their weight, and did the same for pet dogs. Which distribution would have the larger standard deviation?
3. Men’s heights are normally distributed, with a mean of about 70 inches and a standard deviation of about 3 inches. 68.3% of men fall within what range of heights? Where do 95.4% of them fall? 99.7%? My father-in-law was 76 inches tall. What percentage of men were taller than he was?
4. Women, on average, have lower hematocrit levels than men. The mean for healthy women is 42%, with a standard deviation of 3%. Suppose we want to test the null hypothesis that Alice is healthy. What are the hematocrit readings above which and below which Alice’s test result would be considered significant at the .05 level?
5. Among healthy people, the mean (fasting) blood glucose level is 90 mg/dL, with a standard deviation of 9 mg/dL. What are the levels at the high and low end of the 95.4% confidence interval? Recently, I had my blood tested and got a result of 100 mg/dL. Is this result significant at the .05 level? My result was flagged as being potentially indicative of my being “pre-diabetic” (high blood glucose is a marker for diabetes). My doctor said this is a new standard, since diabetes is on the rise lately, but I shouldn’t worry because I wasn’t overweight and was otherwise healthy. Compared to a testing regime that only flags patients outside the two standard-deviation confidence interval, does this new practice of flagging results at 100 mg/dL increase or decrease the sensitivity of the diabetes screening? Does it increase or decrease its specificity?
6. A stroke occurs when blood fails to reach a part of the brain because of an obstruction of a blood vessel. Often the obstruction is due to atherosclerosis—a hardening/narrowing of the arteries from plaque buildup. Strokes can be really bad, so it would be nice to predict them. Recent research has sought a potentially predictive biomarker, and one study found that among stroke victims there was an unusually high level of an enzyme called myeloperoxidase: the mean was 583 pmol/L, with a standard deviation of 48 pmol/L. (See this study: https://www.ncbi.nlm.nih.gov/pubmed/21180247) Suppose we wanted to devise a screening test on the basis of this data. To guarantee that we caught 99.7% of potential stroke victims, what range of myeloperoxidase levels should get a “positive” test result? If the mean level of myeloperoxidase among healthy people is 425 pmol/L, with a standard deviation of 36 pmol/L, approximately what percentage of healthy people will get a positive result from our proposed screening test?
7. I survey a sample of 1,000 Americans (assume it’s representative) and 43% of them report that they believe God created human beings in their present form less than 10,000 years ago. (See this survey: http://www.gallup.com/poll/27847/Maj...Evolution.aspx) At the 95% confidence level, what is the range within which the true percentage probably lies?
8. Volunteer members of Mothers Against Drunk Driving conducted a door-to-door survey in a college dormitory on a Saturday night, and discovered that students drink an average of two alcoholic beverages per week. What are some reasons to doubt the results of this survey?