Itemoids ★ Polls Ahead

www.theatlantic.com › ideas › archive › 2024 › 10 › presidential-polls-unreliable › 680408

Well, it’s that time again: Millions of Americans are stress-eating while clicking “Refresh” on 538’s presidential forecast, hoping beyond hope that the little red or blue line will have made a tiny tick upward. Some may be clutching themselves in the fetal position, chanting under their breath: “There’s a good new poll out of Pennsylvania.”

The stakes of this election are sky-high, and its outcome is not knowable in advance—a combination that most of us find deeply discomfiting. People crave certainty, and there’s just one place to look for it: in the data. Earlier humans might have turned to oracles or soothsayers; we have Nate Silver. But the truth is that polling—and the models that rely primarily on polling to forecast the election result—cannot confidently predict what will happen on November 5.

The widespread perception that polls and models are raw snapshots of public opinion is simply false. In fact, the data are significantly massaged based on possibly reasonable, but unavoidably idiosyncratic, judgments made by pollsters and forecasting sages, who interpret and adjust the numbers before presenting them to the public. They do this because random sampling has become very difficult in the digital age, for reasons I’ll get into; the numbers would not be representative without these corrections, but every one of them also introduces a margin for human error.

Most citizens see only the end product: a preposterously precise statistic, such as the notion that Donald Trump has a 50.2 percent (not 50.3 percent, mind you) chance of winning the presidency. (Why stop there? Why not go to three decimal places?) Such numerical precision gives the false impression of certainty where there is none.

Early American political polls were unscientific but seemingly effective. In the early 20th century, The Literary Digest, a popular magazine in its day, sent sample ballots to millions of its readers. By this method, the magazine correctly predicted the winner of every presidential election from 1916 until 1936. In that year, for the contest between Franklin D. Roosevelt and Alf Landon, the Digest sent out roughly 10 million sample ballots and received an astonishing 2.4 million back (a response rate of 24 percent would be off the charts by modern standards). Based on those responses, the Digest predicted that FDR would receive a drubbing, winning just 41 percent of the vote. Instead, he won 61 percent, carrying all but two states. Readers lost faith in the Digest (it went out of business two years later).

The conventional wisdom was that the poll failed because in addition to its readers, the Digest selected people from directories of automobile and telephone ownership, which skewed the sample toward the wealthy—particularly during the Great Depression, when cars and phones were luxuries. That is likely part of the explanation, but more recent analysis has pointed to a different problem: who responded to the poll and who didn’t. For whatever reason, Landon supporters were far more likely than FDR supporters to send back their sample ballots, making the poll not just useless, but wildly misleading. This high-profile error cleared the way for more “scientific” methods, such as those pioneered by George Gallup, among others.

The basic logic of the new, more scientific method was straightforward: If you can generate a truly random sample from the broader population you are studying—in which every person has an equally likely chance of being included in the poll—then you can derive astonishingly accurate results from a reasonably small number of people. When those assumptions are correct and the poll is based on a truly random sample, pollsters need only about 1,000 people to produce a result with a margin of error of plus or minus three percentage points.
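
The arithmetic behind that figure can be sketched in a few lines of Python. This is a minimal illustration using the standard normal-approximation formula for a proportion at 95 percent confidence; the function name `margin_of_error` is an invention for this example, and the calculation holds only under the truly random sample described above.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    n: number of respondents in a simple random sample
    p: assumed true share (0.5 is the worst case, giving the widest margin)
    z: critical value (1.96 corresponds to 95 percent confidence)
    """
    return z * math.sqrt(p * (1 - p) / n)

# Compare a few sample sizes; 1,000 respondents gives roughly three points.
for n in (250, 1000, 4000):
    print(f"n={n}: +/- {100 * margin_of_error(n):.1f} points")
```

Because the margin shrinks with the square root of the sample size, quadrupling the number of respondents only halves the margin, which is why pollsters rarely bother going far beyond 1,000 people.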

To produce reasonably unbiased samples, pollsters would randomly select people from the telephone book and call them. But this method became problematic when some people began making their phone numbers unlisted; these people shared certain demographic characteristics, so their absence skewed the samples. Then cellphones began to replace landlines, and pollsters started using “random-digit dialing,” which ensured that every active line had an equal chance of being called. For a while, that helped.

But the matter of whom pollsters contacted was not the only difficulty. Another was how those people responded, and why. A distortion known as social-desirability bias is the tendency of respondents to lie to pollsters about their likely voting behavior. In America, that problem was particularly acute around race: If a campaign pitted a minority candidate against a white candidate, some white respondents might lie and say that they’d vote for the minority candidate to avoid being perceived as racist. This phenomenon, contested by some scholars, is known as the Bradley Effect, named after former Los Angeles Mayor Tom Bradley—a Black politician who was widely tipped to become governor of California based on pre-election polling, but narrowly lost instead. To deal with the Bradley Effect, many pollsters switched from live callers to robocalls, hoping that voters would be more honest with a computer than another person.

But representative sampling has continued to become more difficult. In an age of caller ID and smartphones, along with persistent junk and nuisance calls, few people answer when they see unfamiliar numbers. Most Americans spend much of their time online, but there are no reliable methods to get a truly random sample from the internet. (Consider, for example, how subscribers of The Atlantic differ from the overall American population, and it’s obvious why a digital poll on this site would be worthless at making predictions about the overall electorate.)

These shifts in technology and social behavior have created an enormous problem known as nonresponse bias. Some pollsters release not just findings but total numbers of attempted contacts. Take, for example, this 2018 New York Times poll within Michigan’s Eighth Congressional District. The Times reports that it called 53,590 people in order to get 501 responses. That’s a response rate lower than 1 percent, meaning that the Times pollsters had to call roughly 107 people just to get one person to answer their questions. What are the odds that those rare few who answered the phone are an unskewed, representative sample of likely voters? Zilch. As I often ask my undergraduate students: How often do you answer when you see an unknown number? Now, how often do you think a lonely elderly person in rural America answers their landline? If there’s any systematic difference in behavior, that creates a potential polling bias.

To cope, pollsters have adopted new methodologies. As the Pew Research Center notes, 61 percent of major national pollsters used different approaches in 2022 than they did in 2016. This means that when Americans talk about “the polls” being off in past years, we’re not comparing apples with apples. One new polling method is to send text messages with links to digital surveys. (Consider how often you’d click a link from an unknown number to understand just how problematic that method is.) Many pollsters rely on a mix of approaches. Some have started using online “opt-in” methods, in which respondents choose to take a survey and are typically paid a small amount for participating. This technique, too, has raised reasonable questions about accuracy: One of my colleagues at University College London, Thomas Gift, tested opt-in methods and found that nearly 82 percent of participants in his survey likely lied about themselves in order to qualify for the poll and get paid. Pew further found that online opt-in polls do a poor job of capturing the attitudes of young people and Hispanic Americans.

No matter the method, a pure, random sample is now an unattainable ideal; even the aspiration is a relic of the past. To compensate, some pollsters try to design samples representative of known demographics. One common approach, stratification, is to divide the electorate into subgroups by gender, race, age, etc., and ensure that the sample includes enough of each “type” of voter. Another involves weighting some categories of respondents differently from others, to match presumptions about the broader electorate. For example, if a polling sample had 56 percent women, but the pollster believed that the eventual electorate would be 52 percent women, they might weight male respondents slightly more heavily, and female respondents slightly less, in the adjusted results.
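
The weighting example can be made concrete with a rough sketch. The 56 and 52 percent figures come from the example above; the candidate-support numbers and the function name `reweight` are hypothetical, chosen only to show the mechanics, and real pollsters weight on many variables at once rather than on gender alone.

```python
def reweight(sample_share, target_share, support):
    """Return per-group weights plus the raw and weighted topline support."""
    # Each group's weight is its assumed share of the electorate
    # divided by its share of the people who actually answered.
    weights = {g: target_share[g] / sample_share[g] for g in sample_share}
    raw = sum(sample_share[g] * support[g] for g in sample_share)
    weighted = sum(sample_share[g] * weights[g] * support[g] for g in sample_share)
    return weights, raw, weighted

sample_share = {"women": 0.56, "men": 0.44}  # who answered the poll
target_share = {"women": 0.52, "men": 0.48}  # the pollster's turnout assumption
support = {"women": 0.55, "men": 0.45}       # hypothetical support for one candidate

weights, raw, weighted = reweight(sample_share, target_share, support)
print(weights)
print(f"raw {raw:.3f}, weighted {weighted:.3f}")
```

In this sketch each man counts for about 1.09 respondents and each woman for about 0.93. The published number depends directly on `target_share`, which is a judgment about the future rather than a measurement.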

The problem, of course, is that nobody knows who will actually show up to vote on November 5. So these adjustments may be justified, but they are inherently subjective, introducing another possible source of human bias. If women come out to vote in historically high numbers in the aftermath of the Supreme Court’s Dobbs decision, for example, the weighting could be badly off, causing a major polling error.

The bottom line is that modern pollsters are trying to correct for known forms of possible bias in their samples by making subjective adjustments to the data. If their judgments are correct, then their polls might be accurate. But there’s no way to know beforehand whether their assumptions about, say, turnout by demographic group are wise or not.

Forecasters then take that massaged polling data and feed it into a model that’s curated by a person—or team of people—who makes further subjective assessments. For example, the 538 model adjusts its forecasts based on polls plus what some in the field call “the fundamentals,” such as historical trends around convention polling bounces, or underlying economic data. Most forecasters also weight data based on how particular pollsters performed in earlier elections. Each adjustment is an educated guess based on past patterns. But nobody knows for sure whether past patterns are predictive of future results. Enough is extraordinary about this race to suspect that they may not be.

More bad news: Modern polling often misses the mark even when trying to convey uncertainty, because pollsters grossly underestimate their margins of error. Most polls report a margin of error of, say, plus or minus three percentage points at a 95 percent confidence level. This means that if a poll reports that Trump has the support of 47 percent of the electorate, then the reported margin of error suggests that the “real” number likely lies between 44 percent (minus three) and 50 percent (plus three). If the margin of error is correct, that spread of 44 to 50 should capture the actual result of the election about 95 percent of the time. But the reality is less reassuring.

In a 2022 research paper titled “Election Polls Are 95 Percent Confident but Only 60 Percent Accurate,” Aditya Kotak and Don Moore of UC Berkeley analyzed 6,000 polls from 2008 through 2020. They found that even with just one week to go before Election Day, only about six in 10 polls captured the end result within their stated margin of error. Four in 10 times, the polling data fell outside that window. The authors conclude that to justify a 95 percent confidence interval, pollsters should “at least double” their reported margins of error—a move that would be statistically wise but render polling virtually meaningless in close elections. After all, if a margin of error doubled to six percentage points, then a poll finding that Harris had 50 percent support would indicate that the “true” number was somewhere between 44 percent (a Trump landslide) and 56 percent (a Harris landslide).
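
One way to see why understated margins produce so many misses is a toy calculation. It assumes, purely for illustration, that a poll's total error is normally distributed and centered on the truth, but some multiple wider than the sampling error that the stated margin reflects. This is not necessarily the model Kotak and Moore used, and the function name `actual_coverage` is hypothetical.

```python
from statistics import NormalDist

def actual_coverage(error_ratio, z=1.96):
    """Share of polls whose stated 95 percent interval contains the result.

    error_ratio: true error spread divided by the sampling-only spread
                 behind the stated margin (1.0 means the margin is honest).
    """
    return 2 * NormalDist().cdf(z / error_ratio) - 1

for ratio in (1.0, 1.5, 2.0, 2.5):
    print(f"true error {ratio}x stated: {actual_coverage(ratio):.0%} coverage")
```

Under this simplified assumption, an interval that captures the result only about 60 percent of the time implies real errors somewhat more than twice as wide as advertised, which is in line with the authors' advice to at least double the margin.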

Alas, the uncertainty doesn’t end there. Unlike many other forms of measurement, polls can change what they’re measuring. Sticking a thermometer outside doesn’t make the weather hotter or colder. But poll numbers can and do shift voting behavior. For example, studies have shown that perceived poll momentum can make people more likely to vote for the surging party or candidate in a “bandwagon” effect. Take the 2012 Republican primaries, when social conservatives sought an alternative to Mitt Romney and were split among candidates. A CNN poll conducted the night before the Iowa caucus showed Rick Santorum in third place. Santorum went on to win the caucus, likely because voters concluded from the poll that he was the most electable challenger.

The truth is that even after election results are announced, we may not really know which forecasters were “correct.” Just as The Literary Digest accurately predicted the winner of presidential races with a deeply flawed methodology, sometimes a bad approach is just lucky, creating the illusion of accuracy. And neither polling nor electoral dynamics are stable over time. Polling methodology has shifted radically since 2008; voting patterns and demographics are ever-changing too. Heck, Barack Obama won Indiana in 2008; recent polls suggest that Harris is losing there by as much as 17 points. National turnout was 55 percent in 2016 and 63 percent in 2020. Polls are trying to hit a moving target with instruments that are themselves constantly changing. For all of these reasons, a pollster who was perfectly accurate in 2008 could be wildly off in 2024.

In other words, presidential elections are rare, contingent, one-off events. Predicting their outcome does not yield enough comparable data points to support any pollster’s claim to exceptional foresight, rather than luck. Trying to evaluate whether a forecasting model is “good” just from judging its performance on the past four presidential elections is a bit like trying to figure out whether a coin is “fair” or “rigged” from just four coin flips. It’s impossible.
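
The coin-flip comparison can be checked with a short binomial calculation. As a deliberate simplification, the sketch treats each election as a single right-or-wrong call; the 70 percent "skilled" forecaster is a hypothetical figure, and `prob_at_least` is a name made up for this example.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent tries, each with chance p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 4  # the past four presidential elections
for label, p in (("coin-flip guesser", 0.5), ("skilled forecaster", 0.7)):
    print(f"{label}: P(at least 3 of {n} right) = {prob_at_least(3, n, p):.2f}")
```

A pure guesser gets at least three of four calls right nearly a third of the time, so a strong-looking record over four elections says little about whether a model is actually good.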

The social scientists Justin Grimmer, Dean Knox, and Sean Westwood recently published research supporting this conclusion. They write: “We demonstrate that scientists and voters are decades to millennia away from assessing whether probabilistic forecasting provides reliable insights into election outcomes.” (Their research has sparked fierce debate among scholars about the wisdom of using probabilistic forecasting to measure rare and idiosyncratic events such as presidential elections.)

Probabilistic presidential forecasts are effectively unfalsifiable in close elections, meaning that they can’t be proved wrong. Nate Silver’s model in 2016 suggested that Hillary Clinton had a 71.4 percent chance of victory. That wasn’t necessarily “wrong” when she lost: After all, as Silver pointed out to the Harvard Gazette, events with a 28.6 percent probability routinely happen—more frequently than one in four times. So was his 2016 presidential model “wrong”? Or was it bang-on accurate, but an unusual, lower-probability event took place? There’s no way of knowing for sure.

The pollsters and forecasters who are studying the 2024 election are not fools. They are skilled analysts attempting some nearly impossible wizardry by making subjective adjustments to control for possible bias while forecasting an uncertain future. Their data suggest that the race is a nail-biter—and that may well be the truth. But nobody—not you, not me, not the betting markets, not Nate Silver—knows what’s going to happen on November 5.