
Superforecasting: The Art and Science of Prediction

“The average expert was roughly as accurate as a dart-throwing chimpanzee.”

That’s the punch line to a study that lasted twenty years. In 1984 Philip Tetlock, then an Associate Professor at the University of California, Berkeley, gathered nearly 300 experts and asked them to make predictions about the economy, wars, elections, and other issues of the day. Two decades later, Tetlock collected the data and measured the results. The average expert would have done just as well by randomly guessing—hence the dart-throwing chimp.

The punch line is memorable but misleading. It turned out that the experts performed better than chance in the short term. It was questions about what would happen in three to five years that exposed their feeble powers of prediction. No one foresaw the collapse of the Soviet Union.

Tetlock also found that “foxes” performed better than “hedgehogs.” The distinction comes from a 1953 essay by Isaiah Berlin, “The Hedgehog and the Fox,” that popularized an aphorism by the ancient Greek poet Archilochus—“A fox knows many things, but a hedgehog one important thing.” Tetlock says that foxes are like a dragonfly’s eye, constantly aggregating dozens of different perspectives to create a coherent view of the world.

Unfortunately, when Tetlock published his findings in Expert Political Judgment in 2006—“the most comprehensive assessment of expert judgment in the scientific literature”—the dart-throwing chimp line stuck. Saddam Hussein, we learned, did not possess WMDs, Osama bin Laden was still at large, and the 9/11 Commission had revealed systematic flaws in how the intelligence community gathered and analyzed information. The image of a chimp pitted against a myopic C.I.A. analyst felt like a good one. Forecasting, we concluded, must be a fool’s errand.

Right?

In the last decade the science of forecasting has made a huge comeback. Nate Silver has been instrumental in using the basic rules of probability and statistics to forecast events in sports and politics, but, more importantly, a growing number of academics have begun to study what makes good forecasters so effective. And the best way to tell this comeback story is Tetlock’s new book Superforecasting: The Art and Science of Prediction, co-authored with Dan Gardner.

In 2011, over five years after his original research project ended, Tetlock and his partner Barbara Mellers launched the Good Judgment Project. They invited anyone who wanted to join to sign up and start forecasting. Every week, thousands of participants answered questions like, “Will Serbia be officially granted European Union candidacy by 31 December 2011?” and “Will the London Gold Market Fixing price of gold (USD per ounce) exceed $1,850 on 30 September 2011?”

To understand how Tetlock and his team graded the answers is to get a glimpse into how forecasters think. Key metrics, “calibration” and “resolution,” measure not just accuracy but a forecaster’s ability to assign high probabilities to things that happen and low probabilities to things that don’t. If you’re a meteorologist and it rains 40% of the time and you forecast that it will rain 40% of the time, your calibration is perfect but your resolution is low. If, on the other hand, you forecast a 90% chance that Bernie Sanders will become president and he does, your resolution is high. It’s a constant tug-of-war between making the safe bet and making the right bet. Meteorologists, who usually have access to a century of reliable data, have it relatively easy.
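
To make those two yardsticks concrete, here is a minimal sketch of how a set of probabilistic forecasts might be scored. It is not taken from the book: the forecast data are invented and the grouping into probability bins is a simplifying assumption; the calibration and resolution terms follow the standard decomposition of the Brier score.

```python
from collections import defaultdict

# Each pair is (forecast probability, outcome), where outcome is 1 if the
# event happened and 0 if it did not. The numbers are made up for illustration.
forecasts = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 0), (0.4, 1), (0.1, 0), (0.2, 0), (0.6, 1)]

base_rate = sum(outcome for _, outcome in forecasts) / len(forecasts)

# Group forecasts that used (roughly) the same probability.
bins = defaultdict(list)
for prob, outcome in forecasts:
    bins[round(prob, 1)].append(outcome)

n = len(forecasts)
# Calibration: do events forecast at p% happen about p% of the time? (lower is better)
calibration = sum(len(o) * (p - sum(o) / len(o)) ** 2 for p, o in bins.items()) / n
# Resolution: do the forecasts separate events from non-events,
# rather than hugging the base rate? (higher is better)
resolution = sum(len(o) * (sum(o) / len(o) - base_rate) ** 2 for p, o in bins.items()) / n

print(f"base rate:   {base_rate:.2f}")
print(f"calibration: {calibration:.3f}  (0 is perfect)")
print(f"resolution:  {resolution:.3f}  (bigger means bolder and right)")
```

Hugging the base rate keeps calibration respectable but drives resolution toward zero, which is the tug-of-war described above.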

The book is about a small group of people, “superforecasters,” who consistently hit the sweet spot between calibration and resolution.

One of those people is Bill Flack, a retired fifty-five-year-old native Nebraskan who worked for the US Department of Agriculture. Flack answered roughly 300 questions, such as “In the next year, will any country withdraw from the Eurozone?” “Will North Korea detonate a nuclear device before the end of this year?” and “How many additional countries will report cases of the Ebola virus in the next eight months?” Of the thousands of other participants answering the same questions, Flack was in the top 2%.

Superforecasters, it turns out, are not geniuses, but they possess above-average intelligence; they pay close attention to the news but know what to ignore; and they like numbers but aren’t math whizzes. They’re intellectually humble foxes who crave different perspectives and encourage dissenting voices. As Tetlock said in a recent talk, they’re “willing to tolerate dissonance.”

They’re also good team players. According to Tetlock, superforecasters regularly used the Good Judgment Project’s open forum to share their thinking in order to improve it.

But, more than these traits, their secret weapon is a set of mental tools that helps them think clearly. Consider the Renzettis, a hypothetical family that lives in a small house. “Frank Renzetti is forty-four and works as a bookkeeper for a moving company. Mary Renzetti is thirty-five and works part-time at a day care. They have one child, Tommy, who is five.” Tetlock and Gardner ask: How likely is it that the Renzettis have a pet?

While it’s tempting to scrutinize the details of the Renzetti family for hidden clues, superforecasters like Bill Flack would first find out what percentage of American households own a pet—62%. From there, he would cautiously use what he knows about the Renzettis to adjust the initial 62% up or down. Daniel Kahneman calls this the “outside view,” which should always precede the “inside view.” Start with the base rate—how many households own a pet?—and then turn to the details of the Renzettis—how many households with one child have a pet?
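
One way to express that two-step habit—outside view first, inside view second—is as a simple Bayesian update. Here is a rough sketch: only the 62% base rate comes from the passage above; the likelihood numbers for the “inside view” are invented for illustration.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(hypothesis | evidence) via Bayes' rule."""
    numerator = prior * likelihood_if_true
    return numerator / (numerator + (1 - prior) * likelihood_if_false)

# Outside view: the base rate of pet ownership among US households.
prior = 0.62

# Inside view (invented numbers): suppose the details "small house, one
# five-year-old" are slightly more common among non-pet households than
# pet households, so the evidence nudges the estimate downward a little.
posterior = bayes_update(prior, likelihood_if_true=0.45, likelihood_if_false=0.55)

print(f"start at the base rate: {prior:.0%}")
print(f"after a cautious inside-view adjustment: {posterior:.0%}")
```

The point is not the particular numbers but the order of operations: anchor on the base rate, then nudge.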

Tougher were the questions for which historical data were unreliable or nonexistent, such as “Will either the French or Swiss inquiries find elevated levels of polonium in the remains of Yasser Arafat’s body?” Even though Flack wasn’t an expert in polonium, he had researched the story enough to raise his forecast from 60% to 65% when a Swiss autopsy team delayed announcing its findings. He reasoned that the delay suggested the Swiss team had detected polonium but had to conduct more tests to rule out lead, which naturally exists in the human body and produces polonium as it decays.

The promising new appeal of forecasting might seem incompatible to fans of Nassim Taleb, the author and philosopher who is responsible for putting the phrase “black swan” into common English parlance. Daily life is filled with events that comfortably fit under the classic bell curve. Most men are between five and six feet tall, a few are around four or seven feet tall, and even fewer are three or eight feet tall. The distribution of wealth, on the other hand, is fat-tailed, which means that even though the median household wealth is around $100,000, people like Bill Gates and Warren Buffett exist. It would be like walking past someone who is over 100 feet tall.
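
A quick simulation (my own sketch, not the authors’ or Taleb’s) makes the thin-tail/fat-tail contrast vivid: draw heights from a normal distribution and wealth from a Pareto distribution, then ask how much of the total the single largest observation accounts for. The distribution parameters are illustrative assumptions.

```python
import random

random.seed(0)
N = 100_000

# Thin-tailed: adult heights in feet, roughly normal around 5.75 ft.
heights = [random.gauss(5.75, 0.25) for _ in range(N)]

# Fat-tailed: wealth drawn from a Pareto distribution (alpha chosen for illustration).
alpha, scale = 1.5, 10_000
wealth = [scale * random.paretovariate(alpha) for _ in range(N)]

def share_of_largest(xs):
    """Fraction of the total accounted for by the single largest draw."""
    return max(xs) / sum(xs)

print(f"tallest person's share of all height: {share_of_largest(heights):.6%}")
print(f"richest person's share of all wealth: {share_of_largest(wealth):.2%}")
```

In the normal sample the largest value is a rounding error; in the Pareto sample a single person can own a noticeable slice of everything.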

Taleb’s point is that our world is much more fat-tailed than we think. From World War I to September 11th, the events that shaped history are distributed like wealth, not height. And because a hallmark of these “improbable but highly consequential events” is that they are impossible to predict (just as black swans were impossible for Europeans to predict before the discovery of Australia), forecasting is a fool’s errand. Should we go ahead and replace C.I.A. analysts with chimps?

Tetlock and Gardner’s answer to this question represents the sharpest section of Superforecasting. “We may have no evidence that superforecasters can foresee events like those of September 11, 2001,” they write, “but we do have a warehouse of evidence that they can forecast questions like: Will the United States threaten military action if the Taliban don’t hand over Osama bin Laden? Will the Taliban comply? Will bin Laden flee Afghanistan prior to the invasion? To the extent that such forecasts can anticipate the consequences of events like 9/11, and these consequences make a black swan what it is, we can forecast black swans.”

When Tetlock finished the manuscript, he asked Bill Flack what he thought about pundits like Tom Friedman who regularly dish out predictions. Flack said that even though the media is filled with poor forecasters—Friedman, like so many others, was convinced Saddam Hussein possessed WMDs—some commentators and journalists play an important role by making arguments that expose holes in his thinking.

Flack had one of the top forecasting records not just because he had the right tools. He succeeded because he embodied one of the oldest traditions in Western intellectual history. He was willing to admit what he didn’t know.

Living In A Post-Kahneman World

Imagine sitting in a laboratory with your brain connected to a computer. You’re in a rigid chair, waiting patiently, when a scientist walks into the room and offers you a deal. “In just a few seconds, I can upload everything psychologists know about human judgment, including a complete list of biases and how they undermine rational thinking, into your mind. All I need is your permission.”

There is a good argument for saying “No. Absolutely not.”

In the early 2000s, the psychologists Emily Pronin at Princeton and Lee Ross at Stanford conducted a series of studies that examined what happens when you ask people to evaluate themselves and then teach them about self-serving biases. Psychologists have known for years that most people believe they are above average in terms of just about every measurable trait—sociability, humor, intelligence, driving skills—but Pronin and Ross wanted to know if telling people about their egocentric habits would deflate their sense of self. It’s like saying, “OK, now that you know 95 percent of people believe they are above-average, would you like to amend anything you just said about yourself?”

Across many studies, Pronin and Ross found that self-ratings were unaffected by the news. It was as if someone from the Flat Earth Society was sent to the International Space Station, peeked out the window, and concluded that our planet was indeed flat.

It’s been four years since the Nobel Prize winner Daniel Kahneman, professor of psychology and public affairs at Princeton University, published Thinking, Fast and Slow, a book that documents nearly forty years of research on the many ways we make poor decisions. Kahneman, along with his late partner Amos Tversky, showed that, contrary to the economic model of human nature, we’re prone to a suite of biases that systematically distort how we perceive the world.

Given what we know about how people react when they learn about biases, it’s worth wondering if popular books outlining how we screw up, including Thinking, Fast and Slow, may not only fail to change behavior but even instill overconfidence. It’s very easy to conclude, just as the participants in Pronin and Ross’ study did, that learning about biases makes us immune to them, as if they are something we can permanently fix.

We used to think that the hard part of the question “How do I improve judgment?” had to do with understanding judgment. But it may have more to do with understanding the environment in which we make decisions. Many researchers now believe, to varying degrees, that in order to make better decisions, we’ve got to redesign the environment around our foibles instead of simply listing them.

That, at least, is one view Jack Soll (Duke), Katherine Milkman (Penn), and John W. Payne (Duke) endorse in a new working paper, “A User’s Guide to Debiasing.”

Soll and his colleagues begin by clarifying what they mean by “debiasing.” Although the researchers list several tools to help people overcome their limitations, they also insist that psychologists should focus less on achieving perfect rationality and more on modifying the environment to help people achieve their goals. “This approach accepts that there is bias,” the researchers write, “but strives to create situations in which a bias is either irrelevant or may even be helpful.”

Defaults, which leverage our tendency to opt for the path of least resistance, are one example. They have been used to increase flu vaccination rates and retirement savings. Some readers might be familiar with research from Eric Johnson and Dan Goldstein. In 2003, they found that the default on organ donation forms in several European countries dramatically influenced how many people became donors. Donation rates were as much as 90 percent higher in countries where the default was to donate.

Because we’re deeply swayed by how numbers and ratios are framed, the EPA has taken steps to help people understand fuel economy better. Although trading in a car that gets 25 MPG for a hybrid that gets 50 MPG might seem like a sizeable improvement, someone who swaps a gas-guzzling pickup truck that gets 10 MPG for a sedan that gets 15 MPG will save about 1.3 gallons more every hundred miles. MPG is not a linear metric—gains are much greater at the low end of the scale—yet most people perceive it that way. As a result, the EPA began including GPM (Gallons per Mile) in 2013 to help users make more informed decisions.
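
The arithmetic behind that claim is easy to verify by converting each vehicle’s MPG into gallons burned per 100 miles; the helper function below is just a sketch of that conversion, using the numbers from the paragraph above.

```python
def gallons_per_100_miles(mpg):
    """Convert miles per gallon to gallons burned per 100 miles."""
    return 100 / mpg

# Upgrade A: a 25 MPG car replaced by a 50 MPG hybrid.
savings_a = gallons_per_100_miles(25) - gallons_per_100_miles(50)   # 4.0 - 2.0 = 2.0

# Upgrade B: a 10 MPG pickup replaced by a 15 MPG sedan.
savings_b = gallons_per_100_miles(10) - gallons_per_100_miles(15)   # 10.0 - 6.7 = 3.3

print(f"25 -> 50 MPG saves {savings_a:.1f} gallons per 100 miles")
print(f"10 -> 15 MPG saves {savings_b:.1f} gallons per 100 miles")
print(f"difference: about {savings_b - savings_a:.1f} gallons per 100 miles")
```

Framed in gallons per 100 miles, the pickup-to-sedan swap is the obviously bigger win, even though its MPG gain looks smaller.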

“Planning prompts” help people follow through on their intentions by getting them to visualize themselves completing them. In the weeks leading up to Pennsylvania’s April 2008 presidential primary, a Harvard behavioral scientist named Todd Rogers scripted a phone call that went out to nearly 20,000 Democratic households in Pennsylvania. Compared to a control condition, in which Rogers simply encouraged voters to vote instead of prompting them to make a specific plan to vote, those in the experimental condition were four times more likely to go to the polls. Planning prompts have been used to help people in several other areas, such as dieting and scheduling colonoscopies, where willpower is notoriously unreliable.

It’s tempting to read “A User’s Guide to Debiasing” as evidence that human reason is deeply flawed. We might laugh at how easily fooled we are by something as important as the difference between miles per gallon and gallons per mile, or how something as trivial as defaults and planning can protect us from the flu. Pronin and Ross might be right. We’re blind to our blindness.

However, a better interpretation should begin with the assumption that even though we systematically screw up, we’re smart enough to accept our mistakes and account for them.

In The Checklist Manifesto, Atul Gawande writes about the difference between errors of ignorance—mistakes we make because we don’t have enough information—and errors of ineptitude—mistakes we make when we have enough information but don’t use it properly.

Under the economic model, in which people are assumed to easily understand confusing ratios and complicated statistics, misunderstanding MPGs was an error of ignorance. Now we know that conflating MPG with GPM is an error of ineptitude. The problem isn’t the people shopping for cars. It’s the designers at the EPA who printed those misleading labels.

There is a memorable scene in The Matrix where the protagonist, Neo, learns Kung Fu in a few seconds by downloading it into his brain. Although Neo “knows” Kung Fu he still requires hours of training to learn how to use it. The Matrix was released in 1999 but the scene embodies a trope that dates back at least to the Ancient Greeks. There is a constant tension in Western intellectual history between knowing-that and knowing-how, between acquiring knowledge and using it.

Soll and his colleagues show that this debate might be an antiquated one. If we want to live in a post-Kahneman world, we should spend more time reforming the environment and less time reforming ourselves. Diligently reading Thinking, Fast and Slow will only get us so far. The brain is not a computer to debug; it is a fixed feature that we must design the choice environment around.

Human-Centric Thinking: Why Behavioral Science is Not an Alternative

The fundamental premise of The Design of Everyday Things, first published in 1988, is that physical objects have “affordances.” Wheels are for turning. Slots are for inserting things into. If a door has a horizontal metal bar, we know to push.

Human failures, therefore, are usually design failures. Even though the shutdown button is next to the volume-up button on my MacBook Pro, I’ve never accidentally turned my laptop off. The shutdown button only works when I hold it down for a few seconds. As author Don Norman wrote, “The designer must assume that all possible errors will occur and design so as to minimize the chance of error in the first place.”

The wisdom of The Design of Everyday Things is no longer controversial. Of the 500 million iPhones sold since 2007, none came with an instruction manual, a testament to Apple’s unwavering dedication to the idea that the design of a product should afford how users will use it—an iPhone perfectly fits the palm of a hand for a reason. Phrases such as “User Experience,” “Design Thinking,” and “Human-Centric Design” have not just entered the business lexicon. They’ve become cringeworthy clichés, ripe for satire on HBO’s hit sitcom “Silicon Valley.”

However, if design principles have sprinted into business culture, behavioral science insights have crawled. Although it’s true that governments are beginning to incorporate behavioral science, popular psychology books consistently fill bestseller lists, and people like Daniel Kahneman, Richard Thaler, and Dan Ariely have become minor celebrities, we’re still living in a world where many products and services ignore the basic principles of human cognition, much like pre-Apple mobile phones—flip phones—ignored the basic principles of good design. Even when we think about how we think, we use outdated notions of human judgment and jargon such as “Human-Centric Thinking” to talk about how real humans behave in real environments.

Why?

Unfortunately the standard economic model, in which people are rational optimizers with infinite willpower and stable preferences, is still the default. When we design things like restaurant menus or the interface for health care exchanges, we assume users will objectively calculate each variable. As a result, we tend to treat behavioral science as an alternative and nudges and choice architecture interventions as deviations. In this view, a cognitive bias is an error or lapse in judgment, when in fact it more likely reflects a problem with the environment and our assumptions about human nature.

If we’re willing to accept what decades of psychological research reveal about how we actually decide, we’ll better appreciate the fact that Human-Centric Thinking is not a deviation but the norm. Just as smart designers have accounted for accidentally deleted Word documents by incorporating an undo option instead of blaming the user, smart choice architects should account for our natural cognitive preferences, such as our hatred of uncertainty, by catering to them instead of labeling them as biased or irrational. Uber has made a fortune not by reinventing the concept of a taxi but by eliminating the psychological pain associated with not knowing if a cab is available or when it will arrive.

Another reason relates to visibility. According to Norman, a good design should visibly convey the correct message. If a door is meant to be pushed, the designer must provide signals that show users where to push. It’s much harder to incorporate smart psychological improvements, such as mirrors in an elevator to reduce boredom and therefore make the ride seem faster, because the fast and automatic mind (what Kahneman refers to as “System 1”) does a pitiful job of explaining itself. It would be nearly impossible for anyone to consciously detect how mirrors change the perceived duration of an elevator ride, which is why we must not only study the nuances of human cognition but also test them.

Take the experience of waiting, for example. When I renewed my license last week at the DMV in Manhattan, I was delighted to watch a big screen scroll through the numbers ahead of me. It was reassuring to know my place in line and to see that I was actually moving through it. Those with a traditional background in business would fix a problem like long lines by focusing on reducing the wait time instead of improving the experience of waiting—waiting 30 minutes isn’t that bad when you know that you’re going to wait for 30 minutes, even in the soul-crushing halls of a DMV. Certainty might not be visible, but it is comfortable.

The good news is waiting is one corner of behavioral science that some businesses and organizations have mastered. Writing for The Times, Alex Stone reports that when the Houston airport reduced average wait time at the baggage claim to eight minutes (which was “well within industry benchmarks”) complaints persisted. So airport executives decided to move the arrival gates further away from baggage claim and reroute bags to the outermost carousel. Passengers had to walk six times longer, and complaints virtually disappeared.

If one area has done an especially dreadful job of using psychology to change behavior, dieting would be a good candidate. Nearly 80 percent of diet resolutions end in failure because they rely on human willpower, a notoriously feeble device. Dieting is of course so difficult because dieters have to resist temptation throughout the day—whereas waiting at baggage claim happens once and is mostly automatic—yet there are still much better commitment methods.

Consider Ramadan. It is incredibly effective not only because it relies on a simple rule—don’t eat or drink during the day—but also because it’s social. Millions of Muslims fast because the social norm polices itself; the crowd supports the individual. Why not make dieting social?

But perhaps the most psychologically unfriendly aspects of modern life are environments with too much information. Last week, when I traveled from New York City to Minneapolis, a TSA officer listed items I needed to remove before entering the full-body scanner. I diligently patted my pants, knowing that I’d probably miss a flattened receipt in my back pocket. On the return trip, a TSA officer in Minneapolis simply asked everyone to pretend that they were about to put their clothes in the wash. It was a great heuristic because it was simple and relatable; it elicited an image that was easy to mimic. Why not replace complex directions with pithy rules of thumb?

When you think about the baggage claim at Houston’s airport, it’s worth wondering if the mind affords to be nudged just as physical objects afford to be pushed or pulled. In this view, frustration at the baggage claim may not be a sign of impatience and vice but of poor choice architecture. The problem with neoclassical economics is therefore not only conceptual. It has caused us to push and pull the mind in the wrong directions. Once we design around human cognition instead of listing its foibles, we can begin to live in a world that’s more mindful of the mind.

The Myth of Perfect Information

I want to tell you about some research that will change the way you think about thinking.

Imagine you’re about to interview someone for an important job. Your colleague informs you the candidate is intelligent, industrious, impulsive, critical, stubborn, and envious. You might picture someone who knows what he wants. He might be occasionally impatient and forceful, but he is hard working and ambitious. He puts his intelligence to good use.

Now imagine you’re about to interview someone else for the same job. This time, your colleague tells you the candidate is envious, stubborn, critical, impulsive, industrious, and intelligent. You might picture someone with a “problem.” Although he is intelligent, the candidate is prone to moments of rage and jealousy. His bad qualities will surely overshadow his lighter side.

In 1946, the American psychologist Solomon Asch gathered 58 participants and split them into two groups. The first group read about a person who was intelligent, industrious, impulsive, critical, stubborn, and envious. The second group read about the same person but with a twist: Asch reversed the order of the qualities, and the participants imagined an entirely different person. Some qualities that people in the first group perceived as positive (impulsive and critical) were perceived as negative by the second group.

Asch was not the first person to notice that we make unreliable snap judgments based on limited information. Just about every philosopher and writer has commented on our malleable social intuitions.

Asch was one of the first scientists to empirically show that there is no such thing as neutral information. Even though his experiment revealed a quirk in how we evaluate other people—the study was published in a journal dedicated to social psychology—his findings apply to nearly every aspect of life. How information is ordered and how it is framed will invariably influence our judgment one way or another.

For instance, we tend to judge the length of a bike ride from Maine to Florida as shorter than the length of a bike ride from Florida to Maine, as if gravity helps us on the way down. We’re more likely to order expensive beer when it is placed next to lite beer, yet we’re more partial to lite beer when it is placed next to a “premium” cheap beer. A $60,000 salary feels different in a company where everyone makes $80,000 versus $40,000. If I tell you a painkiller costs $2.50, it will reduce pain more than if I tell you it costs $0.10; how effective a medicine will be depends critically on how effective you think it is.

Even when we process a single piece of information—imagine someone only telling you a candidate is “intelligent” or having only one beer to select from—the information will not be neutral. Without other reference points, we’ll evaluate the same trait or price differently.

It’s worth pausing to appreciate this insight. In Thinking, Fast and Slow, Daniel Kahneman discusses cognitive biases as they relate to the economic standard of rationality. In this view, a bias is a deviation. It’s what happens in the checkout lane, on a trading floor, or during fourth down.

The implication of Asch’s study is that the idea of a neutral choice environment, in which the layout of a menu or the font of an email does not sway the reader one way or another, is a myth. In this view, no matter how hard you flex your cognitive muscles, you will never process information without distorting it, not just because the mind is biased, but because the information is biased as well.

The lesson for anybody who depends on customers should be obvious. Be mindful of how you present the facts; they will nudge customers in some way. Williams-Sonoma once boosted the sales of a $279 breadmaker simply by placing it next to a somewhat bigger model priced at $429. We’re more likely to buy a $200 printer with a $25 rebate than the same printer priced at $175. Despite what you heard in economics class, consumers really don’t know what most goods should cost.

The second lesson is for everyone else. If you’re still wondering if there is such a thing as neutral information, good. The moral of Thinking, Fast and Slow and every other book in that aisle is not that we occasionally mess up. It’s that a dose of epistemic humility can go a long way.

The surreptitious thing about the human brain is that it convinces us we see the world as it is. It’s almost as if the brain and the mind have a contractual relationship, in which the mind has agreed to believe the worldview the brain creates, but in return the brain has agreed to create a worldview the mind wants. La Rochefoucauld was right: “Nothing can comfort us when we are deceived by our enemies and betrayed by our friends; yet we are often happy to be deceived and betrayed by ourselves.”