“The average expert was roughly as accurate as a dart-throwing chimpanzee.”
That’s the punch line to a study that lasted twenty years. In 1984 Philip Tetlock, then an Associate Professor at the University of California, Berkeley, gathered nearly 300 experts and asked them to make predictions about the economy, wars, elections, and other issues of the day. Two decades later, Tetlock collected the data and measured the results. The average expert would have done just as well by randomly guessing—hence the dart-throwing chimp.
The punch line is memorable but misleading. It turned out that the experts performed better than chance in the short term. It was questions about what would happen in three to five years that exposed their feeble powers of prediction. No one foresaw the collapse of the Soviet Union.
Tetlock also found that “foxes” performed better than “hedgehogs.” The distinction comes from a 1953 essay by Isaiah Berlin, “The Hedgehog and the Fox,” that popularized an aphorism by the ancient Greek poet Archilochus—“A fox knows many things, but a hedgehog one important thing.” Tetlock says that foxes are like a dragonfly’s eye, constantly aggregating dozens of different perspectives to create a coherent view of the world.
Unfortunately, when Tetlock published his findings in Expert Political Judgment in 2005 (“the most comprehensive assessment of expert judgment in the scientific literature”), the dart-throwing chimp line was what stuck. The timing didn’t help. Saddam Hussein, we had learned, did not possess WMDs, Osama bin Laden was still at large, and the 9/11 Commission had revealed systematic flaws in how the intelligence community gathered and analyzed information. The image of a chimp pitted against a myopic C.I.A. analyst felt apt. Forecasting, we concluded, must be a fool’s errand.
In the last decade, the science of forecasting has made a huge comeback. Nate Silver has been instrumental in applying the basic rules of probability and statistics to forecast events in sports and politics, but, more important, a growing number of academics have begun to study what makes good forecasters so effective. The best account of this comeback is Tetlock’s new book, Superforecasting: The Art and Science of Prediction, co-authored with Dan Gardner.
In 2011, more than five years after his original research project ended, Tetlock and his partner Barbara Mellers launched the Good Judgment Project. They invited anyone interested to sign up and start forecasting. Every week, thousands of participants answered questions like “Will Serbia be officially granted European Union candidacy by 31 December 2011?” and “Will the London Gold Market Fixing price of gold (USD per ounce) exceed $1,850 on 30 September 2011?”
To understand how Tetlock and his team graded the answers is to glimpse how forecasters think. Two key metrics, “calibration” and “resolution,” measure more than raw accuracy. Calibration asks whether a forecaster’s probabilities match reality: of all the events a forecaster rates at 40%, roughly 40% should happen. Resolution asks whether the forecaster is decisive, assigning high probabilities to things that happen and low probabilities to things that don’t. If you’re a meteorologist in a city where it rains on 40% of days and you forecast a 40% chance of rain every day, your calibration is perfect but your resolution is low. If, on the other hand, you forecast a 90% chance that Bernie Sanders will become president and he does, your resolution is high. It’s a constant tug-of-war between making the safe bet and making the right bet. Meteorologists, who usually have access to a century of reliable data, have it relatively easy.
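The review doesn’t spell out how the two metrics are computed. One standard way to make them precise is the Murphy decomposition of the Brier score, the squared-error score the book uses to grade forecasts: Brier = calibration error − resolution + uncertainty. The sketch below assumes that decomposition and uses ten invented days of weather, so treat it as an illustration of the idea, not as the Good Judgment Project’s actual scoring code. The function name `brier_decomposition` is our own.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Split the Brier score into calibration error, resolution, and uncertainty.

    forecasts: probabilities in [0, 1]; outcomes: 1 if the event happened, else 0.
    Forecasts are grouped by their exact value (Murphy's decomposition), so
    brier == calibration_error - resolution + uncertainty.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    calibration_error = 0.0  # lower is better: stated vs. observed frequency
    resolution = 0.0         # higher is better: how far groups stray from the base rate
    for f, obs in groups.items():
        observed = sum(obs) / len(obs)
        calibration_error += len(obs) * (f - observed) ** 2
        resolution += len(obs) * (observed - base_rate) ** 2
    calibration_error /= n
    resolution /= n
    uncertainty = base_rate * (1 - base_rate)  # fixed by the weather, not the forecaster
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return brier, calibration_error, resolution, uncertainty


# Ten hypothetical days; it rains on four of them.
rain = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
safe = [0.4] * 10                          # always forecast the base rate
bold = [0.9 if r else 0.1 for r in rain]   # confident and always on the right side

for name, fc in [("safe", safe), ("bold", bold)]:
    bs, cal, res, unc = brier_decomposition(fc, rain)
    print(f"{name}: Brier={bs:.2f} calibration_error={cal:.2f} resolution={res:.2f}")
# prints:
# safe: Brier=0.24 calibration_error=0.00 resolution=0.00
# bold: Brier=0.01 calibration_error=0.01 resolution=0.24
```

The “safe” forecaster is perfectly calibrated but has zero resolution; the “bold” one gives up a sliver of calibration and gains a great deal of resolution, which is why its Brier score (lower is better) is far smaller.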
The book is about a small group of people, “superforecasters,” who consistently hit the sweet spot between calibration and resolution.
One of those people is Bill Flack, a retired fifty-five-year-old native Nebraskan who worked for the US Department of Agriculture. Flack answered roughly 300 questions, such as “In the next year, will any country withdraw from the Eurozone?” “Will North Korea detonate a nuclear device before the end of this year?” and “How many additional countries will report cases of the Ebola virus in the next eight months?” Among the thousands of participants answering the same questions, Flack ranked in the top 2%.
Superforecasters, it turns out, are not geniuses, but they have above-average intelligence; they pay close attention to the news but know what to ignore; and they like numbers but aren’t math whizzes. They’re intellectually humble foxes who crave different perspectives and encourage dissenting voices. As Tetlock said in a recent talk, they’re “willing to tolerate dissonance.”
They’re also good team players. According to Tetlock, superforecasters regularly used the Good Judgment Project’s open forum to share their thinking in order to improve it.
But, more than these traits, their secret weapon is a set of mental tools that helps them think clearly. Consider the Renzettis, a hypothetical family that lives in a small house. “Frank Renzetti is forty-four and works as a bookkeeper for a moving company. Mary Renzetti is thirty-five and works part-time at a day care. They have one child, Tommy, who is five.” Tetlock and Gardner ask: How likely is it that the Renzettis have a pet?
While it’s tempting to scrutinize the details of the Renzetti family for hidden clues, a superforecaster like Bill Flack would first find out what percentage of American households own a pet: 62%. From there, he would cautiously use what he knows about the Renzettis to adjust that 62% up or down. Daniel Kahneman calls the first step the “outside view,” which should always precede the “inside view.” Start with the base rate (how many households own a pet?), then turn to the particulars of the Renzettis (how many households with one child own a pet?).
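The book describes the adjustment step informally. One simple way to formalize “start with the base rate, then adjust” is Bayes’ rule in odds form, sketched below. The odds-form update, the function name `adjust_from_base_rate`, and the likelihood ratio of 1.3 are all our assumptions for illustration; only the 62% base rate comes from the text.

```python
def adjust_from_base_rate(base_rate, likelihood_ratio):
    """Start from the outside view (a base rate) and nudge it with one piece
    of case-specific evidence, using Bayes' rule in odds form.

    likelihood_ratio: how much more common the evidence is among households
    that own a pet than among those that don't (a hypothetical input here).
    """
    prior_odds = base_rate / (1 - base_rate)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


outside_view = 0.62  # share of American households with a pet, per the book
# Hypothetical: suppose households with a young child are 1.3 times as common
# among pet owners as among non-owners. This number is invented.
inside_view = adjust_from_base_rate(outside_view, 1.3)
print(f"{inside_view:.2f}")  # prints 0.68
```

Because the update multiplies the base-rate odds rather than replacing them, a modest piece of evidence moves the estimate only modestly, which matches the cautious adjustment the passage describes. A likelihood ratio of 1 leaves the 62% untouched.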
Most questions were tougher, because historical data for them were either unreliable or nonexistent. Take “Will either the French or Swiss inquiries find elevated levels of polonium in the remains of Yasser Arafat’s body?” Flack was no expert on polonium, but he had researched the story enough to raise his forecast from 60% to 65% when a Swiss autopsy team delayed announcing its findings. He reasoned that the delay suggested the Swiss team had detected polonium but needed more tests to rule out lead, which naturally exists in the human body and produces polonium as it decays.
The revival of forecasting might not sit well with fans of Nassim Taleb, the author and philosopher responsible for bringing the phrase “black swan” into common English parlance. Much of daily life fits comfortably under the classic bell curve. Most men are between five and six feet tall, a few are around four or seven feet tall, and even fewer are three or eight feet tall. The distribution of wealth, on the other hand, is fat-tailed: even though the median household wealth is around $100,000, people like Bill Gates and Warren Buffett exist. If height were distributed like wealth, meeting them would be like walking past someone who is over 100 feet tall.
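To see the difference Taleb has in mind, the sketch below draws simulated “heights” from a normal distribution and simulated “wealth” from a Pareto distribution (a standard textbook example of a fat tail), then compares the largest value with the median. The parameters (a mean of 69 inches with a standard deviation of 3, and a Pareto tail index of 1.16) are assumptions chosen for illustration, not estimates fitted to real data, and `max_to_median` is a helper we made up.

```python
import random
import statistics


def max_to_median(samples):
    """How many times larger the biggest value is than the typical one."""
    return max(samples) / statistics.median(samples)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 100_000

# Thin-tailed: heights in inches, normal with an assumed mean of 69 and sd of 3.
heights = [rng.gauss(69, 3) for _ in range(n)]

# Fat-tailed: wealth drawn from a Pareto distribution with an assumed
# tail index of 1.16 (illustrative, not fitted to real data).
wealth = [rng.paretovariate(1.16) for _ in range(n)]

print(f"height: max is {max_to_median(heights):.2f}x the median")
print(f"wealth: max is {max_to_median(wealth):,.0f}x the median")
```

In runs like this, the tallest simulated person comes out only modestly taller than the median, while the richest ends up hundreds of times richer or more. A single extreme draw dominates the fat-tailed sample in a way that never happens with heights.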
Taleb’s point is that our world is much more fat-tailed than we think. From World War I to September 11th, the events that shaped history are distributed like wealth, not height. And because a hallmark of these “improbable but highly consequential events” is that they are impossible to predict (just as Europeans could not have predicted black swans before they found them in Australia), forecasting, on this view, is a fool’s errand. Should we go ahead and replace C.I.A. analysts with chimps?
Tetlock and Gardner’s answer to this question represents the sharpest section of Superforecasting. “We may have no evidence that superforecasters can foresee events like those of September 11, 2001,” they write, “but we do have a warehouse of evidence that they can forecast questions like: Will the United States threaten military action if the Taliban don’t hand over Osama bin Laden? Will the Taliban comply? Will bin Laden flee Afghanistan prior to the invasion? To the extent that such forecasts can anticipate the consequences of events like 9/11, and these consequences make a black swan what it is, we can forecast black swans.”
When Tetlock finished the manuscript, he asked Bill Flack what he thought about pundits like Tom Friedman who regularly dish out predictions. Flack said that even though the media is filled with poor forecasters (Friedman, like so many others, was convinced Saddam Hussein possessed WMDs), some commentators and journalists play an important role by making arguments that expose holes in his thinking.
Flack compiled one of the top forecasting records not just because he had the right tools but because he embodied one of the oldest traditions in Western intellectual history: he was willing to admit what he didn’t know.