Prediction Failure in the Big Data Era

Causality, not Correlation, is where Meaning LIVES

[Excerpt from The Signal and the Noise by Nate Silver, 2012]

“Why Most Published Research Findings Are False” is an influential paper published by John P.A. Ioannidis in 2005. It cites a variety of statistical and theoretical arguments to claim that the majority of hypotheses deemed true in medical journals, and in most other academic and scientific professions, are in fact false.

Ioannidis’s hypothesis looks to be one of the true ones: Bayer Labs found that they could not replicate about two-thirds of the positive findings claimed in medical journals when they attempted the experiments themselves. The failure rate for predictions made in entire fields, ranging from seismology to political science, appears to be extremely high.

Most research papers are not really contributing much to generating knowledge.
— John P.A. Ioannidis, Medical Researcher, Stanford School of Medicine

This is why our predictions may be more prone to failure in the era of Big Data. With an exponential increase in the quantity of available information, there is likewise an exponential increase in the number of hypotheses to investigate. If you want to test for a relationship between every pair of the 45,000 economic statistics the U.S. government publishes (for example, ‘Is there a causal relationship between the bank prime loan rate and Alabama’s unemployment rate?’), you’re faced with literally 1 billion hypotheses to test.
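The "1 billion" figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: the count of 45,000 statistics comes from the text, while the 5% significance level used in the second half is an assumption added for illustration and is not stated in the excerpt.

```python
import math

# From the text: the number of economic statistics the U.S. government publishes.
N_STATISTICS = 45_000

# Number of distinct unordered pairs of statistics, i.e. "45,000 choose 2".
n_pairs = math.comb(N_STATISTICS, 2)
print(f"Pairwise hypotheses to test: {n_pairs:,}")

# ASSUMPTION (not in the source): each pair is tested at a 5% significance level.
# If none of the relationships were real, about 5% of the tests would still
# come back "significant" by chance alone.
ALPHA = 0.05
expected_chance_hits = round(n_pairs * ALPHA)
print(f"Expected chance 'findings' at alpha={ALPHA}: {expected_chance_hits:,}")
```

The pair count comes out slightly above one billion, which matches the figure quoted above.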

The meaningful relationships within the data, those that speak to causality rather than correlation and testify to how the world really works, are far fewer, and their number is not likely to be increasing at nearly so fast a rate as the information itself.

There isn’t any more truth in the world than there was before the internet or the printing press. Most data is just noise, as most of the universe is filled with empty space.
— Nate Silver, American statistician and writer

Meanwhile, when the underlying incidence of something is low in the population [breast cancer in young women; truth in a sea of data], false positives can dominate the results if we’re not careful. “Accounting for a couple of million papers written in the last 20 years, there are obviously not a couple of million discoveries,” Ioannidis told me. Figure 8-6 represents this graphically, with 80% of true scientific hypotheses correctly deemed true and about 90% of false hypotheses correctly rejected. And yet, because true findings are so rare, about two-thirds of the findings deemed to be true are actually false! Unfortunately, as Ioannidis figured out, the state of published research in most fields that conduct statistical testing is probably very much like what you see in this chart.
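The arithmetic behind this claim can be sketched in Python. The 80% and 90% rates come from the paragraph above; the share of tested hypotheses that are actually true is not given in this excerpt, so the values tried below are assumptions. The function name is hypothetical and chosen only for this illustration.

```python
def share_of_positives_that_are_true(prior, true_positive_rate, false_positive_rate):
    """Fraction of hypotheses deemed true that really are true (Bayes' rule)."""
    true_positives = prior * true_positive_rate
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Rates taken from the text.
TRUE_POSITIVE_RATE = 0.80   # true hypotheses correctly deemed true
FALSE_POSITIVE_RATE = 0.10  # 90% of false hypotheses are correctly rejected

# ASSUMPTION: illustrative base rates of true hypotheses, not from the source.
for prior in (0.50, 0.10, 1 / 17, 0.01):
    share_true = share_of_positives_that_are_true(
        prior, TRUE_POSITIVE_RATE, FALSE_POSITIVE_RATE
    )
    print(f"prior={prior:.3f}  true among positives={share_true:.2f}  "
          f"false among positives={1 - share_true:.2f}")
```

Under these rates, "about two-thirds of positive findings are false" corresponds to roughly 1 in 17 tested hypotheses (about 6%) being true. That base rate is inferred from the quoted numbers, not read off the figure.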

[Image: Figure 8-6 from The Signal and the Noise]

Why is the error rate so high?

To some extent, this entire book represents an answer to THAT question.

There are many reasons for it, some having to do with 1) our psychological biases, 2) common methodological errors, and 3) misaligned incentives.

Close to the root of the problem is a flawed type of statistical thinking that researchers unconsciously apply.

To discover more about how Nate Silver uses hard numbers/statistical analysis to tell compelling stories about elections, politics, sports, science, economics plus more, visit:

I am a UX, Brand & Cultural Consultant anchored in leading multi-discipline business stakeholders or agency teams in creating commercial success through the adoption of big ideas supported by compelling experiences that spark discovery, connection & deeper relationships between people.

Ann O