Science has always hinged on the idea that researchers must be able to prove and reproduce the results of their research. Simply put, that is what makes science...science. Yet in recent years, as computing power has increased, the cloud has taken shape, and datasets have grown, a problem has appeared: it has become increasingly difficult to generate the same results consistently, even when researchers use the same dataset.
"One basic requirement of scientific results is reproducibility: shake an apple tree, and apples will fall downwards each and every time," observes Kai Zhang, an associate professor in the department of statistics and operations research at The University of North Carolina, Chapel Hill. "The problem today is that in many cases, researchers cannot replicate existing findings in the literature and they cannot produce the same conclusions. This is undermining the credibility of scientists and science. It is producing a crisis."
The problem is so widespread that it is now attracting attention at conferences and in academic papers, and is even garnering notice in the mainstream press. While a number of factors contribute to the problem, including experimental errors, publication bias, the improper use of statistical methods, and subpar machine learning techniques, the common denominator is that researchers are finding patterns in data that have no relationship to the real world. As Zhang puts it, "The chance of picking up spurious signals is higher as the nature of data and data analysis changes."
At a time when anti-science sentiment is growing and junk science is flourishing, the repercussions are potentially enormous. If results cannot be trusted, then the entire nature of research and science comes into question, experts say. What is more, all of this is taking place at a time when machine learning is emerging at the forefront of research. A lack of certainty about the validity of results could also lead people to question the value of machine learning and artificial intelligence.
A simple but disturbing fact is at the center of this problem. Researchers are increasingly starting with no hypothesis and then searching, some might say grasping, for meaningful correlations in data. If the data universe is large enough, and this is frequently the case, there are reasonably good odds that a seemingly significant p-value will appear by sheer chance. Consider: if a person tosses a coin eight times and it lands on heads every time, this is noteworthy; however, if a person tosses a coin 8,000 times and, at some point, the coin lands on heads eight consecutive times, what might appear to be a significant discovery is merely a random event.
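The coin-toss intuition is easy to verify by simulation. The following stdlib-only sketch (the function names are illustrative, not from any cited study) estimates how often a run of eight heads appears in eight flips versus somewhere in 8,000 flips:

```python
import random

def has_run_of_heads(n_flips, run_len, rng):
    """True if a fair-coin sequence of n_flips contains run_len consecutive heads."""
    streak = 0
    for _ in range(n_flips):
        if rng.random() < 0.5:  # heads
            streak += 1
            if streak >= run_len:
                return True
        else:
            streak = 0
    return False

def estimate(n_flips, run_len, trials=10_000, seed=0):
    """Monte Carlo estimate of the probability of seeing the run."""
    rng = random.Random(seed)
    hits = sum(has_run_of_heads(n_flips, run_len, rng) for _ in range(trials))
    return hits / trials

# Eight heads in eight flips is rare (theoretical probability 1/256 ~ 0.004)...
p_short = estimate(8, 8)
# ...but a run of eight heads somewhere in 8,000 flips is almost guaranteed.
p_long = estimate(8000, 8)
```

The same arithmetic governs hypothesis searches: test enough candidate patterns and some will clear any significance threshold by chance alone.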
The idea that scientific outcomes may be inaccurate or useless is not new. In 2005, John Ioannidis, a professor of health research and policy at Stanford University, published an academic paper titled Why Most Published Research Findings Are False in the journal PLOS Medicine. It put the topic of reproducibility of results on the radar of the scientific community. Ioannidis took direct aim at methodologies, study design flaws, and biases. "Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true," he wrote in that paper.
Others took notice. In 2011, Glenn Begley, then head of the oncology division at biopharmaceutical firm Amgen, decided to see if he could reproduce results for 53 foundational papers in oncology that appeared between 2001 and 2011. In the end, he found he could replicate results for only six papers, despite using datasets identical to the originals. That same year, a study by German pharmaceutical firm Bayer found only 25% of studies were reproducible.
This is a topic Ioannidis and others have continued to examine, particularly as the pressure to produce useful studies grows. Says Ioannidis, "Today, we have opportunities to collect and analyze massive amounts of data. Along with this, we have a larger degree of freedom about how we collect data, how we assemble it, and how we interpret it." The challenge, then, is to design a methodology around a hypothesis, and then test it with the available data, or use other valid statistical methods when the number of potential hypotheses tested is extremely large. The need for proper methodologies is magnified in an era where data is easily collected and widely available.
Ioannidis expresses concern about the trend toward an "exploration and pattern-recognition approach" that requires little or no planning, and often uses little or no validation of results. Increasingly, he says, researchers resort to the backwards method of using machine learning to identify a hypothesis, rather than starting with one. The data may contain apparent patterns, but they are like the coin landing on heads eight times in a row somewhere in 8,000 flips, rather than eight times out of eight. "The approach often reflects the mentality that 'something interesting must be there with all the riches of huge datasets'," Ioannidis explains.
Not surprisingly, machine learning can amplify errors and distortions. Inconsistent training methods and poorly designed statistical frameworks lead to patterns and correlations that have no validity or link to causality in the real world. An emerging problem is a lack of understanding about how to use machine learning tools correctly. In fact, a growing number of commercial applications—particularly those designed for the business world—put enormous analytics and machine learning capabilities in the hands of non-data scientists.
Although the reproducibility problem spans virtually every scientific discipline, it is particularly problematic in medicine, where results are frequently unreproducible and the repercussions are greater. Experts say the research community must address the challenge and find fixes because this not only erodes public confidence, it wastes time, money, and valuable resources, all while generating greater confusion about which drugs, therapies, and procedures actually work. Says Ioannidis, "There are potential repercussions—and they can be quite devastating—if doctors make wrong choices based on inaccurate data or study results."
Brian Nosek, co-founder and executive director of the non-profit Center for Open Science in Charlottesville, VA, says that if there is a crisis, the current situation represents a "crisis of confidence." Greater degrees of freedom along with motivated reasoning can lead researchers unintentionally down paths that produce less-than-credible findings.
Nosek says it is necessary to reexamine the way researchers approach studies at the most basic level. Among other things, this means emphasizing reproducibility as a key requirement for publication, openly sharing data and code so that methodologies and results can be validated by others in the research community, and promoting transparency about funding and affiliations. In pursuit of this goal, the Open Science Framework (OSF) now offers an online repository where researchers can register studies and allow others to examine the supporting data, materials, and code after the research is complete.
A number of other factors also are crucial to boosting the accuracy and validity of findings. Glenn Begley has observed that six key questions lie at the center of sound research and reproducibility:
- Were studies blinded?
- Were all results shown?
- Were experiments repeated?
- Were positive and negative controls shown?
- Were reagents validated?
- Were the statistical tests appropriate?
By boosting due diligence upfront, Begley argues, it is possible to ensure a much higher level of veracity and validity to research results. The same techniques also apply to analytics and machine learning in business and industry, where users often lack the scientific grounding to ensure the methods they use are sound.
In the scientific community, greater scrutiny can also take the form of more rigorous peer review and greater oversight from journals. In some cases, researchers are publishing results that haven't been reviewed at all; they essentially are rubber-stamping their own work. This has contributed to an increased number of retractions and corrections in journals. The Journal of Medical Ethics, for example, documented a 10-fold increase in retractions of scientific papers in the PubMed database between 2000 and 2009 alone.
More rigorous statistical methodologies, as well as better use of machine learning, are also critical. As a result, researchers are studying ways to improve analysis. For example, instead of conducting exploratory data analysis on an entire dataset, researchers might use data splitting: separating the data into a training dataset and a test dataset, and keeping the test dataset hidden until the end, when it is used to check whether the results generalize.
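The splitting step itself is simple; the discipline lies in never looking at the held-out portion while exploring. A minimal stdlib-only sketch (the dataset size and function name are hypothetical):

```python
import random

def split_indices(n, test_fraction=0.5, seed=42):
    """Shuffle row indices and split them into exploration and confirmation sets.

    The confirmation (test) indices must stay untouched until a single
    hypothesis has been fixed on the exploration half.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(n * (1 - test_fraction))
    return idx[:cut], idx[cut:]

# Hypothetical dataset of 100 rows: explore freely on train_idx,
# then run the one pre-registered test, once, on test_idx.
train_idx, test_idx = split_indices(100)
```

The design choice that matters is fixing the split (and the seed) before any analysis begins, so the test half cannot quietly influence which hypothesis gets tested.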
Another approach involves taking an original training dataset and randomizing it in a way that mimics future datasets by adding random noise repeatedly. If researchers can aggregate all the results and the discovery remains stable (meaning it appears across many different randomized datasets), then it's likely to be reproducible.
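One way to picture this stability check, as a rough stdlib-only sketch (the threshold, noise level, and toy data are illustrative assumptions, not from a specific study): perturb the data many times with random noise, re-run the discovery procedure on each copy, and report the fraction of copies in which the finding survives.

```python
import random
import statistics

def correlation(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def stability(xs, ys, noise_sd=0.1, reps=200, threshold=0.3, seed=0):
    """Fraction of noise-perturbed copies of the data in which the
    'discovery' (a correlation above threshold) still appears."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        noisy_ys = [y + rng.gauss(0, noise_sd) for y in ys]
        if correlation(xs, noisy_ys) > threshold:
            hits += 1
    return hits / reps

# A genuinely linear relationship survives the repeated perturbations,
# yielding a stability score near 1.0.
xs = [i / 10 for i in range(50)]
ys_real = [2 * x + 1 for x in xs]
stable_score = stability(xs, ys_real)
```

A spurious correlation dredged from noise would, by the same procedure, appear in only a small fraction of the perturbed copies, flagging it as unlikely to reproduce.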
Although the inability to reproduce scientific results has grown in recent years, observers say most researchers strive for accurate findings and the problem is largely solvable. Enormous datasets and the widespread use of machine learning are relatively new additions to mainstream science, and it is simply a matter of time before more stringent methodologies emerge, they say.
"We are in the midst of a reformation. The research community is identifying challenges to reproducibility and implementing a variety of solutions to improve. It is an exciting time, not a worrying one," Nosek argues.
Zhang also says there's no reason to push the panic button; scientific methods are messy, difficult, and iterative. "We need to embrace changes. We need to be more selective and careful about avoiding mistakes that lead to irreproducible results and invalid conclusions. Right now, this crisis represents enormous opportunities for statisticians, data scientists, computer scientists, and others to develop a more robust framework for research."
Adds Ioannidis, "I'm optimistic that we will find ways to solve the problem of irreproducibility. We will learn how to use today's tools more effectively, and come up with better methodologies. But it's something we must confront and address."
Ioannidis, J.P.A.
Why Most Published Research Findings Are False, PLOS Medicine, Aug. 30, 2005. https://doi.org/10.1371/journal.pmed.0020124
Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L.
Valid Post-Selection Inference, The Annals of Statistics, 2013, Vol. 41, No. 2, pp. 802–837. https://projecteuclid.org/euclid.aos/1369836961
Aschwanden, C.
Science Isn't Broken: It's Just a Hell of a Lot Harder Than We Give It Credit For, FiveThirtyEight, Aug. 19, 2015. https://fivethirtyeight.com/features/science-isnt-broken/#part1
Halsey, L.G., Curran-Everett, D., Vowler, S.L., and Drummond, G.B.
The Fickle P Value Generates Irreproducible Results, Nature Methods, March 2015, Vol. 12, No. 3, p. 179. https://www.nature.com/articles/nmeth.3288.pdf?origin=ppub
©2019 ACM 0001-0782/19/09
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2019 ACM, Inc.