Many researchers now rely on machine learning to analyze their data, but the practice may be fueling a wave of irreproducible results.
At a recent meeting of the American Association for the Advancement of Science in Washington, DC, statistician Genevera Allen warned that scientists are using machine learning algorithms to find patterns in data even when the algorithms are merely latching onto noise that will not replicate in repeated experiments.
“Science today is well aware of the reproducibility crisis,” Allen says. “I would even venture to argue that much of it stems from the application of machine learning techniques in science.”
According to Allen, the problem can arise when scientists collect large amounts of genomic data and then apply machine learning algorithms they do not fully understand to identify clusters of similar genomic profiles.
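The failure mode Allen describes can be sketched in a few lines: a clustering algorithm such as k-means always returns the requested number of clusters, even on pure noise, and the clusters found in one sample need not reappear in the next. The example below is an illustration under invented assumptions (random Gaussian data standing in for genomic profiles, k-means as the clustering method), not a reconstruction of any actual study she cites.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)

# Pure Gaussian noise standing in for, e.g., genomic profiles with no
# real subgroups (hypothetical data, not from any actual study).
noise = rng.normal(size=(400, 50))
half_a, half_b = noise[:200], noise[200:]

# k-means returns exactly k clusters whether or not any truly exist.
km_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(half_a)
km_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit(half_b)

# "Replication check": do the clusters discovered on half A describe
# half B? Assign half B's points to half A's centroids and compare
# with the clustering fitted on half B itself.
labels_from_a = km_a.predict(half_b)
labels_from_b = km_b.labels_

# Adjusted Rand index near 0 means chance-level agreement: the
# "discovered" clusters are artifacts of noise and do not replicate.
ari = adjusted_rand_score(labels_from_a, labels_from_b)
print(f"Adjusted Rand index between the two 'studies': {ari:.3f}")
```

Run on real data with genuine subgroups, the same check would give an index close to 1; on structureless noise it hovers near 0, which is exactly the kind of non-replication Allen warns about.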
“Often these studies are not found to be inaccurate until the next very large dataset comes out, someone reapplies the same techniques to it and says, ‘Oh my God, the results of these two studies don’t match,’” Allen continues.
She also argues that the trouble with machine learning is that it finds patterns even where none exist. The solution, she believes, will be a new generation of algorithms that are better at assessing the reliability of the predictions they generate.
“The question is, can we really trust the discoveries being made today by applying machine learning techniques to large datasets?” Allen says. “In many cases the answer is probably ‘not without checking,’ but work is already underway on next-generation machine learning systems that can assess the uncertainty and reproducibility of their predictions.”