Major breakthroughs in AI have seen machines being entrusted with business and safety-critical decisions, from guiding vehicles to diagnosing diseases. Yet a reproducibility crisis is creating a cloud of uncertainty over the entire field, eroding the confidence on which the AI economy depends.
Reproducibility, the extent to which an experiment can be repeated with the same results, is the basis of quality assurance in science because it enables past findings to be independently verified, building a trustworthy foundation for future discoveries. This is crucial because previous breakthroughs are the baseline against which all subsequent progress is measured.
Without the capacity to reproduce past results, the entire basis on which machines are increasingly making legal, corporate and even medical decisions is called into question. This could stop us from being able to benefit from some of the greatest advances in the field, from the AIs that power smart cities to those that find new drug treatments.
For example, deep reinforcement learning (RL), whereby machines learn through trial and error, refining their behavior according to rewards and penalties, could enable driverless cars to endlessly crisscross virtual roads until they learn to safely change lanes in the real world. Yet researchers have found that RL results are notoriously difficult to reproduce, raising questions over whether the technique can be relied on to ensure road safety. An analysis of 30 AI papers similarly found that the majority were difficult to reproduce because key records of their methodologies were missing, from training data sets to study parameters.
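That seed sensitivity can be seen even in a toy setting. The sketch below is a minimal epsilon-greedy bandit, not any production driving system; the arm payoffs and all names are invented for illustration. The same trial-and-error algorithm, run twice with different random seeds, learns different value estimates, which is exactly what makes unrecorded seeds a reproducibility hazard:

```python
import random

def train_bandit(seed, steps=500, epsilon=0.1):
    """Tiny epsilon-greedy bandit: learn action values by trial and error."""
    rng = random.Random(seed)
    true_means = [0.3, 0.5, 0.7]   # hidden arm payoffs, invented for the demo
    estimates = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:               # explore: try a random action
            arm = rng.randrange(3)
        else:                                    # exploit: pick the best-looking action
            arm = max(range(3), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental running mean of observed rewards for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates

run_a = train_bandit(seed=1)
run_b = train_bandit(seed=2)   # same algorithm, different seed, different estimates
```

With the seed fixed, every rerun matches exactly; with the seed unrecorded, two honest reruns of "the same experiment" diverge.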
As a result, Google researcher Ali Rahimi has likened AI to alchemy. Just as alchemy mixed genuine discoveries, such as advances in glassmaking, with false cures, AI has identified potential cancer treatments yet still fails at tasks as basic as distinguishing masks from faces.
Lack Of Traceability
The fundamental problem is that data science is not governed by the same generally accepted standards of quality assurance as other fields of science. As a result, the data trail charting the road from the origins of AI to its latest iterations is shrouded in mystery.
There are currently no universal standards governing the data capture, curation and processing techniques that give vital meaning and context to AI experiments. This is the equivalent of climate scientists investigating global warming without any rules on how to document the locations or units of temperature readings.
This is particularly concerning because developing a machine-learning tool involves so many iterations, and there is no universal benchmark of good practice for implementing and recording them all. A single experiment to create a facial-recognition system involves a complex layer cake of processes, from training runs to software updates, file changes and tweaks to the algorithm. If any of these steps is not meticulously recorded, modifying the AI or reproducing the original results becomes painstaking, if not impossible.
This stifles innovation because any changes that need to be made to an AI to meet new requirements or create new applications involve the costly process of attempting to retrace all the individual steps that came before. It also means there’s no way of measuring improvement; how can AI labs demonstrate progress if they cannot even reproduce earlier successes?
Stretching The Limits Of Inference
Another problem is that AI experiments often involve humans repeatedly running AI models until they find patterns in data, like a conspiracy theorist who finds spurious correlations between unrelated phenomena because that is what they are looking for. This causes AI experiments to draw false inferences from data: machines cannot distinguish correlation from causation, and the more a machine searches for patterns, the more it will find.
For example, if an AI analyzing demographic data for links between cancer and veganism found no link, the test could be adjusted to examine whether vegans with different hair colors are more likely to get cancer. The machine might find that red-haired vegans were overrepresented among cancer sufferers, but this would obviously not prove a causal link. The lack of any universally accepted limits on inference means machines end up prioritizing inductive pattern-finding over deductive reasoning, which severely impedes reproducibility.
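The effect is easy to demonstrate with pure noise. In the sketch below the "cancer" labels are entirely random, the subgroups are invented on the fly (the red-haired vegans of the example above), and the 5-point gap is an arbitrary stand-in for a significance test; repeated searching still turns up subgroups with an apparently elevated rate:

```python
import random

rng = random.Random(0)
n = 200
# Purely random labels: by construction, no real relationship exists anywhere.
cancer = [rng.random() < 0.1 for _ in range(n)]

spurious_hits = 0
tests = 100
for _ in range(tests):
    # Each iteration invents a new arbitrary subgroup and re-tests the same labels
    group = [rng.random() < 0.5 for _ in range(n)]
    in_rate = sum(c for c, g in zip(cancer, group) if g) / max(sum(group), 1)
    out_rate = sum(c for c, g in zip(cancer, group) if not g) / max(n - sum(group), 1)
    if abs(in_rate - out_rate) > 0.05:   # crude "finding" threshold
        spurious_hits += 1

print(spurious_hits)   # a nonzero count: "findings" mined from pure noise
```

Search long enough and the noise obliges; without declared limits on how many hypotheses may be tested, every data set contains "discoveries".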
Market incentives can also impede reproducibility. AI labs are often encouraged by parent companies to get newsworthy results by any means and make them difficult to copy. This encourages researchers to prioritize research outputs over methods and to conceal crucial aspects of their workings.
From Alchemy To Science
AI has the capacity to transform our economy and society, increasing the speed and success of human decisions by extracting value from data faster than ever before. Yet we cannot give so many vital human tasks to machines unless we can verify the data behind their decisions.
All new advances in AI must be fully documented and reproducible so that we have a complete and traceable data chain to help understand its failures and build on its successes.
To achieve this, we need universally accepted frameworks governing the digital economy, similar to those governing other fields of science. Just as every phase of drug development is documented from lab to market, AI needs to be recorded throughout its life cycle. There need to be clear limits on inference governing autonomous data analysis, ensuring quality over quantity in both training data and study parameters. All training data sets must be contextualized and designed within clear parameters.
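As one hypothetical illustration of what "contextualized within clear parameters" could look like in practice, a training set might ship with a minimal machine-readable datasheet recording its source, collection method and units; every field name and value below is invented for the example:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetRecord:
    """A minimal 'datasheet' capturing the context a training set needs."""
    name: str
    version: str
    source: str               # where the data was captured
    collection_method: str    # how it was captured and curated
    units: dict               # units/encodings for each measured field
    known_limitations: list = field(default_factory=list)

record = DatasetRecord(
    name="city_traffic_frames",
    version="2024.1",
    source="municipal CCTV feed (illustrative)",
    collection_method="sampled one frame per minute, manually labelled",
    units={"speed": "km/h", "timestamp": "ISO 8601 UTC"},
    known_limitations=["daylight footage only"],
)
print(asdict(record))   # serializable record that can travel with the data set
```

Like the temperature readings in the climate analogy above, the measurements only mean something once their location and units are pinned down alongside them.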
Best practice methodologies such as RAPIDE should be codified into international gold standards, governing everything from data analytics to the appropriate methods for combining different data streams. Research funding should be dependent on AI labs demonstrating compliance with these standards. All new AI platforms should be quality tested and certified in the same way that physical goods are subject to rigorous quality controls at customs.
This will incentivize best practices in data science, increasing the commercial value and legal standing of AI decisions. Most importantly, it will ensure that AI is built on verifiable and trustworthy foundations.
Written for Forbes by Matt Jones, Lead Analytics Strategist at Tessella.