The reproducibility issues that haunt health-care AI
Each day, around 350 people in the United States die from lung cancer. Many of those deaths could be prevented by screening with low-dose computed tomography (CT) scans. But scanning millions of people would produce millions of images, and there aren’t enough radiologists to do the work. Even if there were, specialists regularly disagree about whether images show cancer or not. The 2017 Kaggle Data Science Bowl set out to test whether machine-learning algorithms could fill the gap.
An online competition for automated lung-cancer diagnosis, the Data Science Bowl provided chest CT scans from 1,397 patients to hundreds of teams, which used the images to develop and test their algorithms. At least five of the winning models demonstrated accuracy exceeding 90% at detecting lung nodules. But to be clinically useful, those algorithms would have to perform equally well on multiple data sets.
To test that, Kun-Hsing Yu, a data scientist at Harvard Medical School in Boston, Massachusetts, acquired the ten best-performing algorithms and challenged them on a subset of the data used in the original competition. On these data, the algorithms topped out at 60–70% accuracy, Yu says; in some cases, they were effectively coin tosses1. “Almost all of these award-winning models failed miserably,” he says. “That was kind of surprising to us.”
But maybe it shouldn’t have been. The artificial-intelligence (AI) community faces a reproducibility crisis, says Sayash Kapoor, a PhD candidate in computer science at Princeton University in New Jersey. As part of his work on the limits of computational prediction, Kapoor discovered that reproducibility failures and pitfalls had been reported in 329 studies across 17 fields, including medicine. He and a colleague organized a one-day online workshop last July to discuss the subject, which attracted about 600 participants from 30 countries. The resulting videos have been viewed more than 5,000 times.