AI in healthcare: why enough quality data trumps good models

When you’re a data scientist, conferences are a great time to stock up on state-of-the-art knowledge (and random goodies).

The hyper-targeted MIDL (2019 edition, going virtual this year), which counted only a few hundred attendees, left us inspired to make our product even better.

However, most new research presented small improvements to approaches that were already known. After speaking to other attendees, we quickly realised we weren’t the only ones feeling this way. Interestingly, research on traditional computer vision tasks like detection and segmentation was particularly affected: it felt disconnected from clinical practice.

Some notable research presented during MIDL

  • Lossau, T., et al. “Dynamic Pacemaker Artifact Removal (DyPAR) from CT Data using CNNs.” (2018).

Download the academic paper; watch the presentation at MIDL.

  • Souza, R., et al. “A Hybrid, Dual Domain, Cascade of Convolutional Neural Networks for Magnetic Resonance Image Reconstruction.” (2018).

Download the academic paper; watch the presentation at MIDL.

Academia is data starved

All these papers share one simple problem: they use (very) small datasets. This problem is widespread in the biomedical data science community, as the medical specialists who can expertly annotate health data are expensive. It is also troubling: since data from actual hospital patients contains variations that cannot be captured in very small datasets, model performance does not translate to real-world clinical practice.

This problem is well known. Initiatives like Open Health Care by NHS England attempt to resolve it by making huge amounts of data available to everyone. We can only applaud such ambitions, but they solve only half of the problem: huge raw datasets are already publicly available, like the NLST for chest CT scans (Level C on the scale proposed by Hugh Harvey).

The real issue is getting annotated data.

Data matters

To create annotations for medical datasets, researchers need money. The reason is twofold: medical specialists are expensive to hire, and annotation software costs money to build. Academia does not have the funds for this, but companies do.

Good and enough data trumps good models.

This common piece of wisdom is something all data scientists experience every day, and we at Aidence are no exception. For our latest model improvement, we tried a broad range of approaches, including the latest research from MIDL and ICML. Yet the performance increase we achieved was minimal. In the end, we chose the more costly option of having radiologists annotate more data. The extra annotated scans were far more effective than the latest research approaches.

Let there be data

This leaves us in a paradoxical situation. We need academia to find new approaches, but its findings cannot be evaluated for generalisability due to a lack of data. Companies, on the other hand, do have access to annotated data, but they are occupied with building a safe, certified product around the model and integrating it into hospitals.

Neither of these parties is going to solve this on their own.

On the one hand, it is unlikely that academia will free up funding to create publicly available annotated datasets. Some datasets have been published, but more are needed. On the other hand, correct annotations on specific datasets are a huge competitive advantage for AI start-ups. For that reason, Aidence and its peers won’t make their datasets available, to avoid creating their own competition.

It’s necessary for a third party to step in.

We see this as a perfect opportunity for governments to stimulate the development of AI software. Governments are neutral players in the market and eager to fund AI systems. Data needs to come from a large pool of patients to ensure enough variation, and hospital networks cannot pull this off on their own.

To overcome this, governments can act as mediators between hospitals and collect anonymised data from their different PACS systems. The focus should be on creating annotated datasets rather than on building a unified infrastructure or ensuring interoperability, as the latter would only slow down the process. Researchers and professionals who want to use the annotated data are more than capable of pulling it from different sources and doing the cleanup.
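To make the anonymisation step concrete, here is a minimal sketch of the kind of de-identification such a pipeline would apply before scans leave a hospital. The tag names mirror standard DICOM attributes, but the blocklist and the `anonymise` helper are illustrative assumptions, not a complete or compliant de-identification profile.

```python
import hashlib

# Attributes that directly identify a patient (illustrative subset only;
# a real pipeline would follow the DICOM de-identification profiles)
PHI_TAGS = {"PatientName", "PatientID", "PatientBirthDate", "PatientAddress"}

def anonymise(metadata: dict) -> dict:
    """Return a copy of scan metadata with identifying tags removed,
    plus a stable one-way pseudonym so scans from the same patient
    can still be linked across studies."""
    cleaned = {k: v for k, v in metadata.items() if k not in PHI_TAGS}
    original_id = metadata.get("PatientID", "")
    cleaned["PseudonymousID"] = hashlib.sha256(original_id.encode()).hexdigest()[:16]
    return cleaned

# Hypothetical metadata for a single chest CT scan
scan = {
    "PatientName": "Doe^Jane",
    "PatientID": "12345",
    "PatientBirthDate": "19700101",
    "Modality": "CT",
    "StudyDescription": "Chest CT",
}
print(anonymise(scan))
```

The hashed pseudonym matters in practice: longitudinal studies need to follow the same (unidentified) patient across scans, which plain deletion of the ID would make impossible.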

To conclude: governments should fund the annotation of anonymised data by medical specialists and make it publicly available to everyone.

The example to follow

Two big annotated datasets were released by the National Institutes of Health (NIH) under Ronald Summers.




Building clinical applications for the oncology pathway. Insights and opinions on medical imaging AI.
