The delicate balance between data protection and availability
By Ricardo Roovers
I recently got to say the four “magic” words: graduated and a job! My master’s thesis on the EU regulatory framework around autonomous AI as a medical device is complete, and, luckily, my internship with Aidence has turned into a full-time role.
One aspect of my research was analysing the clinical evaluation and quality management regulations applicable to AI medical devices. Here, I was intrigued by the (emerging) requirements for the data used in the training, validation, and testing of algorithms.
AI development requires large volumes of annotated data. As adoption continues, this data will likely come under increasing scrutiny to ensure that AI devices rely on high-quality, diverse datasets. At the same time, data protection and privacy regulations restrict the availability of data.
Innovations such as medical imaging AI help physicians make faster, more accurate decisions and take better care of their patients. These innovations benefit individuals, so regulations should encourage their development. Data protection should thus be in balance with data availability. In this article, we assess this balance from a regulatory standpoint, focusing on the EU.
Regulations on data and AI
Generally, software as a medical device (SaMD) in the EU must comply with the requirements set out in the EU Medical Device Regulation (MDR). The proposed Artificial Intelligence Act, released earlier this year by the European Commission, introduces additional AI-specific conformity assessments.
The proposal aims to improve patient health and eliminate possible algorithm bias. It states that the datasets used to develop the AI medical devices must be representative of the intended patient population:
“Training, validation and testing data sets shall take into account, to the extent required by the intended purpose, the characteristics or elements that are particular to the specific geographical, behavioural or functional setting within which the high-risk AI system is intended to be used.”
The proposal further requires manufacturers to document systems and procedures for data management throughout the data’s lifecycle (from initial gathering to labelling and retention).
Additionally, the ISO/TC 215 committee of the International Organization for Standardization (ISO) is considering introducing requirements for describing, assessing, and communicating data quality.
It is interesting to note that similar data requirements are also appearing in the US. The American College of Radiology (ACR) and the Radiological Society of North America (RSNA) recommended to the FDA and manufacturers that AI algorithms “should be required to undergo testing using multi-site heterogeneous datasets”.
Unfortunately, these regulations may not have the desired effect of minimising bias because the data they require for AI development is not readily available.
Not enough (annotated) data
Data availability is a challenge within the EU. As an example from our field, except for some smaller datasets, there is currently no large, publicly accessible dataset of chest CTs of EU lung cancer patients.
This puts the EU at a disadvantage, both in its position on the worldwide AI market and in its population’s health. It will be especially problematic if, and when, the new regulations on data and AI come into force.
There are (at least) two factors that have contributed to the lack of data sources needed to develop AI for clinical applications based on EU citizens:
The first factor is data protection law. Data unavailability can primarily be attributed to the General Data Protection Regulation (GDPR). Two of its provisions are particularly high barriers to the collection and usage of health data:
- Purpose limitation: data can only be collected for specific, explicit and legitimate purposes;
- Data minimisation: data must be adequate, relevant and limited to what is necessary for the processing purposes.
Many AI manufacturers are not in direct contact with patients. Hospitals or healthcare practitioners need to ask patients for consent to use their data (e.g. their pseudonymised medical images). However, hospital willingness and internal policies vary. It can be inefficient and resource-intensive, or even impossible, to gather this data.
Furthermore, these provisions restrict the creation of large datasets available to AI developers. They impede EU-based AI manufacturers’ market access and discourage vendors from other regions from providing solutions for EU patients.
As a side note, the US has made several large (anonymised) datasets public. Two examples are the National Lung Screening Trial (NLST) and the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) datasets. We have used these to train and validate the deep learning models in our lung nodule management solution, Veye Lung Nodules.
The second factor is the lack of annotations. Creating datasets for medical AI solutions and making them available does not solve the whole problem. In medical imaging, radiologists must label the obtained scans in order to ‘teach’ the AI model.
For instance, take an algorithm to detect, quantify and classify lung nodules. Its development requires radiologists to segment and classify the nodules in each scan from the dataset. It is a time-consuming and costly operation.
One of my first projects at Aidence is coordinating the acquisition of annotations to update the deep learning models of Veye Lung Nodules. Thus, I experience these challenges first-hand.
Finding the balance
While we firmly support the principles of the GDPR, I think there should always be room for nurturing innovation.
Suppose regulations requiring large, heterogeneous datasets, such as the AI Act, come into force before high-quality, annotated, GDPR-compliant datasets are available. In this scenario, the regulations would hinder medical innovation within the EU.
Moreover, the purpose of data quality requirements (to improve patient health and diminish bias) would be defeated. Many AI manufacturers would no longer be able to provide their solutions because they would be unable to comply with the regulations.
We must balance data protection and data availability to promote innovation and continue to improve healthcare. It is a matter of protecting individuals’ rights, both to privacy and to high standards of care.
The European Coordination Committee of the Radiological, Electromedical and Healthcare IT Industry (COCIR) has also made a note on the subject:
“Mandatory certification of training data against standards establishing the quality of AI training or evaluation data is not an effective mechanism to assure the AI is safe and effective for the target population or that bias is minimised.”
Possible ways forward
I will point out three possible approaches that tackle the data unavailability from different angles.
A (confident) system for data sharing
Institutes in several countries are making an effort to promote access to data by implementing systems and infrastructure that facilitate it. The Dutch competence centre Nictiz is an excellent example: it develops and manages standards that enable electronic information exchange. In the UK, NHSX’s ‘Data Saves Lives’ strategy makes a similar effort.
The European Health Data Space is another promising initiative. The aim is to promote the safe exchange of patients’ data and people’s control over their health data. It encourages access to and use of health data for research, policy-making, and regulation. It also follows a trusted governance framework and upholds data protection rules.
As mentioned above, for the data to become usable for AI development, medical specialists must annotate it. A former colleague argued that governments should step in with funding. Research on the guiding principles for constructing an annotated imaging database for AI has already been conducted. It could serve as a starting point.
Finally, a recent article suggested that sharing health data should be based on confidence rather than trust. Data sharing systems and infrastructures should rely on transparency, accountability, representation, and social purpose.
A second possible way to increase data availability is federated learning. This is a method for multiple medical institutions to train AI models collaboratively and without exchanging datasets.
In this setup, an algorithm is distributed from a central server to remote data silos, such as hospitals. The model is trained on data in these silos and sent back to the main server. Here, the models are aggregated into a global solution.
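The round trip described above can be sketched in a few lines of Python. This is a minimal illustration and not a description of any production system. It assumes a toy linear model and aggregation by sample-weighted averaging (in the style of federated averaging), a method the text above does not specify. The names `local_update`, `aggregate`, `federated_training` and the example silos are hypothetical.

```python
from typing import List, Tuple

Weights = List[float]                 # [slope, intercept] of a toy linear model
Dataset = List[Tuple[float, float]]   # (x, y) pairs held privately by one silo


def local_update(weights: Weights, data: Dataset,
                 lr: float = 0.05, epochs: int = 20) -> Weights:
    """Train a copy of the global model on one silo's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error of y ≈ w * x + b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def aggregate(updates: List[Weights], sizes: List[int]) -> Weights:
    """Average the silo models, weighting each by its local sample count."""
    total = sum(sizes)
    return [
        sum(u[i] * s for u, s in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    ]


def federated_training(silos: List[Dataset], rounds: int = 50) -> Weights:
    """Central server loop: distribute, train locally, aggregate, repeat."""
    global_weights = [0.0, 0.0]
    for _ in range(rounds):
        # Only model weights leave a silo; the raw (x, y) data never does.
        updates = [local_update(global_weights, data) for data in silos]
        global_weights = aggregate(updates, [len(d) for d in silos])
    return global_weights


# Hypothetical silos (e.g. three hospitals), each holding points on y = 2x + 1
silos = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    [(-1.0, -1.0), (0.5, 2.0)],
    [(1.5, 4.0), (3.0, 7.0), (-2.0, -3.0), (0.0, 1.0)],
]
w, b = federated_training(silos)
```

In this toy case all silos share the same underlying relationship, so the global model recovers it. In a real deployment the silos would differ, which is where the challenges discussed next come in.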
Challenges to the implementation arise from differences in imaging acquisition, labelling protocols, and system architecture across data silos. Furthermore, federated learning does not solve all potential privacy issues.
A third possible approach is synthetic data: artificially created data that is comparable to real-life data. Using Generative Adversarial Networks (GANs), different types of synthetic data can be created, such as CT-realistic images of lung nodules.
GANs are models in which two neural networks compete: a generator produces synthetic samples, while a discriminator tries to tell them apart from real ones. Each network improves in response to the other.
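For readers who want the competition stated precisely, the objective of the original GAN formulation is given below. It is shown only as a reference point, since many variants exist. The symbols are not from the text above: G is the network that generates samples from random noise z, D is the network that estimates the probability that a sample is real, and p_data and p_z are the distributions of the real data and of the noise.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}}\left[ \log\left( 1 - D(G(z)) \right) \right]
```

D is trained to increase V by scoring real samples high and generated ones low, while G is trained to decrease it by producing samples that D scores as real.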
Several obstacles make it challenging to use synthetic data to train, validate, and test AI algorithms. For starters, to create synthetic data, we need actual data, which remains scarce. Secondly, GANs trained with biased datasets would also be biased toward overrepresented conditions or populations. Thirdly, AI trained on synthetic data currently shows lower accuracy than AI trained on actual data.
The development of AI-based medical devices relies on access to health data. Current and emerging standards and regulations require diverse data sources for AI. At the same time, the GDPR hampers access to this data. The former aim to protect patient health, the latter patient privacy. Balancing the two is proving to be intricate, but necessary to keep improving healthcare.
The AI Act proposal received over 300 feedback documents on various topics. Our regulatory team co-created one of these responses, filed by the Medical Device / AI Expert Group (MD-AIG) established by the Netherlands Normalisation Institute (NEN). We argued for minimising the burden on manufacturers, caregivers, and Notified Bodies by checking the new requirements against those in the MDR. The conversation on data availability and protection is, as we can see, far from closed.