IDEA4RC

Intelligent ecosystem to improve the governance, the sharing, and the re-use of health data for rare cancers

Newsletter

August 1, 2024

Hello,

In this newsletter we introduce you to the federated learning approach to health data analysis by interviewing Frank Martin, software engineer at the Netherlands Comprehensive Cancer Organisation (IKNL). Frank, together with his colleagues, is working on the software component that will allow IDEA4RC users to analyse data sets stored in different clinical centres in combination, without moving them from their location. He is building on the work on federated learning done by his institute over the last few years, which led to the development of the Vantage6 tool. An upgraded version of Vantage6 will be integrated into the IDEA4RC ecosystem.

If you missed the previous issue, where we talked with Lidia Villanova, project manager at Alliance Against Cancer, and Ariane Weinman, member of the public affairs team at EURORDIS, about the launch of the IDEA4RC Community of Interest and patients’ engagement in research, you can find it here.

By subscribing to the newsletter, you will be receiving bi-monthly updates on the project’s advancements. If you want to invite your friends to subscribe, send them this link.

Federated learning in health research: interview with Frank Martin

 

In building its rare cancers ecosystem, IDEA4RC is following a federated learning approach. This means that the data of each clinical centre participating in a specific research study will not leave the centre. Instead, the results of the analysis on each centre’s data will be communicated to a central server, where they will be combined with the others in a way that produces results very close to what would have been obtained by running the analysis on the pooled data. This choice is motivated by the need to protect patients’ privacy. Within the IDEA4RC consortium, the work on federated learning is led by the Netherlands Comprehensive Cancer Organisation (IKNL), an independent knowledge institute for oncological and palliative care. We spoke to Frank Martin, software engineer at IKNL, who has been involved in IKNL projects on federated learning since the beginning and is now working on IDEA4RC.

What was the first federated learning project you participated in at IKNL?

It was back in 2018, when we were tasked with understanding how we could use federated learning to combine two datasets concerning the outcomes of patients with throat cancer, one extracted from the Netherlands Cancer Registry and the other from the Taiwanese Cancer Registry.

Taiwan could offer insights into prognostic factors and the effectiveness of treatments, given that the incidence of that type of cancer is much higher there than in the Netherlands. We ran a statistical model, the Cox proportional hazards model, to find out whether factors such as the patient’s age at diagnosis, tumour stage and grade, the volume of the hospital where patients were treated, and the type of treatment influenced the survival rate differently in Taiwan and in the Netherlands.

The results were published two years later in Scientific Reports, and highlighted for instance that the risk of death for patients younger than 60 years, with advanced stage, higher grade or receiving adjuvant therapy after surgery was lower in the Netherlands than in Taiwan. On the other hand, patients older than 70 years, with early stage, lower grade and receiving surgery alone in the Netherlands were at higher risk of death than those in Taiwan.

Did you have to deal with legal issues in this first project?

In that project we didn’t have to deal with legal issues; it was more a test of a proof-of-concept architecture for conducting a federated data analysis. At the time federated learning wasn’t as popular as it is now, and there still wasn’t agreement on the terminology: some referred to it as distributed computing.

The need to consider the legal framework and to develop legal agreements between the parties involved in the federated learning architecture came up with the STARTER project. STARTER aimed at developing a European registry for rare adult solid cancers, starting from the clinical data derived from diagnostic tests and treatments performed by healthcare providers as part of patient management, as well as from national and regional cancer registries. In that project we further developed the federated architecture and ended up with the first version of the Vantage6 tool, which then evolved into the one we are now using for the IDEA4RC platform.

Federated learning is presented as an easier way to fulfil data protection requirements, but it still poses risks. Can you give us an idea of what these risks are?

Federated learning emerged as an appealing alternative to the centralised approach to data analysis as soon as it became clear that anonymising the data is not sufficient to rule out re-identifying the individual patients who contributed to a specific data set. It is not enough to simply remove the sensitive metadata about a patient, such as their name, surname, gender and date of birth. Federated learning looked like a promising way to analyse large and diverse datasets without risking violating patients’ privacy. With rare diseases, obtaining large and diverse datasets requires combining data from multiple sources, and this is where federated learning comes into play. Say you want to compute the average age of the patients with a specific type of rare cancer. In a federated learning approach, you ask each centre to communicate to a central server, or aggregator, the number of patients with that cancer and the sum of their ages. The central server can then easily compute the average age across all the centres just by adding up the sums of the ages and dividing the result by the total number of patients. This is an example of how federated learning would work to compute a very simple statistic, but there are methods to conduct more complex statistical analyses with the same approach.
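The average-age example can be written down in a few lines of Python. This is only a minimal sketch of the idea and not how Vantage6 implements it: the names PartialResult, local_summary and aggregate_mean, as well as the ages, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialResult:
    """What a single centre sends to the aggregator: no patient-level rows."""
    n_patients: int
    age_sum: float


def local_summary(ages):
    # Runs inside each centre, on data that never leaves it.
    return PartialResult(n_patients=len(ages), age_sum=float(sum(ages)))


def aggregate_mean(partials):
    # Runs on the central server, which only sees counts and sums.
    total_n = sum(p.n_patients for p in partials)
    if total_n == 0:
        raise ValueError("no patients reported by any centre")
    return sum(p.age_sum for p in partials) / total_n


# Hypothetical ages held by three centres (illustrative values only).
centre_a = [54, 61, 70]
centre_b = [48, 66]
centre_c = [59, 73, 62, 57]

partials = [local_summary(ages) for ages in (centre_a, centre_b, centre_c)]
federated_mean = aggregate_mean(partials)

# For comparison: the mean a centralised analysis would compute on pooled data.
pooled = centre_a + centre_b + centre_c
pooled_mean = sum(pooled) / len(pooled)
```

For the mean, the federated and the pooled results coincide; for more complex models the combined result is usually an approximation of the pooled one.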

However, there are cases where it is possible to deduce sensitive features of individual patients starting from the results of the analysis, for example by running the same analysis on different cohorts of patients, changing the inclusion criteria until you find the combination that excludes just a single patient. This attack strategy is called differencing. There are several other attack strategies, especially for deep neural networks, such as reconstruction, model inversion or deep leakage from gradients. This is why legal safeguards are needed for federated learning systems too.
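A differencing attack on the average age can be sketched as follows. The patient records, the cohort_query helper and the choice of tumour stage as the inclusion criterion are all hypothetical; the point is only to show that two aggregate results which differ by one patient are enough to recover that patient's value.

```python
def cohort_query(patients, include):
    """Return only aggregates (count and mean age) for the selected cohort."""
    cohort = [p for p in patients if include(p)]
    return len(cohort), sum(p["age"] for p in cohort) / len(cohort)


# Hypothetical records held by one centre; the analyst never sees these rows.
patients = [
    {"age": 52, "stage": "II"},
    {"age": 67, "stage": "III"},
    {"age": 71, "stage": "II"},
    {"age": 45, "stage": "I"},
    {"age": 80, "stage": "IV"},
]

# Query 1: every patient. Query 2: criteria chosen so that one patient drops out.
n_all, mean_all = cohort_query(patients, lambda p: True)
n_rest, mean_rest = cohort_query(patients, lambda p: p["stage"] != "IV")

# The two cohorts differ by exactly one patient, so the difference between
# the two age sums is that patient's age.
recovered_age = n_all * mean_all - n_rest * mean_rest
```

Each query on its own looks harmless; it is the combination of the two that leaks the age of the single stage IV patient.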

How can these risks be managed and mitigated?

Some of these threats cannot be avoided completely, but you can run checks beforehand and throughout the whole sequence of analyses to spot potentially malicious users. Another strategy to mitigate the risk of re-identification in the federated learning framework is called differential privacy. When using differential privacy, you add noise to the data from each source, which prevents re-identification while largely preserving the statistical distribution of the data.
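One common way to add such noise is the Laplace mechanism, sketched below for the count and age sum that a centre would report. This is an illustrative assumption, not a description of what Vantage6 or IDEA4RC do: the function names, the age bound of 120 and the even split of the privacy budget epsilon between the two values are all choices made for the example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_local_summary(ages, epsilon, max_age=120, rng=None):
    """Return a noisy (count, age_sum) pair for one centre.

    Assumptions of this sketch: ages are clipped to [0, max_age] so that one
    patient can change the sum by at most max_age, and the budget epsilon is
    split evenly between the count and the sum.
    """
    rng = rng or random.Random()
    clipped = [min(max(age, 0), max_age) for age in ages]
    eps_each = epsilon / 2
    noisy_count = len(clipped) + laplace_noise(1.0 / eps_each, rng)
    noisy_sum = sum(clipped) + laplace_noise(max_age / eps_each, rng)
    return noisy_count, noisy_sum


# Hypothetical centre data; a smaller epsilon means more noise and more privacy.
rng = random.Random(2024)
noisy_count, noisy_sum = noisy_local_summary([54, 61, 70], epsilon=1.0, rng=rng)
```

The noise makes a single patient's contribution hard to isolate, at the cost of less precise results, so the value of epsilon has to be weighed against the size of the cohorts involved.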

For Vantage6, we constructed a security and privacy document for the infrastructure to map these risks. In addition, we analyse every algorithm separately to find its specific risks. These documents are then shared with the data station owners before doing any analysis.

What are the other hurdles that federated learning needs to overcome?

Data harmonisation is one of the most important hurdles and is what colleagues working in the IDEA4RC project are trying to overcome with dedicated tools. In the first federated project involving the Netherlands and Taiwan cancer registries we spent a great deal of time arranging the data so that they could be analysed in combination. The IDEA4RC common data model and the ETL engines that are being designed aim to automate that process. This promises not only to save time but also to make this kind of ecosystem sustainable in the long term and thus capable of making an impact.

How different is it for end users, such as clinical data analysts and researchers, to conduct a federated analysis rather than a centralised one?

It is a completely different experience, at least with the tools we have today. However, they are evolving and increasingly considering the needs of the users. One should always keep in mind that federated analytics tools have only been developed for a few years, whereas the centralized tools that data scientists are accustomed to have been around for decades.

The main difference lies in the exploratory phase of the data analysis, where researchers try to make sense of the data through various visualizations and statistics. Some of these can also be reproduced with federated learning tools, but others, such as scatter plots or outlier visualizations, cannot, because they would involve exchanging patient-level information.

We started to work on the user interface of Vantage6 in the BlueBerry project, which picked up the baton from STARTER to upscale and expand the European rare adult solid cancer registry. In BlueBerry we developed an interface that allows users to interact with the system without any programming experience.

In IDEA4RC we are working to further improve the user experience through the introduction of data frames. These are constructed from the original source to give the user much more insight into what kind of data they are dealing with. They are also a stepping stone towards more real-time processing, as we do not have to query the data source every time we compute something. I believe that, with some more development, we can get extremely close to reproducing the analysis routine users are accustomed to.

This improved version of the Vantage6 user interface will then be integrated with the rest of the IDEA4RC virtual assistant, which is being developed by colleagues at Universidad Politécnica de Madrid.

In April this year the European Parliament adopted a final version of the European Health Data Space (EHDS), which introduces a new framework for the primary and secondary use of health data. Do you believe that the federated approach is better positioned to fulfil the EHDS requirements on data protection and patients’ privacy than the centralised one?

I think we will need to wait and see how these two different approaches perform as the EHDS is adopted by the different Member States and enters into force. We know that with the GDPR the lack of homogeneity in the way Member States interpreted the Regulation impaired the re-use of health data for research. At the same time, we should be aware that federated learning is just one of several privacy-enhancing technologies that will need to work in concert. Additionally, I don’t believe that a single approach will prevail in the end; it really depends on the use case you need to address.

 

Meetings, results and updates

 

Eugenio Gaeta and Giuseppe Fico, researchers at Universidad Politécnica de Madrid, together with Roberta Gazzarata, Giorgio Cangioli and Catherine Chronaki of HL7 Europe, co-authored a new publication in the International Journal of Medical Informatics. They reviewed the English-language scientific and grey literature from 2017 to 2023, ending up with 93 scientific papers on the use of HL7 FHIR for chronic disease management and 35 HL7 FHIR Implementation Guides on the same topic. These primarily concerned cancer (45%), cardiovascular diseases (more than 15%) and diabetes (almost 15%). Most articles come from Europe, with Germany and Italy at the top of the list, followed by the Americas and the USA. The analysis also indicates that the popularity of HL7 FHIR as a robust technical interface standard for the health sector has been steadily rising since its inception in 2010, reaching a peak in 2021. Find out more here.

 

On July 16th Claudia Egher, researcher at Utrecht University, presented the work done so far in IDEA4RC together with Susan van Hees and Wouter Boon during the panel “Unexpected ways of knowledge production. Spaces for co-creation in Research Infrastructures” at the quadrennial joint meeting of the European Association for the Study of Science and Technology and the Society for Social Studies of Science (EASST4S2024). You can read the abstract here.

 

IDEA4RC will participate in Vitalis 2025, the Nordic region’s leading conference and trade fair on the future of healthcare. At Vitalis 2025, IDEA4RC will present both an opening keynote on May 19 and a conference track on May 21. “We are proud to present our AI-assisted data ecosystem for the first time more broadly. This is an important step towards improving care for patients with rare cancers. IDEA4RC’s data ecosystem is groundbreaking and will revolutionise the way we manage health data. By participating in Vitalis, we can discuss challenges and opportunities to improve the care of patients with rare diseases with other key actors at the conference. In this way, we hope for a broader implementation of our results”, said Andreas Muth, Head of the Department of Surgery at Sahlgrenska University Hospital and responsible for IDEA4RC in Sweden. You can find the press release here.

 

What’s up in health data sharing and reuse in the EU

 

In an opinion piece published in EURACTIV, Jean-Marc Bourez, CEO of the European Institute of Innovation and Technology – Health (EIT-Health), presented the main results of a stakeholder consultation on the challenges in the implementation phase of the European Health Data Space. The consultation, whose results were summarised in a report published in April, involved organisations across business, research, and education, as well as healthcare leaders and practitioners. Among the practical, political and logistical hurdles to an effective implementation of the EHDS identified by the report, two are pervasive. First, data collection practices and the degree of digitalisation of the national health systems vary widely across the EU. To enable the EHDS to function effectively, interoperability is needed, requiring modernised data infrastructure and improved connectivity between health institutions and data access bodies. Second, stakeholders were clear that they want to be involved in the implementation process, which should also engage patients and citizens. The need for continuing education that places data at the forefront of healthcare delivery should also be recognised and addressed. You can read the full piece here.

 

May 2024 marked the start of TEHDAS2, the joint action to promote the secondary use of health data in the EU. Over 60 organisations from 29 European countries are participating in the action. The TEHDAS2 project will create guidelines and technical specifications to enable smooth cross-border use of health data. The guidelines and technical specifications are aimed at data holders and data users, as well as health data access bodies, which are new entities to be established by the Member States when the EHDS comes into force. TEHDAS2 continues and builds on the work carried out in the previous joint action, TEHDAS, which ended in July 2023, as well as the ongoing HealthData@EU Pilot project testing the EHDS infrastructure. You can read the news on the TEHDAS2 website here.