Federated learning in health research: interview with Frank Martin

In building its rare cancers ecosystem, IDEA4RC is following a federated learning approach. This means that the data of each clinical centre participating in a specific research study will not leave the centre. Instead, the results of the analysis on each centre’s data will be communicated to a central server, where they will be combined with the others in a way that produces results very close to what would have been obtained by running the analysis on the pooled data. This choice is motivated by the need to protect patients’ privacy. Within the IDEA4RC consortium, the work on federated learning is led by the Comprehensive Cancer Centre Netherlands (IKNL), an independent knowledge institute for oncological and palliative care. We spoke to Frank Martin, a software engineer at IKNL who has been involved in the institute’s federated learning projects since the beginning and is now working on IDEA4RC.

What was the first federated learning project you participated in at IKNL?

It was back in 2018, when we were tasked with understanding how we could use federated learning to combine two datasets on the outcomes of patients with throat cancer, one extracted from the Netherlands Cancer Registry and the other from the Taiwanese Cancer Registry.

Taiwan could offer insights into prognostic factors and the effectiveness of treatments, given that the incidence of that type of cancer is much higher there than in the Netherlands. We ran a statistical model, called the Cox proportional hazards model, to find out whether factors such as the patient’s age at diagnosis, tumour stage and grade, the volume of the hospital where they were treated, and the type of treatment influenced survival differently in Taiwan and in the Netherlands.

The results were published two years later in Scientific Reports, and highlighted for instance that the risk of death for patients younger than 60 years, with advanced stage, higher grade or receiving adjuvant therapy after surgery was lower in the Netherlands than in Taiwan. On the other hand, patients older than 70 years, with early stage, lower grade and receiving surgery alone in the Netherlands were at higher risk of death than those in Taiwan.

Did you have to deal with legal issues in this first project?

In that project we didn’t have to deal with legal issues. It was more of a test of a proof-of-concept architecture for conducting a federated data analysis. At the time federated learning wasn’t as popular as it is now, and there still wasn’t agreement on the terminology: some referred to it as distributed computing.

The need to consider the legal framework and to develop legal agreements between the parties involved in the federated learning architecture came up with the STARTER project. STARTER aimed to develop a European registry for rare adult solid cancers, starting from the clinical data derived from diagnostic tests and treatments performed by healthcare providers as part of patient management, as well as from national and regional cancer registries. In that project we further developed the federated architecture and ended up with the first version of the Vantage6 tool, which then evolved into the one we are now using for the IDEA4RC platform.

Federated learning is presented as an easier way to fulfil data protection requirements, but it still poses risks. Can you give us an idea of what these risks are?

Federated learning emerged as an appealing alternative to the centralised approach to data analysis as soon as it became clear that anonymising the data is not sufficient to rule out re-identifying the individual patients who contributed to a specific dataset. It is not enough to simply remove the sensitive metadata about a patient, such as their name, surname, gender and date of birth. Federated learning looked like a promising way to analyse large and diverse datasets without risking violating patients’ privacy. With rare diseases, obtaining large and diverse datasets requires combining data from multiple sources, and this is where federated learning comes into play. Say you want to compute the average age of the patients with a specific type of rare cancer. In a federated learning approach, you ask each centre to communicate to a central server, or aggregator, the number of patients with that cancer and the sum of their ages. The central server can then easily compute the average age across all the centres just by adding up the sums of the ages and dividing the result by the total number of patients. This is an example of how federated learning would work to compute a very simple statistic, but there are methods to conduct more complex statistical analyses with the same approach.
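
The average-age example can be sketched in a few lines of Python. This is only an illustration of the idea, not Vantage6 code: the names `PartialResult`, `local_summary` and `aggregate_mean`, as well as the ages, are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class PartialResult:
    """The only information that leaves a centre."""
    n_patients: int
    age_sum: float


def local_summary(ages):
    """Runs inside a centre: reduces patient-level ages to a count and a sum."""
    return PartialResult(n_patients=len(ages), age_sum=float(sum(ages)))


def aggregate_mean(partials):
    """Runs on the central server: combines the per-centre counts and sums."""
    total_n = sum(p.n_patients for p in partials)
    if total_n == 0:
        raise ValueError("no patients reported by any centre")
    return sum(p.age_sum for p in partials) / total_n


# Hypothetical patient ages held by two centres (never sent to the server).
centre_a = [54, 61, 70]
centre_b = [48, 66]

partials = [local_summary(centre_a), local_summary(centre_b)]
federated_mean = aggregate_mean(partials)

# For comparison only: the mean a centralised analysis would compute.
pooled_mean = sum(centre_a + centre_b) / len(centre_a + centre_b)
```

For a mean, the federated result matches the pooled one, because the sum and the count are all the statistic needs. More complex models generally require several rounds of exchange between the centres and the server.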

However, there are cases where it is possible to deduce sensitive features of individual patients from the results of the analysis, for example by running the same analysis on different cohorts of patients, changing the inclusion criteria until you find the combination of criteria that excludes just a single patient. This attack strategy is called differencing. There are several other attack strategies, especially for deep neural networks, such as reconstruction, model inversion or deep leakage from gradients. This is why legal safeguards are needed for federated learning systems as well.
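
A toy illustration of differencing, assuming a hypothetical centre that answers sum-and-count queries for any inclusion criterion. The records, field names and the `sum_and_count` helper are invented for the example.

```python
# Hypothetical patient records held by one centre.
patients = [
    {"age": 54, "stage": "II"},
    {"age": 61, "stage": "III"},
    {"age": 70, "stage": "IV"},
    {"age": 48, "stage": "II"},
]


def sum_and_count(records, include):
    """Aggregate query: returns only the sum of ages and the cohort size."""
    selected = [r for r in records if include(r)]
    return sum(r["age"] for r in selected), len(selected)


# Query 1: every patient.
sum_all, n_all = sum_and_count(patients, lambda r: True)

# Query 2: same analysis, but the criteria exclude stage IV patients.
sum_rest, n_rest = sum_and_count(patients, lambda r: r["stage"] != "IV")

# If the two cohorts differ by exactly one patient, subtracting the two
# aggregates reveals that patient's age, although neither query returned
# patient-level data.
if n_all - n_rest == 1:
    leaked_age = sum_all - sum_rest
```

Note that a minimum cohort size alone does not stop this attack, since both cohorts can be large. What matters is how little they differ from each other.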

How can these risks be managed and mitigated?

Some of these threats cannot be avoided completely, but you can run checks beforehand and throughout the whole sequence of analyses to spot potentially malicious users. Another strategy to mitigate the risk of re-identification in the federated learning framework is called differential privacy. When using differential privacy, you add noise to the data from each source, which prevents re-identification without substantially changing the statistical distribution of the data.
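
As a rough sketch of the idea, the snippet below applies the Laplace mechanism, one common way of achieving differential privacy, to the count and sum that each centre would share in the average-age example. The interview does not say which mechanism Vantage6 or IDEA4RC uses. The function names, the age bound `age_max` and the privacy budget `epsilon` are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centred on 0."""
    # The difference of two independent exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_local_summary(ages, epsilon, age_max=100.0, rng=None):
    """Runs inside a centre: returns a noisy (sum of ages, count) pair."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    # Clip ages to [0, age_max] so one patient can change the sum by at
    # most age_max (and the count by at most 1).
    clipped = [min(max(float(a), 0.0), age_max) for a in ages]
    # Split the privacy budget evenly between the two released numbers.
    eps_each = epsilon / 2.0
    noisy_count = len(clipped) + laplace_noise(1.0 / eps_each, rng)
    noisy_sum = sum(clipped) + laplace_noise(age_max / eps_each, rng)
    return noisy_sum, noisy_count


# Hypothetical usage: two centres share noisy summaries with the server.
rng = random.Random(0)
summaries = [
    noisy_local_summary(ages, epsilon=1.0, rng=rng)
    for ages in ([54, 61, 70], [48, 66])
]
total_sum = sum(s for s, _ in summaries)
total_count = sum(n for _, n in summaries)
noisy_mean = total_sum / total_count if total_count > 0 else float("nan")
```

With cohorts as small as in this toy example the noise swamps the signal. The trade-off becomes workable only when each centre contributes many patients or a looser privacy budget is acceptable.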

For Vantage6, we constructed a security and privacy document for the infrastructure to map these risks. In addition, we analyse every algorithm separately to find its specific risks. These documents are then shared with the data station owners before doing any analysis.

What are the other hurdles that federated learning needs to overcome?

Data harmonisation is one of the most important hurdles, and it is what colleagues working in the IDEA4RC project are trying to overcome with dedicated tools. In the first federated project involving the Netherlands and Taiwan cancer registries we spent a great deal of time arranging the data so that they could be analysed in combination. The IDEA4RC common data model and the ETL engines that are being designed aim to automate that process. This promises not only to save time but also to make this kind of ecosystem sustainable in the long term and thus capable of making an impact.

How different is it for end users, such as clinical data analysts and researchers, to conduct a federated analysis rather than a centralised one?

It is a completely different experience, at least with the tools we have today. However, these tools are evolving and increasingly take the needs of users into account. One should always keep in mind that federated analytics tools have only been in development for a few years, whereas the centralised tools that data scientists are accustomed to have been around for decades.

The main difference lies in the exploratory phase of the data analysis, where researchers try to make sense of the data through various visualisations and statistics. Some of these can also be reproduced with federated learning tools, but others, such as scatter plots or outlier visualisations, cannot, because they would involve exchanging patient-level information.

We started to work on the user interface of Vantage6 in the BlueBerry project, which picked up the baton from STARTER to upscale and expand the European rare adult solid cancer registry. In BlueBerry we developed an interface that allows users to interact with the system without any programming experience.

In IDEA4RC we are working to further improve the user experience through the introduction of data frames. These are constructed from the original source to give users much more insight into what kind of data they are dealing with. It is also a stepping stone towards more real-time processing, as we do not have to query the data source every time we compute something. I believe that, with some more development, we can get extremely close to reproducing the analysis routine researchers are used to.

This improved version of the Vantage6 user interface will then be integrated with the rest of the IDEA4RC virtual assistant, which is being developed by colleagues at Universidad Politécnica de Madrid.

In April this year the European Parliament adopted a final version of the European Health Data Space (EHDS), which introduces a new framework for the primary and secondary use of health data. Do you believe that the federated approach is better positioned than the centralised one to fulfil the EHDS requirements on data protection and patients’ privacy?

I think we will need to wait and see how these two different approaches perform as the EHDS is adopted by the different Member States and enters into force. We know that with the GDPR, the lack of homogeneity in the way Member States interpreted the Regulation impaired the re-use of health data for research. At the same time, we should be aware that federated learning is just one of several privacy-enhancing technologies that will need to work in concert. Additionally, I don’t believe that a single approach will prevail in the end. It really depends on what kind of use case you need to address.