

The following research topics are currently being pursued.

Privacy Preserving Distributed Analysis


In the era of big data, huge amounts of medical and health-related data are constantly generated by medical centers, institutes, and organizations, usually located at multiple sites. Handling such large distributed data is challenging, yet of great importance for analytics and machine learning modeling in the medical field. There are two main approaches to analyzing data across multiple locations: pooled and distributed. In pooled data analysis, all data are collected in a central location before being analyzed. This approach requires less computing power and time; however, the data owners may not be willing to share all or part of their data. In distributed data analysis, on the other hand, data do not leave the stations that generate them and are analyzed at their original locations. In recent years, the Personal Health Train (PHT) has emerged as a novel concept for this paradigm shift, addressing the distributed analysis of large-scale medical data.

In this project, we aim to leverage the unique advantages of PHT to enable privacy-preserving record linkage. Our goal is to identify duplicates across several locations and demonstrate the impact of duplicate elimination on the entire data analysis process.
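One widely used building block for privacy-preserving record linkage is Bloom filter encoding, the subject of the publication listed under Relevant Publications in this section. The sketch below is a minimal illustration and not the project's implementation: the function names, the shared secret, the filter size of 256 bits, and the four hash functions are all assumptions chosen for readability. Each site would encode an identifier such as a surname into bit positions derived from keyed hashes of its character bigrams, and only these encodings would be compared, here with the Dice coefficient.

```python
import hashlib


def bigrams(value):
    """Split a normalized, padded string into its set of character bigrams."""
    padded = "_" + value.strip().lower() + "_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, size=256, num_hashes=4):
    """Return the bit positions set in a Bloom filter for `value`.

    `secret`, `size`, and `num_hashes` are illustrative parameters; all
    sites must agree on them for the encodings to be comparable.
    """
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            data = "{}|{}|{}".format(secret, k, gram).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return frozenset(positions)


def dice_similarity(a, b):
    """Dice coefficient of two sets of bit positions (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = "shared-secret"  # hypothetical key agreed on by all sites
    site_a = bloom_encode("Smith", key)
    site_b = bloom_encode("Smyth", key)
    site_c = bloom_encode("Jones", key)
    print("Smith vs Smyth:", dice_similarity(site_a, site_b))
    print("Smith vs Jones:", dice_similarity(site_a, site_c))
```

Records whose similarity exceeds a chosen threshold would be treated as candidate duplicates. Which threshold and filter parameters work well depends on the data and has to be assessed empirically.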

Figure: Top-level overview of the distributed record linkage process using the Personal Health Train.
Involved Team Members: Maximilian Jugl, Navid Shekarchizadeh, Toralf Kirsten

Cooperation Partners: Sascha Welten (RWTH Aachen), Yongli Mou (RWTH Aachen), Oya Beyan (University Hospital Cologne), Samira Zeynalova (IMISE, Leipzig University), Florens Rohde (Database Chair, Leipzig University), Erhard Rahm (Database Chair, Leipzig University)

Relevant Publications

Jugl, Maximilian, Toralf Kirsten, and Ulrich Sax. "Assessment of Bloom Filter Parameters for Privacy-Preserving Record Linkage". 67. Jahrestagung Der Deutschen Gesellschaft Für Medizinische Informatik, Biometrie Und Epidemiologie e. V. (GMDS), 13. Jahreskongress Der Technologie- Und Methodenplattform Für Die Vernetzte Medizinische Forschung e.V. (TMF), 2022.

Synthetic Data Generation in the Medical Domain


Synthetic data generation has become particularly important in the medical field for two main reasons. First, medical data are not readily accessible to researchers due to patient privacy and data protection regulations. Synthetic data generation can address this issue by providing artificial data that resemble real medical data without being associated with real patients. Second, in cases such as rare diseases, only a few data records are available, making diagnosis or treatment difficult even for experts. Synthetic data can improve this situation and enhance the efficiency of data analysis, as modern artificial intelligence methods require a sufficient amount of data to achieve optimal results.

In this project, we use state-of-the-art generative deep learning models, in particular Generative Adversarial Networks (GANs), to generate synthetic data relevant to medical applications. The models are applicable to both structured (tabular) medical data and medical images. The quality of the generated data is evaluated using conventional evaluation metrics. For tabular data, for example, we measure the distribution of each feature and the pairwise correlations among features and compare them with those of the corresponding real data. Furthermore, we investigate evaluation frameworks with different settings to see how the optimal accuracy for the target application can be achieved. For example, in a classification problem with tabular data, we explore whether incorporating synthetic data alongside real data when training the model can improve the classification accuracy.
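The tabular comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's evaluation code: tables are represented as dictionaries mapping a column name to a list of numbers, and the helper names `feature_gaps` and `correlation_gap` are hypothetical. Per-feature distributions are summarized only by mean and standard deviation, and pairwise correlation is measured with the Pearson coefficient.

```python
from itertools import combinations
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant columns
    return cov / (var_x * var_y) ** 0.5


def feature_gaps(real, synthetic):
    """Absolute gap in mean and standard deviation for every feature."""
    return {
        name: {
            "mean_gap": abs(mean(real[name]) - mean(synthetic[name])),
            "std_gap": abs(pstdev(real[name]) - pstdev(synthetic[name])),
        }
        for name in real
    }


def correlation_gap(real, synthetic):
    """Mean absolute difference of pairwise correlations over all feature pairs."""
    pairs = list(combinations(sorted(real), 2))
    if not pairs:
        return 0.0
    diffs = [
        abs(pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b]))
        for a, b in pairs
    ]
    return sum(diffs) / len(diffs)
```

Values close to zero indicate that the synthetic table reproduces the marginal statistics and the correlation structure of the real table. Richer distribution comparisons, such as histogram-based distances, would follow the same pattern.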


Figure: Schematic overview of the evaluation of data generated by GANs. The target classifier is trained on an extended dataset that comprises synthetic data and original Train data. The classification accuracy on the Test data is then compared with Silver Standard.

Involved Team Members: Masoud Abedi, Lars Hempel, Sina Sadeghi, Toralf Kirsten

Relevant Publications

Abedi, M.; Hempel, L.; Sadeghi, S.; Kirsten, T. GAN-Based Approaches for Generating Structured Data in the Medical Domain. Appl. Sci. 2022, 12, 7075.

Intensive Care Unit Data Analytics


Intensive care units (ICUs), which provide comprehensive life-saving care for critically ill patients, face multiple challenges in day-to-day operations and management. One important example is the increasing demand for critical care for patients with severe conditions, which strains ICU capacity. This results, for example, in a lack of available beds for patients or excessive workloads for medical staff and hospital personnel, which leads to delays in ICU admission and ultimately to increased morbidity and mortality. The COVID-19 pandemic, beginning in early 2020, made this even more evident by creating an urgent need for space, supplies, and medical personnel and by placing significant strain on healthcare systems worldwide.

The rapid increase in critical care data volumes, driven by the digitization of healthcare in recent years, has created numerous opportunities to address these challenges. Data analytics has proven beneficial in various areas of medicine. Using advanced data science methods, we conduct research in several projects on available critical care data to extract valuable information and potentially improve the quality of care. The aim is to support optimal care for patients with critical illnesses and better planning of resources in the ICU.

Predictive Modeling of ICU Length of Stay

Patient length of stay in the ICU is an important process indicator of the quality of care in the ICU. While a longer ICU stay is associated with higher care costs and resource utilization, early ICU discharge can cause medical complications, increase the risk of readmission to the ICU, or even lead to a higher mortality rate. An accurate estimate of patient length of stay in the ICU helps healthcare management allocate appropriate resources and plan for the future.

The goal is to develop a predictive model for length of stay and readmission of patients admitted to the ICU. The model incorporates diagnostic data from the patient's initial conditions, observations, and medical measurements. It utilizes machine learning methods to predict patient length of stay in the ICU and estimate the likelihood of readmission to the ICU in the event of poor clinical care.

Involved Team Members: Lars Hempel, Ulrike Klotz, Sina Sadeghi, Toralf Kirsten
Cooperation Partners: Sven Bercker (Leipzig University Medical Center)

Heart Failure Predictions using NT-proBNP

Heart failure is a prevalent health problem associated with high morbidity and mortality and, consequently, rising healthcare costs. Predictive models for heart disease are therefore of great importance to the healthcare system, as they help physicians diagnose such life-threatening conditions at earlier stages and adapt their treatment accordingly. Such models estimate the likelihood of heart disease in individuals based on laboratory measurements as risk indicators as well as demographic data. They also provide insight to physicians and can suggest further measures for patients as needed, such as electrocardiography.

A correlation between NT-proBNP protein levels and heart failure and atrial fibrillation has been demonstrated in the literature. The goal of this project is to employ machine learning to model this correlation between the NT-proBNP protein levels and the likelihood of heart failure and atrial fibrillation.

Involved Team Members: Navid Shekarchizadeh, Masoud Abedi, Sina Sadeghi, Toralf Kirsten

Cooperation Partners: Samira Zeynalova (IMISE, Leipzig University), Frank Meineke (IMISE, Leipzig University), Markus Löffler (IMISE, Leipzig University)

Personal Health Train: Station Registry

The Personal Health Train (PHT) is a novel approach to performing distributed data analysis in the medical domain. It allows data owners to execute analysis tasks on-premise while keeping full control over the access to their data, following FAIR data sharing principles. These tasks are represented as “trains”, moving between “stations”, collecting analysis results and returning them to the researcher.
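The train and station metaphor can be illustrated with a toy simulation. The sketch below is only a conceptual model under stated assumptions and does not reflect the PADME implementation or its API: the classes `Station` and `MeanTrain`, the function `run`, and the record layout are hypothetical. The point it shows is that each station runs the analysis on its own records, and only aggregate values (a count and a sum) travel onward with the train.

```python
from dataclasses import dataclass, field


@dataclass
class MeanTrain:
    """Toy analysis task that accumulates a count and a sum for one feature."""
    feature: str
    count: int = 0
    total: float = 0.0
    route: list = field(default_factory=list)

    def visit(self, station_name, records):
        # Runs inside the station: reads local records, keeps only aggregates.
        values = [r[self.feature] for r in records if self.feature in r]
        self.count += len(values)
        self.total += sum(values)
        self.route.append(station_name)

    def result(self):
        """Mean over all visited stations, or None if no values were seen."""
        return self.total / self.count if self.count else None


@dataclass
class Station:
    """Toy data owner; its records are never handed out, only processed locally."""
    name: str
    records: list = field(default_factory=list, repr=False)

    def execute(self, train):
        train.visit(self.name, self.records)


def run(train, stations):
    """Send the train to each station in turn and return it to the researcher."""
    for station in stations:
        station.execute(train)
    return train


if __name__ == "__main__":
    stations = [
        Station("station-a", [{"x": 1.0}, {"x": 3.0}]),  # made-up toy values
        Station("station-b", [{"x": 5.0}]),
    ]
    train = run(MeanTrain(feature="x"), stations)
    print(train.route, train.result())
```

In this toy model the researcher receives the same mean as a pooled analysis would produce, although no individual record leaves its station. Real deployments add access control, auditing, and safeguards against results that leak individual-level information.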

We have been working closely with the PADME team at RWTH Aachen to develop the central Station Registry, which allows users of the PADME implementation of the PHT to create stations and projects and to start the onboarding process for new stations. We maintain the Station Registry and extend it with new features, including:


→Navigation between process steps

→Filtering of entity screens

→Overview of the entities selected in the navigation

→Adding, editing, and deleting entities

→Onboarding of new stations

Station Registry is the central access point to search and select locations where the Personal Health Train software (station) is installed and running.

Involved Team Members: Maximilian Jugl, Julian Müller, Toralf Kirsten

Cooperation Partners: Sascha Welten (RWTH Aachen), Yongli Mou (RWTH Aachen), Oya Beyan (University Hospital Cologne)
