Synthetic Data
MYO Health - Maria Antonaki, Myrsini Ouzounelli
In today’s data-driven world, synthetic data has emerged as a pivotal tool for advancing research, facilitating product development and underpinning business models that rely on large datasets. The approach has gained particular attention in the healthcare industry, where privacy and accessibility are paramount. But what exactly is synthetic data, and why is it hailed as a game-changer compared to real-world data (RWD), the lifeblood of modern healthcare?
What is Synthetic Data?
Synthetic data refers to artificially generated datasets that are created by computer algorithms and mimic real-world datasets in their statistical properties, relationships and distributions.(1,2) It offers an attractive alternative that addresses privacy concerns, streamlines data utility agreements, protocol submissions and ethics review approvals, and decreases costs.(3) The main types of synthetic data in clinical settings are tabular, time-series and text-based data; additional categories include synthetic images, video and audio simulation.(3) These replicas can be used in lieu of RWD for large-scale hypothesis generation and testing, and for training or validating machine learning (ML) models to make predictive models more robust.
Techniques for Generating Synthetic Data
The fidelity of synthetic data depends on the methodology employed and on its intended application.(4) There are three techniques for generating synthetic data. The first leverages the statistical properties of real data, including population distributions. The second starts from the source data, which is then manually obfuscated and manipulated so that the relationships among data points are closely preserved while the identities of individuals and other sensitive information are safeguarded. The third, a model-based approach, uses machine learning techniques such as generative adversarial networks (GANs) or other neural networks to learn how the data points relate to each other.(4)
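To make the first, statistics-based technique concrete, the following is a minimal Python sketch rather than a production method. It assumes a tiny invented table of patient records and hypothetical helper names (`fit_marginals`, `sample_synthetic`). Each column is summarised on its own (mean and standard deviation for numeric fields, observed frequencies for categorical fields) and new rows are drawn from those summaries. Because columns are sampled independently, relationships between variables are not preserved; a real generator would also need to model those relationships.

```python
import random
import statistics


def fit_marginals(rows, numeric_cols, categorical_cols):
    """Summarise each column independently (hypothetical helper)."""
    model = {"numeric": {}, "categorical": {}}
    for col in numeric_cols:
        values = [row[col] for row in rows]
        # Assumption: a normal distribution is an adequate summary.
        model["numeric"][col] = (statistics.mean(values), statistics.stdev(values))
    for col in categorical_cols:
        values = [row[col] for row in rows]
        categories = sorted(set(values))
        weights = [values.count(c) for c in categories]
        model["categorical"][col] = (categories, weights)
    return model


def sample_synthetic(model, n, seed=0):
    """Draw n synthetic rows from the fitted per-column summaries."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for col, (mu, sigma) in model["numeric"].items():
            row[col] = rng.gauss(mu, sigma)
        for col, (categories, weights) in model["categorical"].items():
            row[col] = rng.choices(categories, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic


# Invented toy records, for illustration only.
real_rows = [
    {"age": 34, "systolic_bp": 118, "smoker": "no"},
    {"age": 51, "systolic_bp": 131, "smoker": "yes"},
    {"age": 47, "systolic_bp": 125, "smoker": "no"},
    {"age": 62, "systolic_bp": 142, "smoker": "yes"},
    {"age": 29, "systolic_bp": 110, "smoker": "no"},
    {"age": 58, "systolic_bp": 137, "smoker": "no"},
    {"age": 40, "systolic_bp": 121, "smoker": "yes"},
    {"age": 73, "systolic_bp": 150, "smoker": "no"},
]

model = fit_marginals(real_rows, ["age", "systolic_bp"], ["smoker"])
synthetic_rows = sample_synthetic(model, n=1000, seed=42)
```

None of the synthetic rows corresponds to a source record, yet the averages and category proportions should sit close to those of the source table. This is the sense in which statistical fidelity is usually judged.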
Synthetic data can be categorized into three distinct types, each tailored to specific purposes: fully synthetic data, which contains no real information but mirrors the characteristics of real-world data; partially synthetic data, in which sensitive details are replaced with synthetic values to protect privacy while preserving data accuracy; and hybrid synthetic data, which combines genuine and synthetic data to strike a balance between privacy protection and data usefulness.
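As an illustration of the partially synthetic type, the sketch below replaces only the fields marked as sensitive and leaves every other field untouched. The function name `partially_synthesize`, the column names and the records are all invented for this example. The replacement values are simply drawn from each sensitive column's observed frequency distribution, which stands in for a proper generator and offers no formal privacy guarantee.

```python
import random


def partially_synthesize(rows, sensitive_cols, seed=0):
    """Return copies of rows with sensitive columns replaced (hypothetical helper)."""
    rng = random.Random(seed)
    # Observed values per sensitive column; sampling from this pool
    # reproduces the column's frequency distribution.
    pools = {col: [row[col] for row in rows] for col in sensitive_cols}
    released = []
    for row in rows:
        new_row = dict(row)  # copy, so the source records stay unchanged
        for col in sensitive_cols:
            new_row[col] = rng.choice(pools[col])
        released.append(new_row)
    return released


# Invented toy records, for illustration only.
records = [
    {"postcode": "A1", "diagnosis": "asthma", "hba1c": 5.4},
    {"postcode": "B2", "diagnosis": "type 2 diabetes", "hba1c": 7.9},
    {"postcode": "A1", "diagnosis": "hypertension", "hba1c": 5.8},
    {"postcode": "C3", "diagnosis": "type 2 diabetes", "hba1c": 8.3},
]

released = partially_synthesize(records, sensitive_cols=["postcode"], seed=7)
```

The clinical fields in `released` match the source row for row, while the link between each patient and their postcode is broken. A fully synthetic dataset would instead regenerate every column.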
The Impact on Healthcare
The main contribution of synthetic data to healthcare lies in its ability to overcome the hurdles posed by RWD.(2) Obtaining large amounts of real-world data can be costly and time-consuming. Limited access to RWD and restricted data sharing slow scientific progress, delay the delivery of crucial information to regulatory agencies and postpone benefits to patient care.(2) These challenges are prompting regulators to seek solutions in an evolving healthcare landscape.(2)
The Regulatory Framework and European Initiatives
Projections suggest that synthetic data may surpass real-source data usage by 2030. With fast-growing interest in synthetic data within the healthcare industry, there is a need for clear regulatory guidelines or policies governing its generation and use. Current data protection regulations, such as the General Data Protection Regulation (GDPR) in Europe and the Health Insurance Portability and Accountability Act (HIPAA) in the US, address only some of the issues raised by the use of synthetic data.(5,6) Several European institutions and regulatory bodies, such as the European Medicines Agency (EMA), as well as key stakeholders, are actively exploring the uses of synthetic data. Through Horizon Europe, the EU funds research to assess and develop methods and standards for the effective use of synthetic data in regulatory decision-making and health technology assessment (HTA).(2) Other ongoing synthetic data initiatives include those of the Clinical Practice Research Datalink (CPRD) in the United Kingdom, the Charité Lab for Artificial Intelligence in Medicine (CLAIM) in Germany and the Common Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA) project. These efforts showcase Europe’s diverse attempts to harness synthetic data for various purposes.(2)
The Growing Potential of Synthetic Data
The use of synthetic data is growing fast. A recent narrative review identified several types of use cases of synthetic data in health care: a) simulation and prediction research, b) hypothesis, methods, and algorithm testing, c) epidemiology/public health research, d) health IT development, e) education and training, f) public release of datasets, and g) linking data. The review also identified readily and publicly accessible healthcare datasets, databases, and sandboxes containing synthetic data with varying degrees of utility for research, education, and software development. It provided ample evidence that synthetic data is helpful across a wide spectrum of health care research and is uniquely positioned to bridge data access gaps in research and evidence-based policymaking.(7) The potential applications range from drug discovery to patient care. For example, synthetic data could be used to estimate the benefit of screening and healthcare policies, treatments, or clinical interventions; to augment machine learning algorithms (e.g., image classification pipelines); to pre-train machine learning models that can then be fine-tuned for specific patient populations; and to improve public health models that predict outbreaks of infectious diseases.(8) Overall, synthetic data promises enhanced research capabilities and cost-effective solutions for healthcare.
1. Myles P, Ordish J, Branson R. Synthetic data and the innovation, assessment, and regulation of AI medical devices. RF Q. 2021;1:48-53
2. Alloza, C., Knox, B., Raad, H., Aguilà, M., Coakley, C., Mohrova, Z., Boin, É., Bénard, M., Davies, J., Jacquot, E., Lecomte, C., Fabre, A. and Batech, M. (2023), A Case for Synthetic Data in Regulatory Decision-Making in Europe. Clin Pharmacol Ther, 114: 795-801. https://doi.org/10.1002/cpt.3001
3. Giuffrè, M., Shung, D.L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit. Med. 6, 186 (2023). https://doi.org/10.1038/s41746-023-00927-3
4. PHG Foundation. Are synthetic health data ‘personal data’? Available at: https://www.phgfoundation.org/publications/reports/are-synthetic-health-data-personal-data/ (last accessed on May 05, 2024)
5. Arora, A. & Arora, A. Synthetic patient data in health care: a widening legal loophole. Lancet 399(Apr), 1601–1602 (2022).
6. Appenzeller, A., Leitner, M., Philipp, P., Krempel, E. & Beyerer, J. Privacy and Utility of Private Synthetic Data for Medical Data Analyses. Appl. Sci. 12, 12320 (2022).
7. Gonzales A, Guruswamy G, Smith SR. Synthetic data in health care: A narrative review. PLOS Digit Health. 2023 Jan 6;2(1):e0000082. doi: 10.1371/journal.pdig.0000082. PMID: 36812604; PMCID: PMC9931305.
8. Giuffrè, M., Shung, D.L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit. Med. 6, 186 (2023). https://doi.org/10.1038/s41746-023-00927-3