This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Should we synthesize more than we need: impact of synthetic data generation for high-dimensional cross-sectional medical data
Citations: 1
Authors: 4
Year: 2025
Abstract
OBJECTIVE: In medical research and education, generative artificial intelligence/machine learning (AI/ML) models that synthesize artificial medical data can enable the sharing of high-quality data while preserving patient privacy. Because such data is often high-dimensional, a relevant consideration is whether to synthesize the entire dataset when only a task-relevant subset is needed. This study evaluates how the number of variables used in training affects the fidelity, utility, and privacy of the synthetic data (SD).
MATERIAL AND METHODS: We used 12 cross-sectional medical datasets, defined a downstream task with corresponding core variables for each, and derived 6354 variants by adding adjunct variables to the core. SD was generated using 7 different generative models and evaluated for fidelity, downstream utility, and privacy. Mixed-effect models were used to assess the effect of adjunct variables on the respective evaluation metric, accounting for the medical dataset as a random component.
RESULTS: Fidelity was unaffected by the number of adjunct variables in 5/7 synthetic data generation (SDG) models. Similarly, downstream utility remained stable in 6/7 (predictive task) and 5/7 (inferential task) SDG models. Where significant effects were observed, they were minimal, resulting, for example, in a 0.05 decrease in area under the receiver operating characteristic curve (AUROC) when adding 120 variables. Privacy was not impacted by the number of adjunct variables.
DISCUSSION: Our findings show that fidelity, utility, and privacy are preserved when generating a more comprehensive medical dataset than the task-relevant subset.
CONCLUSION: Our findings support a cost-effective, utility-preserving, and privacy-preserving way of implementing SDG in medical research and education.
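The variant derivation described under Material and Methods (core variables plus a growing number of adjunct variables) can be sketched roughly as below. This is only an illustration under stated assumptions: the abstract does not say how adjunct variables were selected or in what increments, so the random sampling scheme, the `step` and `seed` parameters, the function name `derive_variants`, and all variable names are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def derive_variants(
    core: Sequence[str],
    adjunct: Sequence[str],
    step: int = 1,
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Return one variable list per adjunct count k = 0, step, 2*step, ...

    Every variant keeps all core variables and adds k adjunct variables
    drawn at random without replacement. The sampling scheme is an
    assumption made for illustration, not the procedure from the paper.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    rng = random.Random(seed)  # fixed seed so the variants are reproducible
    variants: Dict[int, List[str]] = {}
    for k in range(0, len(adjunct) + 1, step):
        variants[k] = list(core) + rng.sample(list(adjunct), k)
    return variants


# Hypothetical variable names, chosen only to make the sketch runnable.
core_vars = ["age", "sex", "outcome"]
adjunct_vars = ["bmi", "smoker", "hba1c", "ldl"]
variants = derive_variants(core_vars, adjunct_vars, step=2)

for k, columns in variants.items():
    print(k, len(columns))
```

Each resulting column list would then be used to subset the real dataset before training a generative model, so that the number of adjunct variables becomes the quantity whose effect on fidelity, utility, and privacy is assessed.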
Related works
k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY
2002 · 8,452 citations
Calibrating Noise to Sensitivity in Private Data Analysis
2006 · 6,971 citations
Deep Learning with Differential Privacy
2016 · 5,765 citations
Federated Machine Learning
2019 · 5,741 citations
Communication-Efficient Learning of Deep Networks from Decentralized Data
2016 · 5,614 citations