Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Should we synthesize more than we need: impact of synthetic data generation for high-dimensional cross-sectional medical data

2025·1 Zitationen·Journal of the American Medical Informatics AssociationOpen Access

Volltext beim Verlag öffnen

Zitationen

Autoren

2025

Jahr

Abstract

OBJECTIVE: In medical research and education, generative artificial intelligence/machine learning (AI/ML) models to synthesize artificial medical data can enable the sharing of high-quality data while preserving the privacy of patients. Given that such data is often high-dimensional, a relevant consideration is whether to synthesize the entire dataset when only a task-relevant subset is needed. This study evaluates how the number of variables in training impacts fidelity, utility, and privacy of the synthetic data (SD). MATERIAL AND METHODS: We used 12 cross-sectional medical datasets, defined a downstream task with corresponding core variables, and derived 6354 variants by adding adjunct variables to the core. SD was generated using 7 different generative models and evaluated for fidelity, downstream utility, and privacy. Mixed-effect models were used to assess the effect of adjunct variables on the respective evaluation metric, accounting for the medical dataset as a random component. RESULTS: Fidelity was unaffected by the number of adjunct variables in 5/7 SDG models. Similarly, downstream utility remained stable in 6/7 (predictive task) and 5/7 (inferential task) SDG models. Where significant effects were observed, they were minimal, resulting, for example, in a 0.05 decrease in Area under the Receiver Operating Characteristic curve (AUROC) when adding 120 variables. Privacy was not impacted by the number of adjunct variables. DISCUSSION: Our findings show that fidelity, utility, and privacy are preserved when generating a more comprehensive medical dataset than the task-relevant subset. CONCLUSION: Our findings support a cost-effective, utility, and privacy-preserving way of implementing SDG into medical research and education.

Autoren

Institutionen

Themen

Privacy-Preserving Technologies in DataArtificial Intelligence in Healthcare and EducationMachine Learning in Healthcare

Volltext beim Verlag öffnen

Should we synthesize more than we need: impact of synthetic data generation for high-dimensional cross-sectional medical data

Abstract

Ähnliche Arbeiten

Autoren

Institutionen

Themen