This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Measuring the Quality of AI-Generated Clinical Notes: A Systematic Review and Experimental Benchmark of Evaluation Methods
Citations: 1
Authors: 9
Year: 2025
Abstract
High-quality clinical documentation is essential for safe and effective care, yet its production remains time consuming and prone to error. Large language models (LLMs) have shown potential for supporting clinical note generation, but their clinical adoption depends on how the quality of generated text is assessed, and current evaluation practices vary widely. We systematically searched Ovid Medline and Scopus on 10 April 2025 for peer-reviewed studies that used LLMs to generate clinical notes and reported an evaluation of text quality. Screening followed PRISMA guidelines, and the protocol was preregistered in PROSPERO. Evaluation metrics and outcome measures were synthesised narratively. Informed by these findings, we designed a controlled experimental setup using five synthetic clinical cases, with targeted perturbations to examine the behaviour of commonly used automated metrics and LLM-based evaluators. Thirty-seven studies met the inclusion criteria. Reported evaluations were dominated by lexical overlap metrics, primarily ROUGE and BLEU, whereas semantic similarity metrics such as BERTScore and BLEURT were less frequent. Human evaluation was common but heterogeneous in criteria and reporting, most often addressing correctness, fluency, and clinical acceptability. In the experimental analysis, lexical overlap metrics detected deletions and factual modifications but penalised meaning-preserving paraphrases. Semantic metrics and LLM-based evaluators were more tolerant of paraphrasing while remaining sensitive to clinically relevant changes, with performance varying by model and language. We conclude that lexical overlap metrics are insufficient as standalone proxies for clinical text quality and recommend a layered evaluation strategy combining semantic metrics, LLM-as-evaluator, and targeted human review to support scalable assessment.
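To illustrate the abstract's central finding, here is a minimal sketch of ROUGE-1 F1 (unigram overlap, implemented from scratch rather than via a library) applied to hypothetical clinical sentences. The example texts are illustrative, not drawn from the study's synthetic cases. A meaning-preserving paraphrase scores lower than a note with a clinically relevant omission, because the metric rewards shared surface tokens rather than shared meaning.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped unigram overlap (min count per shared token).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference  = "patient denies chest pain and shortness of breath"
paraphrase = "the patient reports no chest pain or dyspnea"  # same meaning, different words
deletion   = "patient denies chest pain"                     # clinically relevant omission

print(round(rouge1_f1(reference, paraphrase), 2))  # → 0.38
print(round(rouge1_f1(reference, deletion), 2))    # → 0.67
```

The omission outscores the faithful paraphrase, which is exactly the failure mode that motivates the paper's recommendation to combine semantic metrics, LLM-based evaluators, and targeted human review rather than relying on lexical overlap alone.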
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,439 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,315 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,756 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,526 citations