OpenAlex · Updated hourly · Last updated: 13 Apr 2026, 10:49

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Measuring the Quality of AI-Generated Clinical Notes: A Systematic Review and Experimental Benchmark of Evaluation Methods

2025 · 1 citation · Artificial Intelligence in Medicine · Open Access

Citations: 1
Authors: 9
Year: 2025

Abstract

High-quality clinical documentation is essential for safe and effective care, yet its production remains time consuming and prone to error. Large language models (LLMs) have shown potential for supporting clinical note generation, but their clinical adoption depends on how the quality of generated text is assessed, and current evaluation practices vary widely. We systematically searched Ovid Medline and Scopus on 10 April 2025 for peer-reviewed studies that used LLMs to generate clinical notes and reported an evaluation of text quality. Screening followed PRISMA guidelines, and the protocol was preregistered in PROSPERO. Evaluation metrics and outcome measures were synthesised narratively. Informed by these findings, we designed a controlled experimental setup using five synthetic clinical cases, with targeted perturbations to examine the behaviour of commonly used automated metrics and LLM-based evaluators. Thirty-seven studies met the inclusion criteria. Reported evaluations were dominated by lexical overlap metrics, primarily ROUGE and BLEU, whereas semantic similarity metrics such as BERTScore and BLEURT were less frequent. Human evaluation was common but heterogeneous in criteria and reporting, most often addressing correctness, fluency, and clinical acceptability. In the experimental analysis, lexical overlap metrics detected deletions and factual modifications but penalised meaning-preserving paraphrases. Semantic metrics and LLM-based evaluators were more tolerant of paraphrasing while remaining sensitive to clinically relevant changes, with performance varying by model and language. We conclude that lexical overlap metrics are insufficient as standalone proxies for clinical text quality and recommend a layered evaluation strategy combining semantic metrics, LLM-as-evaluator, and targeted human review to support scalable assessment.
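The contrast the abstract draws between lexical overlap metrics and semantic metrics can be illustrated with a minimal sketch. The snippet below is not the authors' code: it assumes the rouge-score and bert-score Python packages and uses invented example sentences, comparing ROUGE-L and BERTScore on a meaning-preserving paraphrase versus a clinically relevant factual change.

```python
# Minimal sketch (illustrative only, not the study's evaluation pipeline):
# compare a lexical overlap metric (ROUGE-L) with a semantic metric (BERTScore)
# on two perturbations of a reference clinical sentence.
from rouge_score import rouge_scorer          # pip install rouge-score
from bert_score import score as bert_score    # pip install bert-score

reference = "The patient was started on 5 mg amlodipine daily for hypertension."
# Meaning-preserving paraphrase of the reference.
paraphrase = "Amlodipine 5 mg once a day was initiated for the patient's high blood pressure."
# Clinically relevant factual modification (dosage changed).
factual_change = "The patient was started on 50 mg amlodipine daily for hypertension."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

for label, candidate in [("paraphrase", paraphrase), ("factual change", factual_change)]:
    rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure
    _, _, f1 = bert_score([candidate], [reference], lang="en", verbose=False)
    print(f"{label}: ROUGE-L F1 = {rouge_l:.2f}, BERTScore F1 = {f1.item():.2f}")

# Expected pattern, consistent with the review's findings: ROUGE-L penalises the
# paraphrase heavily despite unchanged meaning, while BERTScore rates it much
# higher; neither score alone reliably flags the dosage error, which motivates
# the layered strategy of semantic metrics, LLM-as-evaluator, and human review.
```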

Related works