This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluating clinical AI summaries with large language models as judges
Citations: 14
Authors: 17
Year: 2025
Abstract
Electronic Health Records (EHRs) contain vast clinical data that are difficult for providers to synthesize. Generative AI with Large Language Models (LLMs) can summarize records to reduce cognitive burden, but ensuring accuracy requires reliable evaluation. Human review is the gold standard but is costly and slow. To address this, we introduce and validate an automated LLM-based method to assess real-world EHR multi-document summaries. Benchmarking against the validated Provider Documentation Summarization Quality Instrument (PDSQI), our LLM-as-a-Judge framework demonstrated strong inter-rater reliability with human evaluators. GPT-o3-mini achieved an intraclass correlation coefficient of 0.818 (95% CI 0.772-0.854), a median score difference of 0 from humans, and completed evaluations in 22 seconds. Overall, reasoning models excelled in inter-rater reliability, particularly for evaluations requiring advanced reasoning and domain expertise, outperforming non-reasoning, task-trained, and multi-agent approaches. By automating high-quality evaluations, a medical LLM-as-a-Judge provides a scalable, efficient way to identify accurate, safe AI-generated clinical summaries.
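The benchmark statistic reported above, the intraclass correlation coefficient (ICC), measures agreement between raters, here between LLM judges and human evaluators. A minimal sketch of one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), is shown below; the example ratings matrix is illustrative, and the abstract does not specify which ICC form the authors used.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects, k_raters) array, e.g. summaries scored
    by several judges on the same scale.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between-subject
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between-rater
    ss_err = ss_total - ss_rows - ss_cols           # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative data: 6 subjects rated by 4 judges (Shrout & Fleiss example).
data = np.array([[9, 2, 5, 8],
                 [6, 1, 3, 2],
                 [8, 4, 6, 8],
                 [7, 1, 2, 6],
                 [10, 5, 6, 9],
                 [6, 2, 4, 7]], dtype=float)
print(round(icc2_1(data), 2))  # → 0.29
```

An ICC of 0.818, as reported for GPT-o3-mini, indicates substantially stronger agreement with human raters than this toy example.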
Related Works
"Why Should I Trust You?"
2016 · 14.396 Zit.
A Comprehensive Survey on Graph Neural Networks
2020 · 8.729 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.270 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7.702 Zit.
Artificial intelligence in healthcare: past, present and future
2017 · 4.437 Zit.