This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluating clinical AI summaries with large language models as judges
Citations: 14
Authors: 17
Year: 2025
Abstract
Electronic Health Records (EHRs) contain vast clinical data that are difficult for providers to synthesize. Generative AI with Large Language Models (LLMs) can summarize records to reduce cognitive burden, but ensuring accuracy requires reliable evaluation. Human review is the gold standard but is costly and slow. To address this, we introduce and validate an automated LLM-based method to assess real-world EHR multi-document summaries. Benchmarking against the validated Provider Documentation Summarization Quality Instrument (PDSQI), our LLM-as-a-Judge framework demonstrated strong inter-rater reliability with human evaluators. GPT-o3-mini achieved an intraclass correlation coefficient of 0.818 (95% CI 0.772-0.854), a median score difference of 0 from humans, and completed evaluations in 22 seconds. Overall, reasoning models excelled in inter-rater reliability, particularly for evaluations requiring advanced reasoning and domain expertise, outperforming non-reasoning, task-trained, and multi-agent approaches. By automating high-quality evaluations, a medical LLM-as-a-Judge provides a scalable, efficient way to identify accurate, safe AI-generated clinical summaries.
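The benchmark statistic reported above, the intraclass correlation coefficient (ICC), measures agreement between raters, here between LLM judges and human evaluators. A minimal sketch of one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), is shown below; the example ratings matrix is illustrative, and the abstract does not specify which ICC form the authors used.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects, k_raters) array, e.g. summaries scored
    by several judges on the same scale.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between-subject
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between-rater
    ss_err = ss_total - ss_rows - ss_cols           # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative data: 6 subjects rated by 4 judges (Shrout & Fleiss example).
data = np.array([[9, 2, 5, 8],
                 [6, 1, 3, 2],
                 [8, 4, 6, 8],
                 [7, 1, 2, 6],
                 [10, 5, 6, 9],
                 [6, 2, 4, 7]], dtype=float)
print(round(icc2_1(data), 2))  # → 0.29
```

An ICC of 0.818, as reported for GPT-o3-mini, indicates substantially stronger agreement with human raters than this toy example.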
Related Works
"Why Should I Trust You?"
2016 · 14.396 Zit.
A Comprehensive Survey on Graph Neural Networks
2020 · 8.729 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.270 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7.702 Zit.
Artificial intelligence in healthcare: past, present and future
2017 · 4.437 Zit.