OpenAlex · Updated hourly · Last updated: 16.05.2026, 11:58

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Same Verdict, Different Reasons: LLM-as-a-Judge and Clinician Disagreement on Medical Chatbot Completeness

2026 · 0 citations · arXiv (Cornell University) · Open Access
Open full text at the publisher

Citations: 0
Authors: 6
Year: 2026

Abstract

LLM-as-a-Judge frameworks are increasingly trusted to automate evaluation in place of human experts, yet their reliability in high-stakes medical contexts remains unproven. We stress-test this assumption for detecting incomplete patient-facing medical responses, evaluating three rubric granularities (General-Likert, Analytical-Rubric, Dynamic-Checklist) and three backbone models across two clinician-annotated datasets, including HealthBench, the largest publicly available benchmark for medical response evaluation. LLM Judges discriminate complete from incomplete responses at or only slightly above chance (AUC $0.49$--$0.66$); at the threshold required to recall $90\%$ of incomplete responses, clinicians must still review the vast majority of the dataset, offering no triage utility. Even when model and clinician verdicts agree, they rarely cite the same explanation; when they diverge, false positives stem from over-flagging non-essential gaps, while false negatives reflect outright detection failures. These results reveal that LLM Judges and clinicians apply fundamentally different completeness standards, a finding that undermines their use as autonomous evaluators or triage filters in clinical settings.
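The two headline numbers in the abstract, the AUC range and the review burden at 90% recall, follow from standard ROC analysis. The minimal sketch below (not the authors' code; it assumes scikit-learn, and the scores, labels, and variable names are illustrative placeholders, not data from the paper) shows how both could be computed for a judge that assigns each response an "incompleteness" score.

# Minimal sketch, assuming scikit-learn; scores and labels are illustrative
# placeholders, not data from the paper.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# 1 = incomplete response (the class a judge should flag), 0 = complete.
labels = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
# Hypothetical "incompleteness" scores produced by an LLM judge.
scores = np.array([0.62, 0.55, 0.48, 0.40, 0.70, 0.51, 0.45, 0.58, 0.66, 0.35])

# Discrimination ability (the paper reports AUC 0.49-0.66 for real judges).
auc = roc_auc_score(labels, scores)

# Lowest threshold whose recall on incomplete responses reaches 90%, then the
# share of all responses flagged for clinician review at that operating point.
fpr, tpr, thresholds = roc_curve(labels, scores)
idx = int(np.argmax(tpr >= 0.9))          # first ROC point reaching 90% recall
review_fraction = (scores >= thresholds[idx]).mean()

print(f"AUC = {auc:.2f}, fraction to review at 90% recall = {review_fraction:.0%}")

For a weak judge, the threshold that reaches 90% recall ends up flagging most of the dataset, which is the triage failure the abstract describes.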

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling