This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Across generations, sizes, and types, large language models poorly report self-confidence in gastroenterology clinical reasoning tasks
Citations: 0
Authors: 8
Year: 2026
Abstract
This study evaluated confidence calibration across 48 large language models (LLMs) using 300 gastroenterology board exam-style multiple-choice questions. Regardless of accuracy, all models demonstrated poor self-estimation of certainty. Even the best-calibrated systems (o1-preview, GPT-4o, Claude-3.5-Sonnet) exhibited substantial overconfidence (Brier scores 0.15–0.2, AUROC ≈ 0.6). Models maintained high confidence regardless of question difficulty or response correctness. In their current form, LLMs cannot be relied upon to communicate uncertainty, and human oversight remains essential for safe use.
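For readers unfamiliar with the calibration metrics cited in the abstract, the following minimal sketch (not the authors' code) shows how a Brier score and AUROC could be computed from per-question correctness and self-reported confidence, assuming scikit-learn is available; the data values and variable names are purely illustrative.

    from sklearn.metrics import brier_score_loss, roc_auc_score

    # correct[i] = 1 if the model answered question i correctly, else 0
    correct = [1, 1, 0, 1, 0, 0, 1, 1]
    # confidence[i] = the model's self-reported probability of being correct (0..1)
    confidence = [0.95, 0.90, 0.85, 0.80, 0.90, 0.70, 0.60, 0.99]

    brier = brier_score_loss(correct, confidence)  # lower is better; 0 = perfect calibration
    auroc = roc_auc_score(correct, confidence)     # 0.5 = confidence does not discriminate right from wrong

    print(f"Brier score: {brier:.3f}, AUROC: {auroc:.3f}")

Under this reading, a Brier score of 0.15–0.2 with AUROC near 0.6, as reported in the abstract, indicates confidence estimates that are both poorly calibrated and only weakly predictive of correctness.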
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,418 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,288 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,726 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,516 citations