This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Benchmarking Clinical Reasoning in Large Language Models: A Comparative Assessment Study
Citations: 0 · Authors: 2 · Year: 2026
Abstract
Evaluation of Large Language Models (LLMs) and their clinical competence has mainly focused on conventional multiple-choice (MCQ) medical question-answering exams, yielding benchmarks such as MedQA-USMLE, on which models have already exceeded expert-level performance. However, alternative assessment methods have recently been proposed, such as SCT-Bench, based on Script Concordance Testing (SCT), which evaluates clinical reasoning and probabilistic thinking under uncertainty. Reasoning-optimized models have unexpectedly scored worse on SCT-Bench despite outperforming non-reasoning models on other medical benchmarks. This study compared performance metrics, uncertainty proxies, and clinical reasoning quality between MedQA-USMLE and the public subset of SCT-Bench using instruction-tuned GPT-4.1, contrasting baseline and Chain-of-Thought (CoT) prompting across sampled responses. The CoT prompts explicitly instructed the model to apply cognitive clinical reasoning strategies, and their use was then evaluated across both benchmark formats. CoT prompting improved MedQA performance from 86.4% to 93.0%, while the SCT-Bench score showed a non-significant decline from 77.7% to 74.7%. Under CoT, GPT-4.1 systematically overestimated the impact of new information, leading to overconfidence and more extreme ratings on SCT questions. Sample-based majority voting significantly improved MedQA scores under CoT but had no meaningful effect on SCT-Bench. Response entropy analysis showed that CoT increased overall answer variability while simultaneously clustering correct responses on MedQA, an effect absent on SCT-Bench. Calibration and ROC were substantially poorer on SCT-Bench than on MedQA, though CoT improved both metrics on each benchmark.
Qualitative analysis confirmed GPT-4.1 could apply situation-appropriate reasoning strategies and showed signs of metacognitive awareness about its own reasoning process, with expert rating patterns suggesting possible alignment with expert-like logic. These findings further corroborate limitations in elicited clinical reasoning for SCT-based benchmarking and suggest that reasoning-aware evaluation frameworks could contribute meaningfully to the medical AI benchmark landscape.
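The sample-based majority voting and response-entropy analyses mentioned in the abstract can be illustrated with a minimal sketch (the function names and the five-sample example below are illustrative assumptions, not the study's actual implementation):

```python
from collections import Counter
import math

def majority_vote(answers):
    """Return the most frequent answer among sampled model responses."""
    return Counter(answers).most_common(1)[0][0]

def response_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution.

    Low entropy means the sampled responses cluster on few options;
    high entropy means the model's answers are highly variable.
    """
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical example: five sampled answers to one MCQ item.
samples = ["B", "B", "A", "B", "C"]
print(majority_vote(samples))              # "B"
print(round(response_entropy(samples), 2)) # 1.37
```

In this framing, CoT "clustering correct responses on MedQA" would appear as lower entropy among correct-answer samples, while the absence of that effect on SCT-Bench would leave entropy high regardless of correctness.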
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations