This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Benchmarking Clinical Reasoning in Large Language Models: A Comparative Assessment Study
Citations: 0 · Authors: 2 · Year: 2026
Abstract
Evaluation of Large Language Models (LLMs) and their clinical competence has mainly focused on conventional multiple-choice (MCQ) medical question-answering exams, yielding benchmarks such as MedQA-USMLE, on which models have already exceeded expert-level performance. However, alternative assessment methods have recently been proposed, such as SCT-Bench, based on Script Concordance Testing (SCT), which evaluates clinical reasoning and probabilistic thinking under uncertainty. Reasoning-optimized models have unexpectedly scored worse on SCT-Bench despite outperforming non-reasoning models on other medical benchmarks. This study compared performance metrics, uncertainty proxies, and clinical reasoning quality between MedQA-USMLE and the public subset of SCT-Bench using instruction-tuned GPT-4.1, contrasting baseline and Chain-of-Thought (CoT) prompting across sampled responses. The CoT prompts explicitly instructed the model to apply cognitive clinical reasoning strategies, and their use was then evaluated across both benchmark formats. CoT prompting improved MedQA performance from 86.4% to 93.0%, while the SCT-Bench score showed a non-significant decline from 77.7% to 74.7%. Under CoT, GPT-4.1 systematically overestimated the impact of new information, leading to overconfidence and more extreme ratings on SCT questions. Sample-based majority voting significantly improved MedQA scores under CoT but had no meaningful effect on SCT-Bench. Response entropy analysis showed that CoT increased overall answer variability while simultaneously clustering correct responses on MedQA, an effect absent on SCT-Bench. Calibration and ROC were substantially poorer on SCT-Bench than on MedQA, though CoT improved both metrics on each benchmark.
Qualitative analysis confirmed GPT-4.1 could apply situation-appropriate reasoning strategies and showed signs of metacognitive awareness about its own reasoning process, with expert rating patterns suggesting possible alignment with expert-like logic. These findings further corroborate limitations in elicited clinical reasoning for SCT-based benchmarking and suggest that reasoning-aware evaluation frameworks could contribute meaningfully to the medical AI benchmark landscape.
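The sample-based majority voting and response-entropy analyses mentioned in the abstract can be illustrated with a minimal sketch (the function names and the five-sample example below are illustrative assumptions, not the study's actual implementation):

```python
from collections import Counter
import math

def majority_vote(answers):
    """Return the most frequent answer among sampled model responses."""
    return Counter(answers).most_common(1)[0][0]

def response_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution.

    Low entropy means the sampled responses cluster on few options;
    high entropy means the model's answers are highly variable.
    """
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical example: five sampled answers to one MCQ item.
samples = ["B", "B", "A", "B", "C"]
print(majority_vote(samples))              # "B"
print(round(response_entropy(samples), 2)) # 1.37
```

In this framing, CoT "clustering correct responses on MedQA" would appear as lower entropy among correct-answer samples, while the absence of that effect on SCT-Bench would leave entropy high regardless of correctness.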
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations