
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Reasoning‐optimised large language models reach near‐expert accuracy on board‐style orthopaedic exams: A multi‐model comparison on 702 multiple‐choice questions

2025 · 1 citation · Knee Surgery, Sports Traumatology, Arthroscopy · Open Access

Citations: 1 · Authors: 6 · Year: 2025

Abstract

PURPOSE: To compare the accuracy, calibration, reproducibility and operating cost of seven large language models (LLMs), including four newer models capable of using advanced reasoning techniques to analyse complex medical information and generate accurate responses, on text-only orthopaedic multiple-choice questions (MCQs), and to quantify gains over GPT-4.

METHODS: From Orthobullets, 702 unique, non-image MCQs (drawn from the AAOS Self-Assessment Examinations, Self-Assessment-Based Questions and Orthopaedic In-Training Examination-Based Questions banks) were extracted. Each question was submitted to the following LLMs: OpenAI o3, Anthropic Claude Sonnet 4, Claude Opus 4 (with/without 'Extended Thinking') and Google Gemini 2.5 Pro. Additionally, OpenAI's GPT-4, GPT-4o and the open-weight Gemma 3 27B served as comparators. The primary outcome was overall accuracy. Secondary outcomes were topic- and difficulty-stratified accuracy, calibration (expected calibration error [ECE] and Brier score), reproducibility (flip rate on a retest question subset), latency, token use and cost. Statistical tests included paired McNemar tests, Cochran's Q, ordinal logistic regression and Fleiss' κ (Bonferroni-adjusted α = 0.05).

RESULTS: OpenAI o3 led with 93.6% accuracy (95% CI = 91.5-95.2), which represents a 34% relative error reduction. Accuracy tended to decline with question difficulty, yet the reasoning advantage persisted in every difficulty stratum. Claude Opus 4 showed the best calibration (ECE = 0.023), while GPT-4 exhibited overconfidence (ECE = 0.215). All models except Gemma 3 27B exhibited non-zero flip rates. Median query time ranged from 0.9 s (Gemma 3 27B) to 15.9 s (Gemini 2.5 Pro), and cost from 0 to 29.9 USD per 1000 queries.

CONCLUSIONS: Reasoning-optimised LLMs now answer text-based orthopaedic exam questions with high accuracy and substantially better confidence calibration than earlier models. However, persistent stochasticity and large latency-cost disparities may limit clinical deployment.

LEVEL OF EVIDENCE: N/A.
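For readers unfamiliar with the secondary metrics named in the abstract, the following minimal Python sketch shows the standard definitions of the Brier score, expected calibration error (ECE) and retest flip rate. It is an illustration only: the paper does not publish its evaluation code, the equal-width 10-bin ECE scheme is an assumption, and all function names and data below are hypothetical.

```python
import numpy as np

def brier_score(conf, correct):
    """Mean squared gap between stated confidence (0-1) and the 0/1 outcome."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    return float(np.mean((conf - correct) ** 2))

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted mean |confidence - accuracy| gap over equal-width confidence
    bins (binning scheme assumed; the paper does not state its choice)."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            # Bin weight is its share of all questions.
            ece += in_bin.mean() * abs(conf[in_bin].mean() - correct[in_bin].mean())
    return float(ece)

def flip_rate(first_run, retest_run):
    """Fraction of retested questions whose chosen answer changed between runs."""
    first_run, retest_run = np.asarray(first_run), np.asarray(retest_run)
    return float(np.mean(first_run != retest_run))

# Hypothetical toy data: per-question confidence and correctness for one model.
conf = [0.95, 0.80, 0.99, 0.60, 0.90]
correct = [1, 1, 1, 0, 1]
print(brier_score(conf, correct))                 # lower is better
print(expected_calibration_error(conf, correct))  # paper reports 0.023 (Opus 4) vs. 0.215 (GPT-4)
print(flip_rate(list("ABCDA"), list("ABCDB")))    # 0.2: one of five answers flipped
```

On this toy data the flip rate is 0.2; in the study, a non-zero flip rate for a model means that identical questions sometimes received different answers on retest, which is the stochasticity the conclusions flag as a deployment concern.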
