This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Reasoning‐optimised large language models reach near‐expert accuracy on board‐style orthopaedic exams: A multi‐model comparison on 702 multiple‐choice questions
Citations: 1
Authors: 6
Year: 2025
Abstract
PURPOSE: The purpose of this study was to compare the accuracy, calibration, reproducibility and operating cost of seven large language models (LLMs), including four newer models capable of using advanced reasoning techniques to analyse complex medical information and generate accurate responses, on text-only orthopaedic multiple-choice questions (MCQs), and to quantify gains over GPT-4.

METHODS: From Orthobullets, 702 unique, non-image MCQs (drawn from the AAOS Self-Assessment Examinations, Self-Assessment-Based Questions and Orthopaedic In-Training Examination-Based Questions banks) were extracted. Each question was submitted to the following LLMs: OpenAI o3, Anthropic Claude Sonnet 4, Claude Opus 4 (with/without 'Extended Thinking') and Google Gemini 2.5 Pro. Additionally, OpenAI's GPT-4, GPT-4o and the open-weight Gemma 3 27B served as comparators. The primary outcome was overall accuracy. The secondary outcomes were topic- and difficulty-stratified accuracy, calibration (expected calibration error [ECE] and Brier score), reproducibility (flip rate on a retest question subset), latency, token use and cost. Statistical tests included paired McNemar, Cochran Q, ordinal logistic regression and Fleiss κ (Bonferroni-adjusted α = 0.05).

RESULTS: OpenAI o3 led with 93.6% accuracy (95% CI = 91.5-95.2), a 34% relative error reduction over GPT-4. Accuracy tended to decline with question difficulty, yet the reasoning advantage persisted in every difficulty stratum. Claude Opus 4 showed the best calibration (ECE = 0.023), while GPT-4 exhibited overconfidence (ECE = 0.215). All models except Gemma 3 27B exhibited non-zero flip rates. Median query time ranged from 0.9 s (Gemma 3 27B) to 15.9 s (Gemini 2.5 Pro), and cost from 0 to 29.9 USD per 1000 queries.

CONCLUSIONS: Reasoning-optimised LLMs now answer text-based orthopaedic exam questions with high accuracy and substantially better confidence calibration than earlier models. However, persistent stochasticity and large latency-cost disparities may limit clinical deployment.

LEVEL OF EVIDENCE: N/A.
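The abstract's calibration outcomes, expected calibration error (ECE) and Brier score, can be illustrated with a minimal sketch. The functions below are a plain-Python rendition of the standard definitions (Brier score as the mean squared gap between stated confidence and the 0/1 outcome; ECE as a bin-weighted average gap between accuracy and mean confidence); the binning scheme and the example confidences are illustrative assumptions, not the study's implementation or data.

```python
def brier_score(confidences, correct):
    # Mean squared difference between stated confidence and outcome (0 or 1).
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)

def expected_calibration_error(confidences, correct, n_bins=10):
    # Group predictions into equal-width confidence bins, then take the
    # bin-size-weighted average of |accuracy - mean confidence| per bin.
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 -> last bin
        bins[idx].append((c, y))
    n = len(correct)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Hypothetical per-question confidences and correctness flags (illustration only)
conf = [0.9, 0.8, 0.95, 0.6, 0.99, 0.7]
hit = [1, 1, 1, 0, 1, 1]
print(brier_score(conf, hit))
print(expected_calibration_error(conf, hit))
```

On a well-calibrated model both values approach zero; an overconfident model like the abstract's GPT-4 (ECE = 0.215) states confidences that systematically exceed its bin-level accuracy.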
Institutions
- University of Lisbon(PT)
- Erasmus Hospital(BE)
- Centre Hospitalier Universitaire Brugmann(BE)
- Institute for Biotechnology and Bioengineering(PT)
- University of Miyazaki(JP)
- Universitätsklinik Balgrist(CH)
- Centro Hospitalar Póvoa de Varzim Vila do Conde EPE
- University of Minho(PT)
- Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento(PT)
- University of Gothenburg(SE)