This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Large language models versus human examinee performance on Israeli anesthesiology board examinations
Citations: 0 · Authors: 11 · Year: 2026
Abstract
Large Language Models (LLMs) demonstrate increasing capabilities in medical knowledge assessment, yet limitations remain in cross-population validation, direct human-AI comparisons, and evaluation of newer models in anesthesiology contexts. This study addresses these gaps by conducting a head-to-head comparison between newer LLMs and human examinees on official Israeli multiple-choice board examinations. We evaluated two LLMs (Claude 3.7 Sonnet and ChatGPT-4) against anonymized aggregate data from 381 examinees on three consecutive official Israeli anesthesiology board examinations (2023–2024), comprising 450 multiple-choice questions stratified by difficulty, discrimination ability, and topic. Each model was tested twice per exam. Claude 3.7 Sonnet achieved 73.67% accuracy, significantly outperforming both human examinees (62.77%, P < 0.001) and ChatGPT-4 (64.44%, P < 0.001). However, both LLMs performed below the upper quartile of human performance (78.05%). While LLMs excelled on easy questions and theoretical domains like cardiac physiology (Claude: 96.88%, ChatGPT-4: 81.25%), they showed lower performance in areas such as ambulatory anesthesia (Claude: 30.00%, ChatGPT-4: 10.00%) and regional anesthesia (Claude: 44.44%, ChatGPT-4: 38.89%). Human examinees demonstrated consistent performance across all domains, whereas LLMs showed extreme variability. While advanced LLMs currently exceed average examinee performance on anesthesiology board examinations, they fall short of top-quartile examinees and demonstrate significant performance variability across topic areas. Self-consistency was substantial for both LLMs (κ = 0.66–0.68), but agreement with human responses was only moderate (κ = 0.34–0.39).
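The self-consistency and human-agreement figures above are Cohen's kappa values. A minimal Python sketch of that statistic, using illustrative answer lists (not the study's data):

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from marginal frequencies.
    """
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items where both raters chose the same answer.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: product of each rater's marginal probability per label.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical multiple-choice answers from two runs of the same model:
run_1 = ["A", "B", "A", "B"]
run_2 = ["A", "B", "B", "B"]
print(cohen_kappa(run_1, run_2))  # → 0.5
```

On the study's scale, values around 0.66–0.68 count as "substantial" agreement and 0.34–0.39 as "moderate" under the commonly used Landis–Koch interpretation.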