OpenAlex · Updated hourly · Last updated: 10 Apr 2026, 22:18

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large language models versus human examinee performance on Israeli anesthesiology board examinations

2026 · 0 citations · Scientific Reports · Open Access
Open full text at the publisher

Citations: 0 · Authors: 11 · Year: 2026
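
These counts mirror what the public OpenAlex API returns for a work. As a minimal sketch (assuming the `requests` library; the field names `cited_by_count`, `authorships`, and `publication_year` follow the OpenAlex work schema, and the title search simply takes the top hit), the same metadata can be fetched programmatically:

```python
import requests

# Search the public OpenAlex API for this work by title; no API key is needed.
resp = requests.get(
    "https://api.openalex.org/works",
    params={
        "search": "Large language models versus human examinee performance "
                  "on Israeli anesthesiology board examinations"
    },
    timeout=30,
)
resp.raise_for_status()
work = resp.json()["results"][0]  # top search hit; verify the title before trusting it

print(work["title"])
print("Citations:", work["cited_by_count"])    # 0 at the time this page was rendered
print("Authors:  ", len(work["authorships"]))  # 11
print("Year:     ", work["publication_year"])  # 2026
```

Counts on a live page can drift from an API snapshot, since OpenAlex refreshes on its own schedule.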

Abstract

Large Language Models (LLMs) demonstrate increasing capabilities in medical knowledge assessment, yet limitations remain in cross-population validation, direct human-AI comparisons, and evaluation of newer models in anesthesiology contexts. This study addresses these gaps by conducting a head-to-head comparison between newer LLMs and human examinees on official Israeli multiple-choice board examinations. We evaluated two LLMs (Claude 3.7 Sonnet and ChatGPT-4) against anonymized aggregate data from 381 examinees on three consecutive official Israeli anesthesiology board examinations (2023–2024), comprising 450 multiple-choice questions stratified by difficulty, discrimination ability, and topic. Each model was tested twice per exam. Claude 3.7 Sonnet achieved 73.67% accuracy, significantly outperforming both human examinees (62.77%, P < 0.001) and ChatGPT-4 (64.44%, P < 0.001). However, both LLMs performed below the upper quartile of human performance (78.05%). While LLMs excelled on easy questions and theoretical domains such as cardiac physiology (Claude: 96.88%, ChatGPT-4: 81.25%), they performed markedly worse in ambulatory anesthesia (Claude: 30.00%, ChatGPT-4: 10.00%) and regional anesthesia (Claude: 44.44%, ChatGPT-4: 38.89%). Human examinees demonstrated consistent performance across all domains, whereas LLMs showed extreme variability. Self-consistency was substantial for both LLMs (κ = 0.66–0.68), but agreement with human responses was moderate (κ = 0.34–0.39). While advanced LLMs currently exceed average examinee performance on anesthesiology board examinations, they fall short of top-quartile examinees and show substantial performance variability across topic areas.
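
The self-consistency and human-agreement figures are Cohen's kappa, which discounts the agreement two answer sets would show by chance alone: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected from each answer set's marginal frequencies. A minimal sketch of the computation (the answer strings below are invented for illustration; they are not the study's data):

```python
from collections import Counter

def cohen_kappa(run1, run2):
    """Cohen's kappa between two categorical answer sequences (e.g. MCQ letters)."""
    n = len(run1)
    # Observed agreement: fraction of items answered identically in both runs.
    p_o = sum(a == b for a, b in zip(run1, run2)) / n
    # Chance agreement from each run's marginal label frequencies.
    c1, c2 = Counter(run1), Counter(run2)
    p_e = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical runs of one model over the same 10 questions.
run_a = list("ABCDABCDAB")
run_b = list("ABCDABCDBA")
print(round(cohen_kappa(run_a, run_b), 2))  # 0.73: raw agreement 0.80, chance 0.26
```

On this scale the models' self-consistency (κ ≈ 0.66–0.68) indicates substantial run-to-run stability, while model-human agreement (κ ≈ 0.34–0.39) sits well below it.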

Related works