OpenAlex · Updated hourly · Last updated: 18.05.2026, 03:24

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessment of frontier Large Language Models in sleep medicine

2026 · 0 citations · Frontiers in Digital Health · Open Access

Citations: 0 · Authors: 7 · Year: 2026

Abstract

Study objectives: To evaluate and compare the performance of two proprietary frontier large language models (LLMs), ChatGPT-5 and Grok-4, on diagnostic reasoning and foundational knowledge tasks within the specialty domain of sleep medicine.

Methods: The models were evaluated on two tasks: case-based reasoning using 79 clinical vignettes from the AASM Case Book of Sleep Medicine and knowledge assessment using 897 multiple-choice questions (MCQs) from board review materials. For vignettes, final diagnosis was scored by concept-level exact match, and differential diagnosis (DDx) was scored on a fixed top-5 output using concept-level matching with synonym normalization to compute precision, recall, and F1-score. MCQ performance was the proportion correct. Inter-model performance was compared using the Mann–Whitney U test.

Results: Both models achieved high accuracy for final diagnosis (92.4% for both; 95% CI 86.4, 98.4) and MCQs (ChatGPT-5: 93.0%; Grok-4: 92.8%). However, performance on generating a comprehensive differential diagnosis was suboptimal, with modest F1-scores for both ChatGPT-5 (0.55 ± 0.20) and Grok-4 (0.59 ± 0.20). There were no statistically significant differences in performance between the two models across any metric (p > 0.05).

Conclusions: Frontier LLMs demonstrated high accuracy in sleep medicine tasks requiring knowledge recall and direct pattern recognition but showed more limited performance in complex clinical reasoning tasks such as generating a comprehensive differential diagnosis. These findings suggest that current general-purpose models may be more reliable for focused knowledge support than for broad hypothesis generation. Future studies should evaluate whether domain-adapted models or clinician-in-the-loop workflows can improve real-world diagnostic usefulness and safety.
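
The differential-diagnosis scoring described in the Methods (a fixed top-5 list, concept-level matching after synonym normalization, then precision, recall, and F1) can be illustrated with a minimal Python sketch. This is not the authors' code: the synonym table, the function names `normalize` and `score_ddx`, and the example diagnoses are hypothetical, and the abstract does not say how duplicates or empty lists were handled, so those choices are assumptions.

```python
# Hypothetical synonym table mapping surface forms to one canonical concept.
# The study's actual normalization vocabulary is not given in the abstract.
SYNONYMS = {
    "osa": "obstructive sleep apnea",
    "obstructive sleep apnoea": "obstructive sleep apnea",
    "rls": "restless legs syndrome",
    "willis-ekbom disease": "restless legs syndrome",
}

TOP_K = 5  # the abstract states a fixed top-5 output


def normalize(term: str) -> str:
    """Lowercase, collapse whitespace, and map known synonyms to a concept."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def score_ddx(predicted: list[str], reference: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one vignette.

    Assumption: entries that normalize to the same concept count once,
    and an empty list yields a score of 0.0 rather than an error.
    """
    pred = {normalize(t) for t in predicted[:TOP_K]}
    ref = {normalize(t) for t in reference}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up example lists, for illustration only.
    model_ddx = ["OSA", "Insomnia", "RLS", "Narcolepsy", "Depression"]
    gold_ddx = [
        "Obstructive sleep apnea",
        "Restless legs syndrome",
        "Periodic limb movement disorder",
    ]
    p, r, f1 = score_ddx(model_ddx, gold_ddx)
    print(round(p, 2), round(r, 2), round(f1, 2))
```

In this made-up example two of the five predicted concepts match the three reference concepts, so precision is 2/5 and recall is 2/3. Averaging such per-vignette F1 values over all vignettes would give a summary of the kind reported in the Results, although the abstract does not state the exact aggregation used.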

Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Machine Learning in Healthcare