OpenAlex · Updated hourly · Last updated: 12.05.2026, 06:12

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Differential performance of large language models in advanced cardiac life support assessment: A comprehensive multi-dimensional analysis of accuracy, consistency, and visual recognition capabilities

2026 · 0 citations · PLoS ONE · Open Access

Citations: 0 · Authors: 10 · Year: 2026

Abstract

BACKGROUND: Large Language Models (LLMs) have been increasingly adopted in healthcare settings, yet comparative evaluations of their performance in standardized medical assessments remain limited. This study aims to evaluate the accuracy and consistency of four LLMs in answering Advanced Cardiac Life Support (ACLS) questions.

METHODS: In this observational study, 50 ACLS questions were categorized as knowledge-based (n = 29), visual content (n = 12), or case-based (n = 9). Each question was posed to ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1 on three separate occasions to assess consistency. Performance was evaluated using three accuracy metrics: overall accuracy (all three responses correct), strict accuracy (at least two responses correct), and ideal accuracy (at least one response correct).

RESULTS: ChatGPT-4o demonstrated superior performance with 100% accuracy across all categories and perfect consistency (Fleiss' Kappa = 1.0). Claude 3.5 achieved 92.0% overall accuracy with excellent consistency (Fleiss' Kappa = 0.89). Gemini 2.0 showed 86.0% overall accuracy with moderate consistency (Fleiss' Kappa = 0.58). DeepSeek R1 performed lowest at 70.0% overall accuracy with moderate consistency (Fleiss' Kappa = 0.58) and failed completely on visual content questions (0%). All models achieved 100% accuracy on knowledge-based questions. Performance differences were statistically significant across models (p < 0.001).

CONCLUSION: LLMs demonstrate variable capabilities in ACLS knowledge assessment, with ChatGPT-4o showing exceptional performance. While these models show promise as supplementary tools in resuscitation education and clinical decision support, significant variations in visual recognition capabilities and response consistency highlight the importance of critical evaluation before clinical implementation.
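The three accuracy metrics defined in METHODS can be sketched as a simple computation over a question-by-attempt grid of correct/incorrect outcomes. The metric definitions are taken from the abstract; the function name and the sample data below are illustrative assumptions, not from the study:

```python
# Sketch of the three accuracy metrics described in the abstract.
# Each row is one question; each column is one of the three repeated attempts.
# Entries are True (correct) or False (incorrect). Data here is illustrative.

def accuracy_metrics(results):
    """results: list of [bool, bool, bool], one row per question."""
    n = len(results)
    overall = sum(all(r) for r in results) / n      # all three responses correct
    strict = sum(sum(r) >= 2 for r in results) / n  # at least two responses correct
    ideal = sum(any(r) for r in results) / n        # at least one response correct
    return overall, strict, ideal

# Hypothetical example: 4 questions, 3 attempts each
sample = [
    [True, True, True],     # counts toward all three metrics
    [True, True, False],    # strict and ideal only
    [True, False, False],   # ideal only
    [False, False, False],  # none
]
print(accuracy_metrics(sample))  # (0.25, 0.5, 0.75)
```

By construction, overall ≤ strict ≤ ideal for any response grid, which is why the abstract reports "overall accuracy" as the most conservative of the three figures.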
