This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Artificial Intelligence in Cardiology Examination: A Comparative Study of Physicians, ChatGPT-4, and ChatGPT-5
Citations: 0
Authors: 16
Year: 2026
Abstract
Background
Artificial intelligence (AI), particularly large language models (LLMs), is increasingly applied in medicine. Previous studies have shown that models such as GPT-4 can achieve performance comparable to physicians on medical knowledge tests. However, direct comparisons between the newest GPT-5 model, its predecessor GPT-4, and physicians are lacking, especially with respect to theoretical versus clinical question types.

Methodology
This comparative study analyzed 120 multiple-choice questions from the spring 2025 Polish National Cardiology Specialization Examination administered by the Medical Examination Center (CEM) in Łódź. The dataset included 62 theoretical questions and 58 clinical scenario-based questions. The performance of physicians who sat for the exam (n = 153) was compared with responses generated by ChatGPT-4 and ChatGPT-5. Both models were evaluated independently on the original Polish-language questions under standardized, exam-like conditions, without access to external databases or tools. Model accuracy was calculated overall and by question type. Statistical comparisons were performed using one-sample proportion Z-tests and McNemar's tests, with a Bonferroni correction for multiple comparisons (significance threshold p < 0.0167, i.e., 0.05/3).

Results
Physicians achieved a mean accuracy of 72.5%, compared with 77.5% for ChatGPT-4 and 79.2% for ChatGPT-5. The difference between physicians and GPT-5 did not reach statistical significance (p = 0.100). On theoretical questions, GPT-4 achieved the highest score (85.5%), but the difference was not significant after Bonferroni correction (p = 0.044). On clinical questions, GPT-5 achieved the highest score (77.6%) compared with physicians (70.7%) and GPT-4 (69.0%), although the differences did not reach statistical significance.

Conclusions
No statistically significant differences in accuracy were detected between GPT-4, GPT-5, and physicians. These results suggest a possible role for LLMs as educational support tools, although confirmation in real-world clinical settings remains necessary. GPT-4 obtained numerically higher accuracy on theoretical questions, whereas GPT-5 scored higher on clinical scenario-based items; however, these differences were not statistically significant. This pattern may indicate evolving performance characteristics across newer generations of LLMs, but such an interpretation remains tentative and requires further validation. Overall, the findings point to the potential usefulness of LLMs in medical education and knowledge-support contexts; they should not, however, be interpreted as evidence of clinical competence. Further studies across multiple specialties and real-world healthcare environments are needed to determine their practical applicability.
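To make the statistical workflow concrete, the following is a minimal Python sketch (using statsmodels) of the tests named in the methodology: a one-sample proportion Z-test of a model's accuracy against the physicians' mean, a McNemar's test on paired per-item outcomes, and the Bonferroni-corrected threshold of 0.05/3. The per-item response vectors are hypothetical placeholders, not the study's data; only the aggregate figures reported above (120 items, 72.5% physician accuracy, the corrected alpha) are taken from the abstract.

```python
# Sketch of the abstract's statistical workflow. The item-level correctness
# vectors are simulated placeholders; only the published aggregates are real.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.contingency_tables import mcnemar

N_ITEMS = 120
PHYSICIAN_ACC = 0.725          # mean physician accuracy reported in the study
ALPHA = 0.05 / 3               # Bonferroni correction for 3 comparisons -> ~0.0167

rng = np.random.default_rng(0)
# Hypothetical per-item correctness (1 = correct) for each model.
gpt4_correct = rng.binomial(1, 0.775, N_ITEMS)
gpt5_correct = rng.binomial(1, 0.792, N_ITEMS)

# One-sample proportion Z-test: does the model's accuracy differ from the
# physicians' mean accuracy?
stat, p = proportions_ztest(count=gpt5_correct.sum(), nobs=N_ITEMS,
                            value=PHYSICIAN_ACC)
print(f"GPT-5 vs physicians: z = {stat:.3f}, p = {p:.4f}, "
      f"significant at corrected alpha: {p < ALPHA}")

# McNemar's test on paired per-item outcomes of the two models:
# a 2x2 table of (GPT-4 correct/incorrect) x (GPT-5 correct/incorrect).
table = np.array([
    [np.sum((gpt4_correct == 1) & (gpt5_correct == 1)),
     np.sum((gpt4_correct == 1) & (gpt5_correct == 0))],
    [np.sum((gpt4_correct == 0) & (gpt5_correct == 1)),
     np.sum((gpt4_correct == 0) & (gpt5_correct == 0))],
])
result = mcnemar(table, exact=True)
print(f"GPT-4 vs GPT-5 (McNemar): p = {result.pvalue:.4f}, "
      f"significant at corrected alpha: {result.pvalue < ALPHA}")
```

McNemar's test is the natural choice for the model-versus-model comparison because both models answered the same 120 items, so the per-item outcomes are paired rather than independent.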
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,456 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,332 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,779 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,533 citations