This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Artificial Intelligence in Cardiology Examination: A Comparative Study of Physicians, ChatGPT-4, and ChatGPT-5
Citations: 0
Authors: 16
Year: 2026
Abstract
Background
Artificial intelligence (AI), particularly large language models (LLMs), is increasingly applied in medicine. Previous studies have shown that models such as GPT-4 can achieve performance comparable to physicians on medical knowledge tests. However, direct comparisons between the newest GPT-5 model, its predecessor GPT-4, and physicians are lacking, especially with respect to theoretical versus clinical question types.

Methodology
This comparative study analyzed 120 multiple-choice questions from the spring 2025 Polish National Cardiology Specialization Examination administered by the Medical Examination Center (CEM) in Łódź. The dataset included 62 theoretical questions and 58 clinical scenario-based questions. The performance of physicians who sat for the exam (n = 153) was compared with responses generated by ChatGPT-4 and ChatGPT-5. Both models were evaluated independently on the original Polish-language questions under standardized, exam-like conditions, without access to external databases or tools. Model accuracy was calculated overall and by question type. Statistical comparisons were performed using one-sample proportion Z-tests and McNemar's tests, with a Bonferroni correction for multiple comparisons (significance threshold p < 0.0167, i.e., 0.05/3).

Results
Physicians achieved a mean accuracy of 72.5%, compared with 77.5% for ChatGPT-4 and 79.2% for ChatGPT-5. The difference between physicians and GPT-5 did not reach statistical significance (p = 0.100). On theoretical questions, GPT-4 achieved the highest score (85.5%), but the difference was not significant after Bonferroni correction (p = 0.044). On clinical questions, GPT-5 achieved the highest score (77.6%) compared with physicians (70.7%) and GPT-4 (69.0%), although the differences did not reach statistical significance.

Conclusions
No statistically significant differences in accuracy were detected between GPT-4, GPT-5, and physicians. These results suggest a possible role for LLMs as educational support tools, although confirmation in real-world clinical settings remains necessary. GPT-4 obtained numerically higher accuracy on theoretical questions, whereas GPT-5 scored higher on clinical scenario-based items; however, these differences were not statistically significant. This pattern may indicate evolving performance characteristics across newer generations of LLMs, but such an interpretation remains tentative and requires further validation. Overall, the findings point to the potential usefulness of LLMs in medical education and knowledge-support contexts; they should not, however, be interpreted as evidence of clinical competence. Further studies across multiple specialties and real-world healthcare environments are needed to determine their practical applicability.
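To make the statistical workflow concrete, the following is a minimal Python sketch (using statsmodels) of the tests named in the methodology: a one-sample proportion Z-test of a model's accuracy against the physicians' mean, a McNemar's test on paired per-item outcomes, and the Bonferroni-corrected threshold of 0.05/3. The per-item response vectors are hypothetical placeholders, not the study's data; only the aggregate figures reported above (120 items, 72.5% physician accuracy, the corrected alpha) are taken from the abstract.

```python
# Sketch of the abstract's statistical workflow. The item-level correctness
# vectors are simulated placeholders; only the published aggregates are real.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.contingency_tables import mcnemar

N_ITEMS = 120
PHYSICIAN_ACC = 0.725          # mean physician accuracy reported in the study
ALPHA = 0.05 / 3               # Bonferroni correction for 3 comparisons -> ~0.0167

rng = np.random.default_rng(0)
# Hypothetical per-item correctness (1 = correct) for each model.
gpt4_correct = rng.binomial(1, 0.775, N_ITEMS)
gpt5_correct = rng.binomial(1, 0.792, N_ITEMS)

# One-sample proportion Z-test: does the model's accuracy differ from the
# physicians' mean accuracy?
stat, p = proportions_ztest(count=gpt5_correct.sum(), nobs=N_ITEMS,
                            value=PHYSICIAN_ACC)
print(f"GPT-5 vs physicians: z = {stat:.3f}, p = {p:.4f}, "
      f"significant at corrected alpha: {p < ALPHA}")

# McNemar's test on paired per-item outcomes of the two models:
# a 2x2 table of (GPT-4 correct/incorrect) x (GPT-5 correct/incorrect).
table = np.array([
    [np.sum((gpt4_correct == 1) & (gpt5_correct == 1)),
     np.sum((gpt4_correct == 1) & (gpt5_correct == 0))],
    [np.sum((gpt4_correct == 0) & (gpt5_correct == 1)),
     np.sum((gpt4_correct == 0) & (gpt5_correct == 0))],
])
result = mcnemar(table, exact=True)
print(f"GPT-4 vs GPT-5 (McNemar): p = {result.pvalue:.4f}, "
      f"significant at corrected alpha: {result.pvalue < ALPHA}")
```

McNemar's test is the natural choice for the model-versus-model comparison because both models answered the same 120 items, so the per-item outcomes are paired rather than independent.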
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,456 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,332 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,779 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,533 citations