OpenAlex · Updated hourly · Last updated: 24.05.2026, 12:32

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comprehensive analysis of the performance of GPT-3.5 and GPT-4 on the American Urological Association self-assessment study program exams from 2012-2023.

2023 · 1 citation · PubMed

Citations: 1
Authors: 2
Year: 2023

Abstract

INTRODUCTION: Artificial intelligence (AI) applications, specifically generative pre-trained transformers, have shown potential in medical education and board-style examinations. To assess this capability, we compared the performance of GPT-3.5 and GPT-4 on the American Urological Association (AUA) self-assessment study program (SASP) exams from 2012-2023.

METHODS: We used a standardized prompt to administer questions from the AUA SASP exams spanning 2012-2023, totalling 1679 questions. The performance of the two AI models, GPT-3.5 and GPT-4, was evaluated based on the number of questions answered correctly. Statistical analysis was performed using Fisher's exact test and independent sample t-tests to compare the performance of GPT-4 to that of GPT-3.5 among test years and urology topic areas. Percentile scores were not calculable; however, a score of 50% is required to acquire CME credits on AUA SASP exams.

RESULTS: GPT-4 performed significantly better, scoring above 50% in all exam years except 2018, with scores ranging from 48-64%. In contrast, GPT-3.5 consistently scored below this threshold, with scores ranging from 26-38%. The total combined score for GPT-4 was 55%, significantly higher than the 33% achieved by GPT-3.5 (odds ratio [OR] 2.5, 95% confidence interval [CI] 2.2-2.9, p<0.001). GPT-4 significantly outperformed GPT-3.5 among AUA SASP test years from 2012-2023 (mean difference 23, t(22)=14, 95% CI 19-26, p<0.001), as well as among urology topic areas (mean difference 21, t(52)=5.5, 95% CI 13-29, p<0.001).

CONCLUSIONS: GPT-4 scored significantly higher than GPT-3.5 on the AUA SASP exams in overall performance, across all test years, and in various urology topic areas. This suggests that evolving AI language models are improving at answering clinical urology questions; however, certain aspects of medical knowledge and clinical reasoning remain challenging for AI language models.
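The reported overall odds ratio can be roughly reproduced from the figures in the abstract. The sketch below is illustrative only and is not the authors' analysis: the abstract gives rounded percentages (55% and 33% of 1679 questions), so the correct-answer counts here are reconstructed approximations, and the confidence interval uses a Wald (log-scale) approximation rather than the Fisher's exact test named in the methods. The function name `odds_ratio_with_ci` is a hypothetical helper.

```python
import math


def odds_ratio_with_ci(correct_a, total_a, correct_b, total_b, z=1.96):
    """Odds ratio of group A vs. group B with a Wald (log-scale) interval.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    wrong_a = total_a - correct_a
    wrong_b = total_b - correct_b
    # Odds ratio from the 2x2 table of correct/incorrect answers.
    odds_ratio = (correct_a * wrong_b) / (wrong_a * correct_b)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se_log_or = math.sqrt(
        1 / correct_a + 1 / wrong_a + 1 / correct_b + 1 / wrong_b
    )
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


TOTAL_QUESTIONS = 1679  # stated in the abstract

# ASSUMPTION: counts reconstructed from the rounded percentages in the
# abstract; the exact counts are not given there.
gpt4_correct = round(0.55 * TOTAL_QUESTIONS)
gpt35_correct = round(0.33 * TOTAL_QUESTIONS)

or_value, ci_lower, ci_upper = odds_ratio_with_ci(
    gpt4_correct, TOTAL_QUESTIONS, gpt35_correct, TOTAL_QUESTIONS
)
print(f"OR = {or_value:.2f}, 95% CI {ci_lower:.2f}-{ci_upper:.2f}")
```

Under these assumptions the result lands close to the reported OR of 2.5 with a 95% CI of 2.2-2.9; small differences from the published values would be expected because of rounding and the different interval method.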

Topics

Artificial Intelligence in Healthcare and Education
Machine Learning in Healthcare
Radiomics and Machine Learning in Medical Imaging