Does AI have utility in medical student surgical education? A comparative analysis of chatbots in answering standardized surgical multiple-choice questions

2025 · 0 citations · 5 authors · Global Surgical Education - Journal of the Association for Surgical Education · Open Access

Abstract

Purpose: Artificial intelligence (AI) chatbots have potential as adjunctive medical education tools: they can provide question-specific explanations and supplemental content for students learning from self-assessments. Chatbot performance on general surgery exams has not been studied at the medical student level. This study assesses the accuracy of popular low-cost chatbots (ChatGPT, Gemini, and Claude) in answering National Board of Medical Examiners (NBME) surgery practice questions for use in medical student education. Character count, as a proxy for question complexity, was assessed in relation to accuracy.

Methods: ChatGPT-4o mini, ChatGPT o3-mini, Gemini 2.0 Flash, and Claude 3.5 Sonnet were each prompted to answer 20 multiple-choice questions from the NBME Surgery Sample Items and to justify their answers, over three attempts. Character count, answer choice, and explanation were recorded for each question. A logistic regression model assessed the relationship between accuracy and question character count.

Results: ChatGPT o3-mini and Claude 3.5 Sonnet scored 100% on all three attempts. Gemini 2.0 Flash scored 95% on all three attempts, with an odds ratio of 0.904 [0.446, 1.831] (p = 0.7794). ChatGPT-4o mini averaged 95% across three attempts, with an odds ratio of 0.998 [0.993, 1.003] (p = 0.4669). There was no statistically significant relationship between character count and accuracy.

Conclusions: The lack of correlation between question length and response accuracy suggests that question complexity may not affect the performance of these models. ChatGPT o3-mini and Claude 3.5 Sonnet outperform their counterparts on standardized general surgery exam questions, showing potential as supplementary tools for surgery students. ChatGPT-4o mini and Gemini 2.0 Flash have room for improvement before they can serve the same purpose. Future models trained more thoroughly on core surgical concepts could provide more comprehensive explanations.
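The abstract reports per-model odds ratios with 95% confidence intervals from a logistic regression of answer correctness on question character count. A minimal sketch of that style of analysis in Python, using statsmodels and entirely hypothetical per-question data (the study's raw responses are not reproduced here), might look like this:

    # Sketch: logistic regression of answer correctness on question
    # character count, mirroring the analysis described in the abstract.
    # All values below are hypothetical placeholders, not study data.
    import numpy as np
    import statsmodels.api as sm

    # Hypothetical per-question records for one model on one attempt:
    # character count of each question, and 1 = correct / 0 = incorrect.
    char_counts = np.array([412, 655, 380, 901, 540, 733, 618, 295,
                            842, 477, 560, 690, 350, 780, 505, 630,
                            420, 870, 515, 600], dtype=float)
    correct = np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
                        1, 1, 1, 1, 1, 1, 1, 0, 1, 1])

    # Fit the logit model: P(correct) ~ intercept + character count.
    X = sm.add_constant(char_counts)
    fit = sm.Logit(correct, X).fit(disp=False)

    # Exponentiate the slope to get the odds ratio per character and
    # its 95% CI, the form of statistic quoted in the abstract
    # (e.g., 0.998 [0.993, 1.003]).
    odds_ratio = float(np.exp(fit.params[1]))
    ci_low, ci_high = np.exp(fit.conf_int()[1])
    print(f"OR per character = {odds_ratio:.3f} "
          f"[{ci_low:.3f}, {ci_high:.3f}], p = {fit.pvalues[1]:.4f}")

An odds ratio near 1.0 whose confidence interval spans 1.0, as in both reported results, is consistent with the abstract's conclusion that question length had no detectable effect on accuracy.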

Topics

Artificial Intelligence in Healthcare and Education · Surgical Simulation and Training · Anatomy and Medical Technology