This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance of 5 AI Models on United States Medical Licensing Examination Step 1 Questions: Comparative Observational Study
Citations: 0
Authors: 6
Year: 2026
Abstract
BACKGROUND: Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on United States Medical Licensing Examination (USMLE)-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.
OBJECTIVE: This study aimed to evaluate and compare the performance of 5 publicly available AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) on the USMLE Step 1 free 120-question set, assessing their accuracy and consistency across question types and medical subjects.
METHODS: This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding 1 audio-based item) was presented to each AI model by using a standardized prompt cycle. Models answered each question 3 times to assess confidence and consistency. Questions were categorized as text-based or image-based and as case-based or information-based. Statistical analysis was performed using chi-square and Fisher exact tests, with Bonferroni adjustment for pairwise comparisons.
RESULTS: Grok achieved the highest score (109/119, 91.6%), followed by Copilot (101/119, 84.9%), Gemini (100/119, 84%), ChatGPT-4 (95/119, 79.8%), and DeepSeek (86/119, 72.3%). DeepSeek's lower score was due to an inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n=96), DeepSeek's accuracy increased to 89.6% (86/96), matching Copilot. Grok showed the highest accuracy on image-based (21/23, 91.3%) and case-based questions (70/78, 89.7%), with statistically significant differences observed between Grok and DeepSeek on case-based items (P=.01). Averaged across the models, performance was best in biostatistics and epidemiology (5.8/6, 96.7%) and worst in musculoskeletal, skin, and connective tissue (4.4/7, 62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (112/119, 94.1% consistency), improving its accuracy to 89.9% (107/119) on the third attempt.
CONCLUSIONS: AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.
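The overall accuracy percentages in the abstract can be recomputed from the reported counts, and the kind of pairwise comparison named in the methods (Fisher exact test with Bonferroni adjustment) can be illustrated on those same counts. The following is a minimal sketch, not the authors' analysis code: the helper names (`accuracy_pct`, `fisher_exact_two_sided`, `bonferroni`) are hypothetical, the choice to compare all 10 model pairs on overall scores is an assumption, and the Fisher test is a plain standard-library implementation of the usual two-sided definition. The abstract's P=.01 refers to case-based items, for which DeepSeek's counts are not given, so that value is not reproduced here.

```python
from itertools import combinations
from math import comb

# Correct answers out of 119 questions, as reported in the abstract.
TOTAL = 119
CORRECT = {"Grok": 109, "Copilot": 101, "Gemini": 100, "ChatGPT-4": 95, "DeepSeek": 86}


def accuracy_pct(correct, total):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)


# Assumption: every pair of models is compared on overall correct/incorrect counts.
pairs = list(combinations(CORRECT, 2))
adjusted = {}
for m1, m2 in pairs:
    a, c = CORRECT[m1], CORRECT[m2]
    p = fisher_exact_two_sided(a, TOTAL - a, c, TOTAL - c)
    adjusted[(m1, m2)] = bonferroni(p, len(pairs))

for model, correct in CORRECT.items():
    print(f"{model}: {correct}/{TOTAL} = {accuracy_pct(correct, TOTAL)}%")
for (m1, m2), p_adj in adjusted.items():
    print(f"{m1} vs {m2}: Bonferroni-adjusted p = {p_adj:.4f}")
```

With 5 models there are 10 pairwise comparisons, so each raw p-value is multiplied by 10 under this assumption. A different set of comparisons (for example, only comparisons within a question category) would change the adjustment factor.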
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,719 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,628 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,176 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,880 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations