This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating the Performance of ChatGPT 3.5 and 4.0 on StatPearls Oculoplastic Surgery Text- and Image-Based Exam Questions
10
Citations
3
Authors
2024
Year
Abstract
INTRODUCTION: The emergence of large language models (LLMs) has led to significant interest in their potential use as medical assistive tools. Prior investigations have analyzed the overall comparative performance of LLM versions within different ophthalmology subspecialties. However, few investigations have characterized LLM performance on image-based questions, a recent advance in LLM capabilities. The purpose of this study was to evaluate the performance of Chat Generative Pre-Trained Transformers (ChatGPT) versions 3.5 and 4.0 on image-based and text-only oculoplastic subspecialty questions from the StatPearls and OphthoQuestions question banks.

METHODS: This study used 343 text-only questions from StatPearls, 127 image-based questions from StatPearls, and 89 image-based questions from OphthoQuestions, all specific to oculoplastics. The information collected included correctness, the distribution of answers, and whether an additional prompt was necessary. Performance on text-only questions was compared between ChatGPT-3.5 and ChatGPT-4.0, and ChatGPT-4.0's performance on text-only questions was compared with its performance on multimodal (image-based) questions.

RESULTS: ChatGPT-3.5 answered 56.85% (195/343) of text-only questions correctly, while ChatGPT-4.0 achieved 73.46% (252/343), a statistically significant difference in accuracy (p<0.05). The biserial correlation between ChatGPT-3.5 and human performance on the StatPearls question bank was 0.198, with a standard deviation of 0.195. When ChatGPT-3.5 was incorrect, average human correctness was 49.39% (SD 26.27%); when it was correct, human correctness averaged 57.82% (SD 30.14%), with a t-statistic of 3.57 and a p-value of 0.0004. For ChatGPT-4.0, the biserial correlation was 0.226 (SD 0.213). When ChatGPT-4.0 was incorrect, human correctness averaged 45.49% (SD 24.85%); when it was correct, human correctness was 57.02% (SD 29.75%), with a t-statistic of 4.28 and a p-value of 0.0006. On image-based questions, ChatGPT-4.0 correctly answered 56.94% (123/216), significantly lower than its performance on text-only questions (p<0.05).

DISCUSSION AND CONCLUSION: This study shows that ChatGPT-4.0 performs better on the oculoplastic subspecialty than prior versions. However, significant challenges remain regarding accuracy, particularly for image-based prompts. While these models show promise within medical education, further progress must be made regarding LLM reliability, and caution is warranted until that reliability is demonstrated.
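The statistics reported in the RESULTS section can be reproduced from per-question records of model correctness and human accuracy. The Python sketch below illustrates the kinds of tests involved. It is a minimal sketch, not the authors' analysis code: it assumes the point-biserial variant of the biserial correlation, Welch's two-sample t-test, and a two-proportion z-test, none of which the abstract specifies, and it runs on synthetic stand-in data (the names gpt_correct and human_pct are hypothetical).

```python
# Minimal sketch of the abstract's statistics on synthetic stand-in data.
# Assumptions (not stated in the abstract): point-biserial correlation,
# Welch's t-test, and a pooled two-proportion z-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-question records for 343 text-only questions.
gpt_correct = rng.integers(0, 2, size=343)   # 1 = model answered correctly
human_pct = rng.uniform(0, 100, size=343)    # % of human test-takers correct

# Correlation between model correctness and human accuracy, analogous to
# the reported r = 0.198 (GPT-3.5) and r = 0.226 (GPT-4.0).
r, p_corr = stats.pointbiserialr(gpt_correct, human_pct)

# Compare human accuracy on questions the model got right vs. wrong,
# analogous to the reported t-statistics (3.57 and 4.28).
t, p_t = stats.ttest_ind(human_pct[gpt_correct == 1],
                         human_pct[gpt_correct == 0],
                         equal_var=False)

# Two-proportion z-test for 3.5 vs. 4.0 accuracy on the same 343
# text-only questions (195/343 vs. 252/343).
count = np.array([195, 252])
nobs = np.array([343, 343])
p_pool = count.sum() / nobs.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / nobs[0] + 1 / nobs[1]))
z = (count[1] / nobs[1] - count[0] / nobs[0]) / se
p_z = 2 * stats.norm.sf(abs(z))

print(f"point-biserial r={r:.3f} (p={p_corr:.4f})")
print(f"Welch t={t:.2f} (p={p_t:.4f})")
print(f"two-proportion z={z:.2f} (p={p_z:.2e})")
```

On the real data, the t-test asks whether questions the model misses are also harder for humans; the small positive correlations reported in the abstract suggest only a weak relationship between model and human difficulty.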
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,646 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,554 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,071 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,851 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations