This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Accuracy and Reproducibility of Different Artificial Intelligence Chatbots’ Responses to Patient-Based Vitreoretinal Questions: A Comparative Study
Citations: 1
Authors: 13
Year: 2026
Abstract
Background: Generative artificial intelligence (AI) chatbots are increasingly used by patients, but their reliability for complex ophthalmic conditions remains uncertain. This study aimed to compare the accuracy, comprehensiveness, and reproducibility of five AI chatbots (ChatGPT-5.o, DeepSeek R1, Meta AI, Grok 3.0, and Google Gemini 2.5 Pro) in responding to patient-centered vitreoretinal questions.

Methods: A total of 135 questions covering diabetic retinopathy, floaters/flashes, age-related macular degeneration, retinal tear/detachment, and vitrectomy were sourced from the American Academy of Ophthalmology "Ask an Ophthalmologist" database. Each question was submitted twice to each chatbot under standardized instructions. Two board-certified vitreoretinal ophthalmologists independently graded responses for accuracy and reproducibility. Accuracy was calculated as the proportion of responses graded "Correct and comprehensive" or "Accurate but incomplete"; reproducibility was defined as agreement between the two responses.

Results: ChatGPT-5.o achieved the highest overall accuracy (94%, n=127/135, 95% CI: 89.9%-98.1%) with a reproducibility rate of 96.3% (n=130/135, 95% CI: 93.1%-99.5%). DeepSeek R1 demonstrated the greatest reproducibility (98.5%, n=133/135, 95% CI: 96.5%-100.0%) and high accuracy (92.6%, n=125/135, 95% CI: 88.1%-97.1%). Meta AI showed 91% (95% CI: 86.1%-95.9%) accuracy and 94% (95% CI: 89.9%-98.1%) reproducibility, whereas Grok 3.0 yielded the lowest accuracy (49.6%, n=67/135, 95% CI: 41.2%-58.0%) despite moderate reproducibility (88.1%, n=119/135, 95% CI: 82.7%-93.5%). Google Gemini 2.5 Pro recorded 72.6% (95% CI: 65.1%-80.1%) accuracy and the lowest reproducibility (77%, 95% CI: 69.9%-84.1%). By category, "Vitrectomy" scored highest across all chatbots (94%, 95% CI: 87.2%-100.0%), followed by "Macular degeneration" (90%, 95% CI: 85.0%-95.0%). "Diabetic retinopathy" scored the lowest accuracy rate (64.7%, 95% CI: 52.1%-77.3%).

Conclusion: ChatGPT-5.o and DeepSeek R1 approached accuracy and reproducibility comparable to clinical standards, indicating potential as patient-education tools in vitreoretinal care. However, variability across models and disease categories highlights the need for cautious clinical adoption and continued optimization to ensure safe, reliable information delivery.
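The accuracy metric above is a simple proportion reported with a 95% confidence interval. The abstract does not state which interval method the authors used; a minimal sketch, assuming the common Wald (normal-approximation) interval, reproduces figures close to those reported for ChatGPT-5.o (127/135 correct):

```python
from math import sqrt

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    """Proportion of correct responses with an approximate 95% CI.

    Uses the Wald interval p +/- z*sqrt(p(1-p)/n) -- an assumption,
    since the paper does not specify its CI method.
    """
    p = correct / total
    half = z * sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

# ChatGPT-5.o's reported tally: 127 of 135 responses graded accurate
acc, lo, hi = accuracy_ci(127, 135)
print(f"accuracy {acc:.1%}, 95% CI {lo:.1%}-{hi:.1%}")
```

This yields roughly 94.1% with a CI near 90.1%-98.1%, in line with the reported 94% (89.9%-98.1%); small differences would follow from rounding or from a different interval method (e.g. Wilson).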
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,697 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,602 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,127 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,872 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations