This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparison of the accuracy and reliability of ChatGPT-4o and Gemini in answering HIV-related questions
Citations: 1
Authors: 2
Year: 2025
Abstract
Large language models (LLMs) such as ChatGPT and Gemini are increasingly used to obtain health information, including on topics such as HIV. This study comparatively evaluates the accuracy, reliability, and reproducibility of ChatGPT and Gemini in answering HIV-related questions drawn from official public health sources, clinical guidelines, and social media. A total of 156 HIV-related questions were posed to ChatGPT-4o and Google Gemini 1.5 Flash across three categories: questions derived from United States Centers for Disease Control and Prevention (CDC) resources (44.2%, n = 69), guidelines (30.8%, n = 48), and social media (25.0%, n = 39). Responses were rated on a 4-point scale (1 = completely wrong, 4 = completely correct) by two infectious disease specialists. The reproducibility of both LLMs was also evaluated. The median score (IQR) across all questions was 4.00 (0.00) for ChatGPT and 4.00 (1.00) for Gemini (p = 0.051). The rate of completely correct answers was 81.4% for ChatGPT and 71.8% for Gemini (p = 0.045). ChatGPT demonstrated significantly lower accuracy on guideline-based questions (47.9%) than on CDC-related (97.1%) and social media-derived (94.9%) questions (p < 0.001 for both). Similarly, Gemini demonstrated significantly lower accuracy on guideline-based questions (35.4%) than on CDC-related (88.4%) and social media-derived (87.2%) questions (p < 0.001 for both). When the questions were grouped by topic, both LLMs had their lowest accuracy on 'Prevention and Treatment' (67.2% for ChatGPT, 54.7% for Gemini). The reproducibility of the answers was 94.8% for ChatGPT and 90.3% for Gemini. ChatGPT and Gemini answered CDC- and social media-based questions with high accuracy. However, both LLMs showed lower accuracy for guideline-based and "Prevention and Treatment" questions.
These findings suggest that while such models may provide useful general information, they are not yet reliable for clinical decision-making, and their outputs should be verified against evidence-based clinical guidelines.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,400 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,261 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,695 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,506 citations