OpenAlex · Updated hourly · Last updated: 07.04.2026, 05:22

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparison of the accuracy and reliability of ChatGPT-4o and Gemini in answering HIV-related questions

2025 · 1 citation · BMC Infectious Diseases · Open Access
Open full text at the publisher

Citations: 1
Authors: 2
Year: 2025

Abstract

Large language models (LLMs) such as ChatGPT and Gemini are increasingly used to obtain health information, including on topics such as HIV. This study comparatively evaluates the accuracy, reliability, and reproducibility of ChatGPT and Gemini in answering HIV-related questions drawn from official public health sources, clinical guidelines, and social media. A total of 156 HIV-related questions were posed to ChatGPT-4o and Google Gemini 1.5 Flash across three categories: questions derived from United States Centers for Disease Control and Prevention (CDC) resources (44.2%, n = 69), clinical guidelines (30.8%, n = 48), and social media (25.0%, n = 39). Responses were rated on a 4-point scale (1 = completely wrong, 4 = completely correct) by two infectious disease specialists, and the reproducibility of both LLMs was also evaluated. The median score (IQR) across all questions was 4.00 (0.00) for ChatGPT and 4.00 (1.00) for Gemini (p = 0.051). The rate of completely correct answers was 81.4% for ChatGPT and 71.8% for Gemini (p = 0.045). ChatGPT showed significantly lower accuracy on guideline-based questions (47.9%) than on CDC-related (97.1%) and social media-derived (94.9%) questions (p < 0.001 for both). Similarly, Gemini showed significantly lower accuracy on guideline-based questions (35.4%) than on CDC-related (88.4%) and social media-derived (87.2%) questions (p < 0.001 for both). When the questions were grouped by topic, both LLMs had their lowest accuracy rate in ‘Prevention and Treatment’ (67.2% for ChatGPT, 54.7% for Gemini). Reproducibility was 94.8% for ChatGPT and 90.3% for Gemini. Both ChatGPT and Gemini answered CDC- and social media-based questions with high accuracy; however, both showed lower accuracy on guideline-based and ‘Prevention and Treatment’ questions.
These findings suggest that while such models may provide useful general information, they are not yet reliable for clinical decision-making, and their outputs should be verified against evidence-based clinical guidelines.
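To make the headline metrics concrete, the sketch below shows how a "completely correct" rate and a reproducibility rate could be computed from 4-point rater scores. The data and function names are invented for illustration; they are not taken from the study.

```python
# Hypothetical illustration of the abstract's two headline metrics, using the
# study's 4-point scale (1 = completely wrong, 4 = completely correct).
# All ratings below are invented toy data, not the study's actual scores.

def completely_correct_rate(scores):
    """Share of answers rated 4 (completely correct)."""
    return sum(1 for s in scores if s == 4) / len(scores)

def reproducibility(first_run, second_run):
    """Share of questions whose rating was unchanged when re-asked."""
    assert len(first_run) == len(second_run)
    same = sum(1 for a, b in zip(first_run, second_run) if a == b)
    return same / len(first_run)

# Toy example with 10 questions, each asked twice:
run1 = [4, 4, 3, 4, 2, 4, 4, 1, 4, 4]
run2 = [4, 4, 3, 4, 2, 4, 3, 1, 4, 4]

print(completely_correct_rate(run1))  # 0.7
print(reproducibility(run1, run2))    # 0.9
```

The study additionally reports p-values for between-model comparisons, which would require a statistical test (e.g. a chi-square test on the correct/incorrect counts) rather than the simple proportions shown here.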


Topics

Artificial Intelligence in Healthcare and Education · Topic Modeling · Computational and Text Analysis Methods