OpenAlex · Updated hourly · Last updated: 13.05.2026, 03:17

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Performance of Large Language Models in Analyzing Common Hypertension Scenarios

2025 · 3 Citations · Hypertension
Open full text at publisher

3 Citations

9 Authors

Year: 2025

Abstract

BACKGROUND: Hypertension, the leading cause of cardiovascular mortality, remains suboptimally controlled. Large language models (LLMs) could improve hypertension control by augmenting clinical decision-making, but their reliability for guideline-driven tasks is unverified. This study evaluated the accuracy and safety of hypertension management recommendations generated by 3 LLMs.

METHODS: Fifty-one vignettes were constructed and submitted to the LLMs (GPT-4, Gemini, Medical Large Language Model [by Google; MedLM]) and a hypertension expert to generate responses. Three blinded reviewers rated each response on a 4-point accuracy scale and a binary safety (safe/unsafe) scale, and attempted to identify the source (LLM versus expert) of each response.

RESULTS: GPT-4 had the highest accuracy (83%) and safety (86%) scores among the LLMs but remained inferior to expert responses (92% accuracy, 93% safety). Gemini and MedLM performed significantly worse (accuracy: 64% and 35%; safety: 73% and 39%, respectively). GPT-4 generated the most guideline-concordant responses (46%) among the 3 LLMs (Gemini 35%, MedLM 14%) but fewer than the expert (68%). Interrater reliability for accuracy ratings was higher for LLM-generated responses (GPT-4 [intraclass correlation coefficient, 0.30], Gemini [intraclass correlation coefficient, 0.61], and MedLM [intraclass correlation coefficient, 0.58]), with lower agreement for expert responses (intraclass correlation coefficient, 0.23). A similar pattern was observed for safety and source-discrimination ratings. Agreement was strongest for safety assessments and weakest for source discrimination.

CONCLUSIONS: Among the 3 tested LLMs, GPT-4 demonstrated the closest agreement with expert decisions, showing the greatest potential for supporting hypertension management. Despite this potential, current LLM versions remain inferior to expert recommendations. Human-in-the-loop supervision remains essential when deploying LLMs for clinical decision-making.

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)