This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Performance of Large Language Models in Analyzing Common Hypertension Scenarios
Citations: 3
Authors: 9
Year: 2025
Abstract
BACKGROUND: Hypertension, the leading cause of cardiovascular mortality, remains suboptimally controlled. Large language models (LLMs) could improve hypertension control by augmenting clinical decision-making, but their reliability for guideline-driven tasks is unverified. This study evaluated the accuracy and safety of hypertension management recommendations generated by 3 LLMs. METHODS: Fifty-one vignettes were constructed and submitted to the LLMs (GPT-4, Gemini, Medical Large Language Model [by Google; MedLM]) and a hypertension expert to generate responses. Three blinded reviewers rated each response on a 4-point accuracy scale and a binary safety (safe/unsafe) scale, and attempted to identify the source (LLM versus expert) of the response. RESULTS: GPT-4 had the highest accuracy (83%) and safety (86%) scores among the LLMs but remained inferior to expert responses (92% accuracy, 93% safety). Gemini and MedLM performed significantly worse (accuracy: 64% and 35%; safety: 73% and 39%, respectively). GPT-4 generated the most guideline-concordant responses (46%) among the 3 LLMs (Gemini 35%, MedLM 14%) but fewer than the expert (68%). Interrater reliability for accuracy ratings was higher for LLM-generated responses (GPT-4 [intraclass correlation coefficient, 0.30], Gemini [intraclass correlation coefficient, 0.61], and MedLM [intraclass correlation coefficient, 0.58]) than for expert responses (intraclass correlation coefficient, 0.23). A similar pattern was observed for safety and source discrimination ratings; agreement was strongest for safety assessments and weakest for source discrimination. CONCLUSIONS: Among the 3 tested LLMs, GPT-4 demonstrated the closest agreement with expert decisions, showing the greatest potential for supporting hypertension management. Despite this potential, current LLM versions remain inferior to expert recommendations. Human-in-the-loop supervision remains essential when deploying LLMs for clinical decision-making.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,652 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,567 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,083 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,856 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations