OpenAlex · Aktualisierung stündlich · Letzte Aktualisierung: 12.05.2026, 06:12

Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review

2024·24 Zitationen·BMC Medical Informatics and Decision MakingOpen Access
Volltext beim Verlag öffnen

24

Zitationen

9

Autoren

2024

Jahr

Abstract

BACKGROUND: The large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. METHODS: We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans. RESULTS: We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". CONCLUSIONS: The most frequently used criteria for defining high-quality LLMs have been consistently selected by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess the quality of LLM outputs can be developed to facilitate research studies on LLMs in healthcare.

Ähnliche Arbeiten