OpenAlex · Updated hourly · Last updated: 13 May 2026, 21:37

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating large language models for evidence-based clinical question answering

2026 · 0 citations · Patterns · Open Access
Open full text at publisher

Citations: 0 · Authors: 2 · Year: 2026

Abstract

Large language models show potential in clinical applications, yet their reliability for evidence-based medicine requires rigorous evaluation. We curated a multi-source benchmark of more than 20,000 question-answer pairs from systematic reviews and clinical guidelines to assess the performance of GPT-5, GPT-4o-mini, Claude 4, and DeepSeek-v3. Accuracy was highest with structured guidelines (90%), lower with narrative sources (70%), and lowest with systematic reviews (50%-60%). All models struggled with ambiguous evidence. We found that higher citation counts for source material correlated with increased accuracy, and we observed moderate geographic variation in performance. However, accuracy did not vary significantly by publication year or domain prevalence. Retrieval-augmented generation bolstered performance; providing the top three PubMed-retrieved articles yielded a 23% accuracy gain. These patterns were consistent across models, demonstrating that source clarity and targeted retrieval drive performance. We conclude that stratified evaluation and retrieval strategies are essential for ensuring factual alignment and reliability in high-stakes clinical decision-making.
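The abstract reports only that supplying the top three PubMed-retrieved articles improved accuracy by 23%; the retrieval pipeline itself is not described on this page. As a rough illustration of that setup, the minimal Python sketch below pulls the three most relevant PubMed abstracts via NCBI's public E-utilities (esearch/efetch) and prepends them to a clinical question. The function names, prompt wording, and relevance sort are assumptions for illustration, not the authors' method.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def fetch_top_pubmed_abstracts(question: str, k: int = 3) -> str:
    """Retrieve plain-text abstracts of the k top-ranked PubMed hits."""
    # esearch: find the top-k PMIDs matching the question text
    search = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": question, "retmax": k,
                "sort": "relevance", "retmode": "json"},
        timeout=30,
    ).json()
    pmids = search["esearchresult"]["idlist"]
    if not pmids:
        return ""
    # efetch: download the plain-text abstracts for those PMIDs
    return requests.get(
        f"{EUTILS}/efetch.fcgi",
        params={"db": "pubmed", "id": ",".join(pmids),
                "rettype": "abstract", "retmode": "text"},
        timeout=30,
    ).text

def build_rag_prompt(question: str) -> str:
    """Prepend retrieved evidence to the clinical question (hypothetical prompt)."""
    evidence = fetch_top_pubmed_abstracts(question, k=3)
    return (
        "Answer the clinical question using only the evidence below.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    print(build_rag_prompt(
        "Do statins reduce all-cause mortality in primary prevention?"
    ))
```

The resulting prompt would then be passed to any of the evaluated models; how the paper ranks retrieved articles or formats the evidence is not specified here.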

Topics

Topic Modeling · Artificial Intelligence in Healthcare and Education · Biomedical Text Mining and Ontologies