OpenAlex · Updated hourly · Last updated: 17.05.2026, 00:48

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Low-energy small language models with retrieval-augmented generation can surpass large-model performance in rheumatology

2026 · 0 citations · Frontiers in Medicine · Open Access

7 authors

Abstract

Background: Large language models (LLMs) are increasingly explored for clinical decision support but are limited by high computational and energy demands. Smaller language models (SLMs), particularly when combined with retrieval-augmented generation (RAG), may offer a more sustainable alternative. Rheumatology, characterized by diagnostic complexity and guideline-driven management, represents a suitable test domain.

Methods: Five state-of-the-art language models (GPT-4o, Mixtral-8x7b-32768, Llama-3.1-Nemotron-70b-Instruct, Qwen-Turbo 2.5, Claude-3.5-Sonnet) were evaluated regarding their suitability for clinical decision support using ten standardized, anonymized rheumatology cases. Models were assessed with and without RAG, and with or without a predefined diagnosis. Diagnostic and therapeutic accuracy were quantified using F1 scores. Factual consistency and relevance were assessed using the Retrieval-Augmented Generation Assessment Score (RAGAS).

Results: Mixtral-8x7b-32768 with RAG achieved the highest diagnostic (72%) and therapeutic (73%) F1 scores. Nemotron-70b showed strong diagnostic performance without RAG (71%), while Qwen-Turbo performed well in therapeutic recommendations without retrieval (72%). The highest RAGAS score was observed for Mixtral with RAG (81%). Performance regarding clinical decision support varied substantially across models and configurations.

Conclusion: SLMs combined with RAG can match or exceed the performance of larger LLMs for clinical decision support while requiring significantly fewer computational resources. Despite promising results, clinically relevant errors persisted across all models, underscoring the need for expert oversight and further real-world validation.
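The abstract reports F1 scores but does not say how model outputs were matched against the reference answers. As a minimal sketch, assuming each case yields a set of predicted labels and a set of expert reference labels compared by exact match, F1 is the harmonic mean of precision and recall. The function name `f1_score` and the example labels are hypothetical and do not come from the study.

```python
def f1_score(predicted: set[str], reference: set[str]) -> float:
    """Set-based F1 between predicted and reference labels (exact match assumed)."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case; avoids division by zero.
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(reference)     # share of reference labels that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical case: the model proposes three diagnoses, two of which are correct.
predicted = {"rheumatoid arthritis", "psoriatic arthritis", "gout"}
reference = {"rheumatoid arthritis", "gout"}
score = f1_score(predicted, reference)  # precision 2/3, recall 1
print(f"F1 = {score:.2f}")
```

Under this assumption, a model that lists extra diagnoses loses precision, while one that misses a reference diagnosis loses recall. Any fuzzy or expert-judged matching would change the counts but not the formula.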

Similar works