This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Reducing Hallucinations in Medical AI Through Citation Enforced Prompting in RAG Systems
Citations: 0
Authors: 2
Year: 2026
Abstract
The safe integration of Large Language Models in clinical environments requires strict adherence to verified medical evidence. As part of the PARROT AI project, this study systematically evaluates how prompting strategies affect the reliability of Retrieval-Augmented Generation (RAG) pipelines on the MedQA USMLE benchmark (N = 500). Four prompting strategies were examined: Baseline (zero-shot), Neutral, Expert Chain-of-Thought (Expert-CoT) with structured clinical reasoning, and StrictCitations with mandatory evidence grounding. The experiments covered six modern models: Command R (35B), Gemma 2 (9B and 27B), Llama 3.1 (8B), Mistral Nemo (12B), and Qwen 2.5 (14B). Evaluation was conducted using the Deterministic RAG Evaluator, which provides an objective assessment of grounding through the Unsupported Sentence Ratio (USR), computed from TF-IDF and cosine similarity. The results indicate that structured reasoning in the Expert-CoT strategy significantly increases USR values (reaching 95–100%), as models prioritize internal diagnostic logic over verbatim context. In contrast, the StrictCitations strategy, while still showing high USR because of the conservative evaluation threshold, achieves the highest level of verifiable grounding and source adherence. The analysis identifies a statistically significant Verbosity Signal (r = 0.81, p < 0.001), in which increased response length serves as a proxy for model uncertainty and parametric leakage, a pattern particularly prominent in Llama 3.1 and Gemma 2. Overall, the findings demonstrate that prompting strategy selection is as critical for clinical reliability as model architecture. This work delivers a reproducible framework for the development of trustworthy medical AI assistants and highlights citation-enforced prompting as a vital mechanism for improving clinical safety.
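The abstract describes USR only at a high level, so the following is a minimal sketch of how such a metric could be computed, not the paper's Deterministic RAG Evaluator. It assumes that a response sentence counts as unsupported when its best TF-IDF cosine similarity to any retrieved passage falls below a threshold. The function names, the sentence splitter, the smoothed IDF formula, and the threshold value of 0.5 are all illustrative assumptions; the abstract does not state them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (dict) per token list.

    Uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, chosen here
    for illustration only.
    """
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def unsupported_sentence_ratio(answer, passages, threshold=0.5):
    """Fraction of answer sentences with no passage above the threshold.

    threshold=0.5 is a hypothetical value, not taken from the paper.
    """
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    docs = [tokenize(s) for s in sentences] + [tokenize(p) for p in passages]
    vectors = tfidf_vectors(docs)
    sentence_vecs = vectors[: len(sentences)]
    passage_vecs = vectors[len(sentences):]
    unsupported = 0
    for sv in sentence_vecs:
        best = max((cosine(sv, pv) for pv in passage_vecs), default=0.0)
        if best < threshold:
            unsupported += 1
    return unsupported / len(sentences)


passages = ["Metformin is the first-line therapy for type 2 diabetes."]
answer = (
    "Metformin is the first-line therapy for type 2 diabetes. "
    "Bananas grow in tropical climates."
)
print(unsupported_sentence_ratio(answer, passages))  # 0.5
```

In this toy case the first sentence repeats the passage verbatim and is counted as supported, while the second shares no terms with it and is counted as unsupported. A lexical measure of this kind also illustrates why paraphrased or reasoning-heavy output, such as Expert-CoT responses, can score a high USR even when the content is clinically sound.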