This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
The Retrieval-Reasoning Paradox: Evidence Disengagement in a Reasoning-Augmented Language Model
Citations: 0
Authors: 1
Year: 2026
Abstract
Retrieval-Augmented Generation (RAG) grounds language model outputs in external evidence, while reasoning-augmented models produce extended chain-of-thought (CoT) traces to strengthen multi-step problem solving. The prevailing assumption in production deployments is that these two capabilities are complementary. This paper challenges that assumption through a controlled experiment. On the HotpotQA multi-hop question-answering benchmark with BM25 retrieval, I compare two models from the same family that differ only in whether extended reasoning is enabled: Gemini 2.5 Flash (standard) and Gemini 2.5 Pro with thinking mode (reasoning). Enabling reasoning mode reduces exact-match (EM) accuracy from 37.6% to 13.6%, a 64% relative degradation, even though the responses are 56% longer. To diagnose this failure, I introduce three metrics: Evidence Utilization Rate (EUR), Parametric Override Frequency (POF), and Reasoning-Evidence Alignment Score (REAS). The reasoning model exhibits significantly lower EUR (p=0.011) and REAS (p<0.001), indicating systematic disengagement from the retrieved documents. A no-retrieval baseline reveals that providing documents to the reasoning model yields ΔEM = -0.005; that is, retrieval offers no benefit. Even when retrieval succeeds (gold answer present in the top-3 passages), the reasoning model reaches only 22.2% EM, well below the standard model's overall 37.6%. Decomposition of the output shows that 81% of the response consists of reasoning-chain text that is less grounded in the evidence than the final answer. I term this the retrieval-reasoning paradox: extended CoT actively undermines evidence-based grounding.
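The headline numbers in the abstract rest on two simple computations: exact-match (EM) scoring and a relative change between two EM values. The following is a minimal Python sketch of both, not the author's evaluation code. The normalization step (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common question-answering evaluation practice, since the abstract does not say how answers are normalized, and the function names are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: the abstract does not specify the normalization used.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def em_score(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match their gold answers."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not predictions:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(predictions)


def relative_drop(baseline: float, treated: float) -> float:
    """Relative degradation of `treated` with respect to `baseline`."""
    return (baseline - treated) / baseline


# EM values reported in the abstract: standard 37.6%, reasoning 13.6%.
drop = relative_drop(0.376, 0.136)
print(f"{drop:.1%}")  # 63.8%, which the abstract rounds to 64%
```

The same `relative_drop` helper applies to any pair of EM values, so it can be reused to compare the reasoning model with and without retrieved documents once per-condition scores are available.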