OpenAlex · Updated hourly · Last updated: 11 Apr 2026, 23:58

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

The Economics of Accuracy for Medical Reasoning with Large Language Models

2025 · 0 citations · medRxiv · Open Access
Open full text at the publisher

Citations: 0 · Authors: 1 · Year: 2025

Abstract

Deploying large language models (LLMs) in clinical settings is limited by security, reliability, latency, and accessibility concerns that favor smaller, on-device or on-premise models. However, these smaller models may struggle to meet accuracy requirements. While fine-tuning and retrieval-augmented generation (RAG) can improve domain-specific accuracy, these methods require additional labeled data, technical skill, and infrastructure. In contrast, test-time scaling (allocating extra token budget during inference) offers a training-free alternative for increasing accuracy. However, the trade-offs between these strategies, and their interaction with model size, remain poorly understood for medical reasoning.

To address this gap, we compare three approaches (test-time scaling, fine-tuning, and context grounding) using the Gemma and MedGemma families of LLMs (Gemma-3 1B, Gemma-3 4B, Gemma-3 27B, MedGemma 4B, and MedGemma 27B) and evaluate these systems on common biomedical question-answering (QA) datasets and on a set of recently released medical exam questions for which the performance of practicing clinicians is available for comparison. We test baseline prompts (direct answer, chain-of-thought, and self-consistency) and introduce a new prompting method we call "prompt-chaining for continuous reflection" (PCCR), which enforces a minimum token-generation budget at inference time. We measure accuracy and tokens generated, allowing us to investigate accuracy-efficiency trade-offs across prompting strategies, context grounding, fine-tuning, and model scales. We discover equivalency points where smaller models perform comparably to larger ones given increased reasoning budgets, context grounding, or fine-tuning. We also find inflection points where combining context grounding and test-time scaling degrades performance.
Using these empirical results, we formulate a general framework with equations to balance cost-benefit trade-offs when engineering LLM-based systems for medical reasoning and QA. We recommend generalizable configurations, designs, and patterns to achieve accuracy and efficiency objectives for example use cases relevant to healthcare organizations.

Author summary

When doctors and hospitals want to use artificial intelligence for medical tasks, they face difficult choices. The most capable AI systems are expensive to run and require sending sensitive patient and hospital data to external servers. Smaller systems that can run locally are more practical but may be less accurate. Our research asked: with what configurations can we make smaller AI systems perform as well as larger ones for medical reasoning? We tested five AI models of different sizes, including both generalist and medically specialized models, on thousands of medical questions using various prompting strategies, including a new and very simple method we developed that encourages the AI to reason more extensively. We discovered several surprising findings, which we describe in detail. Most importantly, we found that smaller, cheaper models can match larger ones when given the right combination of prompting strategy, specialized training, and supporting information. We translated these findings into practical guidelines that can help choose AI configurations that balance accuracy, speed, and cost. Our framework could help make medical AI more accessible to institutions with limited computational resources while achieving the highest possible accuracy.
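The abstract does not spell out the PCCR algorithm itself; as a rough illustration only, a prompt-chaining loop that enforces a minimum token-generation budget might be sketched as below. The `stub_generate` function, the whitespace-based token count, and the reflection wording are placeholder assumptions for this sketch, not the paper's implementation; a real system would call an on-device LLM such as Gemma-3 and use its tokenizer.

```python
def stub_generate(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned answer."""
    return "The likely diagnosis is X because of symptoms A and B."


def pccr_answer(question: str, min_tokens: int = 64,
                generate=stub_generate, max_rounds: int = 8) -> str:
    """Chain reflection prompts until at least `min_tokens` tokens
    (approximated here by whitespace splitting) have been generated,
    then request a final answer conditioned on the full transcript."""
    transcript = []
    prompt = question
    tokens_generated = 0
    for _ in range(max_rounds):
        reply = generate(prompt)
        transcript.append(reply)
        tokens_generated += len(reply.split())
        if tokens_generated >= min_tokens:
            break  # minimum reasoning budget reached
        # Chain another prompt that forces continued reflection.
        prompt = (question + "\n\nYour reasoning so far:\n"
                  + "\n".join(transcript)
                  + "\n\nReflect on this reasoning and continue.")
    return generate("Given the reasoning below, state the final answer.\n"
                    + "\n".join(transcript))
```

The key design point is that the stopping condition is a floor on tokens generated, so a model that answers tersely is pushed through additional reflection rounds rather than stopping early.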


Topics

Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare