This is an overview page with metadata for this scientific work. The full article is available from the publisher.
CARMINA: optimizing low-parameter language models for high-quality cardiovascular research assistance
Citations: 0
Authors: 3
Year: 2026
Abstract
Introduction: Large language models (LLMs) and their use in chatbots have demonstrated impressive capabilities in biomedical contexts [1]; however, hallucinations, privacy concerns, and substantial computational requirements limit their widespread implementation in resource-constrained environments. Current approaches either sacrifice performance for efficiency, require prohibitive computational resources, or impose per-word usage fees.

Purpose: We developed and validated CARMINA (Cardiovascular And Research-driven Molecular Insight with Novel Assistant), a specialized biomedical assistant powered by smaller, resource-efficient, open-source language models. We hypothesized that carefully optimized Retrieval-Augmented Generation (RAG) systems using models with fewer parameters (≤7B) could achieve performance comparable to larger models while maintaining or even improving factual accuracy and scientific rigor in cardiovascular research applications.

Methods: We constructed a comprehensive biomedical RAG system using four language models: llama3.1:7b, gemma2:2b, qwen2:7b, and phi3:3.8b [2–5]. The models were coupled with a MongoDB vector database containing 650,000 indexed PubMed cardiology-related abstracts and the GTE-large embedding model [6]. We optimized the system through prompt engineering to reduce hallucinations and provide source citations. For benchmarking, we developed a questionnaire of ~250 questions extracted from scientific abstracts using llama3.1. The questions were tailored to assess the groundedness, relevance, and context-independence [7,8] of the answers provided by CARMINA. Model responses were systematically evaluated by an independent language model (llama3.1:7b) for accuracy, completeness, reference quality, and clarity, varying the number of retrieved context documents (1–5 papers).

Results: Our benchmarking demonstrated that qwen2:7b is the most consistent model across all evaluation metrics [Figure 1].
All models acknowledged their lack of information by answering "I don't know" whenever needed and provided relevant references for their responses. The optimized RAG architecture significantly reduced hallucination rates compared to standard implementations. Furthermore, using larger open-source models did not substantially improve performance.

Conclusion: CARMINA shows that small language models, when equipped with specialized RAG workflows and optimization techniques, can provide research assistance that is more reliable than that of non-specialized larger models. This approach offers a solution for resource-limited environments while maintaining scientific accuracy and guaranteeing privacy. In future work, we plan to address the limitations of automated benchmarking methodologies and the inherent risks associated with using LLMs as evaluators [9,10].
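The retrieval-and-prompting flow the Methods section describes (embed the query, rank stored abstracts by vector similarity, build a grounded prompt that cites sources and permits "I don't know") can be sketched minimally as follows. This is an illustrative stand-in, not the authors' implementation: the embedding model, the MongoDB vector store, and the LLM call are replaced by a toy in-memory index, and all names (`retrieve`, `build_prompt`, the example PMIDs) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=3):
    """Return the top-k (pmid, text) pairs ranked by similarity to the query."""
    scored = sorted(index, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [(d["pmid"], d["text"]) for d in scored[:k]]

def build_prompt(question, contexts):
    """Grounded prompt: answer only from context, cite PMIDs, allow 'I don't know'."""
    ctx = "\n".join(f"[PMID {pmid}] {text}" for pmid, text in contexts)
    return (
        "Answer using ONLY the context below. Cite the PMID of every source "
        "you use. If the context is insufficient, answer \"I don't know\".\n\n"
        f"Context:\n{ctx}\n\nQuestion: {question}\nAnswer:"
    )

# Toy two-document index standing in for the 650,000-abstract vector database.
index = [
    {"pmid": "111", "vec": [1.0, 0.0], "text": "Statins reduce LDL cholesterol."},
    {"pmid": "222", "vec": [0.0, 1.0], "text": "Beta-blockers lower heart rate."},
]
top = retrieve([0.9, 0.1], index, k=1)
prompt = build_prompt("How do statins act?", top)
```

Varying `k` between 1 and 5 corresponds to the 1–5 retrieved papers swept in the benchmark.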
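The evaluation step, in which an independent model (llama3.1:7b) scores each response for accuracy, completeness, reference quality, and clarity, follows the common LLM-as-judge pattern. A hedged sketch under assumed interfaces: the judge call is any `prompt -> str` callable (stubbed here), and the rubric wording and JSON score format are illustrative, not taken from the paper.

```python
import json

# Rubric prompt asking the judge model for per-criterion 1-5 scores as JSON.
RUBRIC = (
    "Rate the ANSWER to the QUESTION on a 1-5 scale for each criterion: "
    "accuracy, completeness, reference_quality, clarity. "
    "Reply with a JSON object only.\n\nQUESTION: {q}\nANSWER: {a}"
)

def judge(question, answer, call_model):
    """Score one answer; `call_model` is any callable prompt -> str."""
    raw = call_model(RUBRIC.format(q=question, a=answer))
    scores = json.loads(raw)
    criteria = {"accuracy", "completeness", "reference_quality", "clarity"}
    assert criteria <= scores.keys(), "judge reply missing criteria"
    scores["mean"] = sum(scores[c] for c in criteria) / len(criteria)
    return scores

# Stubbed judge for demonstration; in practice this would call the judge LLM.
fake_judge = lambda prompt: (
    '{"accuracy": 5, "completeness": 4, "reference_quality": 5, "clarity": 4}'
)
result = judge("What do statins do?", "They lower LDL [PMID 111].", fake_judge)
```

Parsing the judge's reply as strict JSON and asserting the criteria are present is one simple guard against the evaluator-reliability risks the Conclusion flags [9,10].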
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,436 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,311 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,753 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,523 citations