This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
From Guidelines to Real-Time Conversation: Expert-Validated Retrieval-Augmented and Fine-Tuned GPT-4 for Hepatitis C Management
Citations: 3
Authors: 10
Year: 2025
Abstract
BACKGROUND AND AIMS: Advances in artificial intelligence, particularly large language models (LLMs), hold promise for transforming the management of chronic diseases such as hepatitis C virus (HCV) infection. This study evaluates the impact of retrieval-augmented generation (RAG) and supervised fine-tuning (SFT) on open-ended question answering (accuracy and clarity) and on LLM-recommended treatment regimens for clinical scenarios.
METHODS: We employed OpenAI's GPT-4 Turbo in four configurations (baseline, RAG-Top1, RAG-Top10 and SFT), using the 2020 EASL HCV guidelines as external knowledge or fine-tuning data. For the question set, guidelines were segmented at the paragraph level and encoded into 3072-dimensional embeddings. Fifteen questions covering general, patient and physician perspectives were scored by four experts on a 10-point accuracy scale and on binary accuracy and clarity. Separately, we created 25 simulated clinical scenarios; a consensus of four hepatologists defined the gold-standard direct-acting antiviral (DAA) regimens. Model performance on these cases was measured by two metrics: 'partial accuracy' (at least one correct DAA without errors) and 'complete accuracy' (all correct DAAs without errors).
RESULTS: On open-ended questions, RAG-Top10 outperformed baseline in accuracy (91.7% vs. 36.6%; p < 0.001) and clarity (91.7% vs. 46.6%; p < 0.001). RAG-Top1 achieved 81.7% accuracy and 86.6% clarity (both p < 0.001), while SFT reached 71.7% accuracy and 88.3% clarity (p < 0.001). RAG-Top10 also performed best on the clinical scenarios, prescribing the expert-consensus DAA regimen in 76% of cases (vs. 24% for the baseline model; p < 0.001).
CONCLUSIONS: Both RAG-Top10 and SFT markedly enhance LLM performance in guideline-driven HCV management, improving not only response accuracy and clarity but also DAA selection in clinical scenarios. RAG-Top10's broader context retrieval confers the greatest gains, while SFT underscores the value of domain-specific alignment. Rigorous, expert-informed evaluation frameworks are essential for the safe integration of LLMs into clinical practice.
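The two scenario metrics named in the abstract can be illustrated with a short sketch. This is not the authors' scoring code. It assumes each scenario has a single gold-standard set of drugs, that "without errors" means the model named no drug outside that set, and that a completely correct answer also counts as partially correct. The function names and the example cases are hypothetical and are not data from the study.

```python
from typing import Iterable, List, Tuple


def score_regimen(predicted: Iterable[str], gold: Iterable[str]) -> dict:
    """Score one scenario under the assumed reading of the two metrics."""
    pred = {drug.strip().lower() for drug in predicted}
    ref = {drug.strip().lower() for drug in gold}
    no_errors = not (pred - ref)           # nothing outside the gold regimen
    partial = bool(pred & ref) and no_errors
    complete = bool(ref) and pred == ref   # every gold drug, and nothing else
    return {"partial": partial, "complete": complete}


def accuracy_rates(cases: List[Tuple[List[str], List[str]]]) -> dict:
    """Fraction of scenarios that meet each metric."""
    scores = [score_regimen(pred, gold) for pred, gold in cases]
    n = len(scores)
    return {
        "partial": sum(s["partial"] for s in scores) / n,
        "complete": sum(s["complete"] for s in scores) / n,
    }


# Made-up cases for illustration only (model answer, gold regimen).
gold = ["sofosbuvir", "velpatasvir"]
cases = [
    (["Sofosbuvir", "Velpatasvir"], gold),  # complete (and therefore partial)
    (["sofosbuvir"], gold),                 # partial only
    (["sofosbuvir", "glecaprevir"], gold),  # contains an error: neither
    ([], gold),                             # no drug named: neither
]
rates = accuracy_rates(cases)
print(rates)  # {'partial': 0.5, 'complete': 0.25}
```

Comparing normalized sets keeps the scoring independent of drug order and capitalization. A real evaluation would also need to handle brand names, fixed-dose combinations and scenarios with more than one acceptable regimen.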
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,578 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,470 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,984 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,814 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Authors
Institutions
- University of Trieste (IT)
- Yale University (US)
- Humanitas University (IT)
- IRCCS Humanitas Research Hospital (IT)
- University Hospital of Geneva (CH)
- Geneva College (US)
- Azienda Socio Sanitaria Territoriale Grande Ospedale Metropolitano Niguarda (IT)
- University of Milano-Bicocca (IT)
- Hospital Clínic de Barcelona (ES)
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (ES)
- Consorci Institut D'Investigacions Biomediques August Pi I Sunyer (ES)
- Universitat de Barcelona (ES)
- Inserm (FR)
- Université Paris-Est Créteil (FR)
- Assistance Publique – Hôpitaux de Paris (FR)
- Institut Mondor de Recherche Biomédicale (FR)
- Hôpital Paul-Brousse (FR)