Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Automating Evaluation of LLM-generated Responses to Patient Questions about Rare Diseases
0
Zitationen
7
Autoren
2025
Jahr
Abstract
Objectives: Patients with rare diseases often struggle to find accurate medical information, and large language model (LLM)-based chatbots may help meet this need. However, evaluating LLM-generated free-text answers typically requires physician review, which is time-consuming and difficult to scale. This study compared traditional natural language processing (NLP) metrics to emerging LLM-based evaluation approaches for assessing answer quality in the context of Complex Lymphatic Anomalies (CLAs). Materials and Methods: We compiled 25 common patients' questions about CLAs and generated 175 responses to these questions from seven LLMs. Three expert physicians scored these responses for accuracy. We compared these physician-assigned scores with automated scores, generated by four NLP sentence similarity metrics (BLEU, ROUGE, METEOR, BERTScore) and six LLM evaluators (GPT-4, GPT-4o, Qwen3-32B, DeepSeek-R1-14B, Gemma3-27B, LLaMA3.3-70B). We examined both LLM-based scoring with and without reference answers (reference-guided vs reference-free). We calculated Spearman, Phi, and Kendall's Tau correlation coefficients to assess alignment between automated and physician-assigned scores. Results: = 0.240-0.403). Reference-guided scoring outperformed reference-free methods. Discussion: Reference-guided LLM-based evaluation methods approximate expert physicians' judgment better than traditional NLP metrics, offering an effective, scalable approach for assessing LLM-generated responses to patient questions about rare disease. Conclusion: LLM-based evaluation, particularly reference-guided scoring with GPT models, can support the scalable development and evaluation of LLM-based rare disease-specific chatbot systems.
Ähnliche Arbeiten
Trimmomatic: a flexible trimmer for Illumina sequence data
2014 · 68.989 Zit.
Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology
2015 · 31.802 Zit.
BEDTools: a flexible suite of utilities for comparing genomic features
2010 · 30.201 Zit.
HTSeq—a Python framework to work with high-throughput sequencing data
2014 · 22.569 Zit.
A global reference for human genetic variation
2015 · 19.812 Zit.