This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Automating Evaluation of LLM-generated Responses to Patient Questions about Rare Diseases

2025 · 0 citations · JAMIA Open · Open Access

Abstract

Objectives: Patients with rare diseases often struggle to find accurate medical information, and large language model (LLM)-based chatbots may help meet this need. However, evaluating LLM-generated free-text answers typically requires physician review, which is time-consuming and difficult to scale. This study compared traditional natural language processing (NLP) metrics with emerging LLM-based evaluation approaches for assessing answer quality in the context of Complex Lymphatic Anomalies (CLAs).

Materials and Methods: We compiled 25 common patient questions about CLAs and generated 175 responses to these questions from seven LLMs. Three expert physicians scored these responses for accuracy. We compared the physician-assigned scores with automated scores generated by four NLP sentence similarity metrics (BLEU, ROUGE, METEOR, BERTScore) and six LLM evaluators (GPT-4, GPT-4o, Qwen3-32B, DeepSeek-R1-14B, Gemma3-27B, LLaMA3.3-70B). We examined LLM-based scoring both with and without reference answers (reference-guided vs reference-free). We calculated Spearman, Phi, and Kendall's Tau correlation coefficients to assess alignment between automated and physician-assigned scores.

Results: [...] = 0.240-0.403). Reference-guided scoring outperformed reference-free methods.

Discussion: Reference-guided LLM-based evaluation methods approximate expert physicians' judgment better than traditional NLP metrics, offering an effective, scalable approach for assessing LLM-generated responses to patient questions about rare diseases.

Conclusion: LLM-based evaluation, particularly reference-guided scoring with GPT models, can support the scalable development and evaluation of LLM-based rare disease-specific chatbot systems.
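The alignment step described in the methods, correlating automated scores with physician-assigned scores, can be illustrated with a minimal sketch of the Spearman rank correlation. This is not the authors' code: the function names (`average_ranks`, `pearson`, `spearman`) and the toy score lists are hypothetical and chosen only for illustration, and the abstract does not state how ties or multiple raters were handled.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data, NOT taken from the study:
physician_scores = [5, 3, 4, 1, 2]            # e.g. expert accuracy ratings
automated_scores = [0.9, 0.4, 0.7, 0.1, 0.3]  # e.g. an automated metric

rho = spearman(physician_scores, automated_scores)
print(f"Spearman rho = {rho:.3f}")
```

A higher rho means the automated scorer orders the responses more like the physicians do. The sketch raises a division-by-zero error if either list is constant, and in practice an established statistics library would normally be preferred over a hand-rolled version.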

Topics

Genomics and Rare Diseases
Artificial Intelligence in Healthcare and Education
AI in Service Interactions
