
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Automating Evaluation of LLM-generated Responses to Patient Questions about Rare Diseases

2025 · 0 citations · JAMIA Open · Open Access

Citations: 0 · Authors: 7 · Year: 2025

Abstract

Objectives: Patients with rare diseases often struggle to find accurate medical information, and large language model (LLM)-based chatbots may help meet this need. However, evaluating LLM-generated free-text answers typically requires physician review, which is time-consuming and difficult to scale. This study compared traditional natural language processing (NLP) metrics with emerging LLM-based evaluation approaches for assessing answer quality in the context of complex lymphatic anomalies (CLAs).

Materials and Methods: We compiled 25 common patient questions about CLAs and generated 175 responses to these questions from seven LLMs. Three expert physicians scored these responses for accuracy. We compared the physician-assigned scores with automated scores generated by four NLP sentence similarity metrics (BLEU, ROUGE, METEOR, BERTScore) and six LLM evaluators (GPT-4, GPT-4o, Qwen3-32B, DeepSeek-R1-14B, Gemma3-27B, LLaMA3.3-70B). We examined LLM-based scoring both with and without reference answers (reference-guided vs. reference-free). We calculated Spearman, Phi, and Kendall's Tau correlation coefficients to assess alignment between automated and physician-assigned scores.

Results: = 0.240-0.403). Reference-guided scoring outperformed reference-free methods.

Discussion: Reference-guided LLM-based evaluation methods approximate expert physicians' judgment better than traditional NLP metrics, offering an effective, scalable approach for assessing LLM-generated responses to patient questions about rare diseases.

Conclusion: LLM-based evaluation, particularly reference-guided scoring with GPT models, can support the scalable development and evaluation of LLM-based rare disease-specific chatbot systems.
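
The alignment analysis named in the abstract (Spearman, Kendall's Tau, and Phi correlations between automated and physician-assigned scores) can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names, the tau-b tie handling, the 0/1 binarization assumed for Phi, and the sample scores are all assumptions made for illustration, since the abstract does not specify the scoring scale or how ties were treated.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b (assumed variant), which corrects for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


def phi(x, y):
    """Phi coefficient for two 0/1 sequences (binarization is an assumption)."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical scores for six responses (not data from the study).
physician_scores = [5, 4, 4, 2, 1, 3]
automated_scores = [4, 5, 3, 2, 1, 3]
print("Spearman:", spearman(physician_scores, automated_scores))
print("Kendall tau-b:", kendall_tau_b(physician_scores, automated_scores))
```

In practice a statistics library would normally be used instead; the hand-written versions above only make the definitions explicit.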

Similar works