This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Ask the Right Questions: Prompting Strategies Shape LLM Performance on Biliary Tract Cancer Guideline Queries
Citations: 0
Authors: 8
Year: 2026
Abstract
INTRODUCTION: This study evaluates how different prompting strategies affect the performance of three advanced large language models (LLMs), GPT-4o, Claude 3.5 Sonnet, and Llama 3 70b, when answering questions about biliary tract cancer (BTC). The European Society for Medical Oncology (ESMO) guidelines served as the reference standard. The study assesses the accuracy, conciseness, and evidence quality of the models' responses, as well as their hallucination rates.

METHODS: We conducted a cross-sectional analysis using 40 clinical questions derived from the ESMO BTC guidelines. We tested three prompting strategies: no prompt, short prompt, and long prompt. Two independent senior physicians evaluated the responses for accuracy, conciseness, and evidence quality. Inter-rater reliability, response text length, model performance, and hallucination rates were analyzed.

RESULTS: Prompting strategies significantly influenced LLM performance. Long prompts improved evidence quality and accuracy, especially for GPT-4o and Claude 3.5 Sonnet, while short prompts enhanced conciseness. GPT-4o exhibited superior overall performance, with higher accuracy and conciseness scores, whereas Claude 3.5 Sonnet excelled in evidence quality but generated longer responses. Llama 3 70b showed deficiencies in both accuracy and evidence quality. Hallucination rates were lowest for GPT-4o and Claude 3.5 Sonnet, but nearly 40% of their references were fabricated or misattributed.

CONCLUSION: Prompting strategies substantially affect LLM performance in medical contexts. While GPT-4o and Claude 3.5 Sonnet show promise when prompted appropriately, the risk of hallucinations necessitates careful cross-verification. Future studies should incorporate real-world clinical scenarios to further evaluate LLM capabilities and limitations.
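
The abstract reports that inter-rater reliability between the two physician raters was analyzed but does not name the statistic used. As a minimal sketch, assuming Cohen's kappa (one common choice for two raters, not necessarily the authors' method), agreement on categorical ratings could be computed as follows. The function name `cohens_kappa` and the example scores are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items (categorical labels)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical accuracy ratings (1 = accurate, 0 = inaccurate); NOT study data.
scores_a = [1, 1, 0, 1, 0, 0, 1, 1]
scores_b = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(scores_a, scores_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw percent agreement for the agreement expected by chance, which matters when one rating category dominates. For ordinal rating scales, a weighted kappa or an intraclass correlation coefficient may be a better fit.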