This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Clinical Reliability of Large Language Models in Complex Haematology: A Multidimensional Evaluation in Hemophilia–Oncology
Citations: 0
Authors: 5
Year: 2026
Abstract
Background: The co-existence of hemophilia and cancer presents one of the most complex clinical scenarios, demanding individualised therapeutic planning to balance oncologic efficacy and hemostatic safety. This study evaluated the ability of two Large Language Models (LLMs), ChatGPT (GPT-4) and Microsoft Copilot (GPT-4-based), to generate clinically appropriate recommendations for real cases of hemophilia with concurrent malignancy.
Methods: Six consecutive adult cases of hemophilia and cancer, managed at the Hemophilia Centre of Padua, Italy, were selected for evaluation. Identical structured prompts were submitted to both LLMs. Two independent expert clinicians rated the model outputs across five domains (Decision/Rationale, Strategy, Selected Drug, Regimen, and Assessment) using a four-level ordinal scale.
Results: The LLMs performed unevenly across domains. Outputs were consistently rated as highly reliable in domains involving high-level synthesis, such as Assessment and Strategy. However, substantial limitations were observed in the clinically demanding domains of Selected Drug and Regimen. Critically, in the Selected Drug domain, the two expert raters did not reach complete agreement for either system. This severe lack of concordance signifies that clinicians assigned different adequacy ratings to the same output in every case, reflecting ambiguity, lack of specificity, and inconsistent clinical interpretability of the drug-related information provided by LLMs.
Conclusions: While LLMs possess the capacity for high-level reasoning and strategic planning, their inability to translate principles into precise, consistent, and clinically interpretable therapeutic plans, particularly regarding drug selection and treatment regimens, is a significant constraint. These deficiencies, highlighted by the minimal expert concordance in critical domains, necessitate rigorous clinical validation before LLMs can be responsibly integrated into the management of this uniquely vulnerable patient population.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations