This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Using a Diverse Test Suite to Assess Large Language Models on Fast Health Care Interoperability Resources Knowledge: Comparative Analysis
Citations: 3
Authors: 14
Year: 2025
Abstract
Background: Recent natural language processing breakthroughs, particularly with the emergence of large language models (LLMs), have demonstrated remarkable capabilities on general knowledge benchmarks. However, there is limited data on the performance and understanding of these models in relation to the Fast Healthcare Interoperability Resources (FHIR) standard. The complexity and specialized nature of FHIR present challenges for LLMs, which are typically trained on broad datasets and may have a limited understanding of the nuances required for domain-specific tasks. Improving health data interoperability can greatly benefit the use of clinical data and interaction with electronic health records.

Objective: This study presents the FHIR Workbench, a comprehensive suite of datasets designed to evaluate the ability of LLMs to understand and apply the FHIR standard.

Methods: In total, 4 evaluation datasets were created to assess the FHIR knowledge and capabilities of LLMs. These tasks include multiple-choice questions on general FHIR concepts and the FHIR Representational State Transfer (REST) application programming interface, as well as correctly identifying the resource type and generating FHIR resources from unstructured clinical patient notes. In addition, we evaluate open-source LLMs, such as Qwen 2.5 Coder and DeepSeek-V3, and commercial LLMs, including GPT-4o and Gemini 2, on these tasks in a zero-shot setting. To provide context for interpreting LLM performance, a subset of the datasets was human-evaluated by recruiting 6 participants with varying levels of FHIR expertise.

Results: Our evaluation across multiple FHIR tasks revealed nuanced performance metrics. Commercial models demonstrated exceptional capabilities, with GPT-4o achieving a 0.9990 F1-score on the FHIR-ResourceID task, 0.9400 on the FHIR-QA task, and 0.9267 on the FHIR-RESTQA task. Open-source models also demonstrated strong performance, with DeepSeek-V3 achieving 0.9400 on FHIR-QA, 0.9400 on FHIR-RESTQA, and 0.9142 on FHIR-ResourceID. Qwen 2.5 Coder-7B-Instruct demonstrated high accuracy, scoring 0.9533 on FHIR-QA and 0.8920 on FHIR-ResourceID. However, all models struggled with the Note2FHIR task, with performance ranging from 0.0382 (OLMo) to a maximum of 0.3633 (GPT-4.5-preview), highlighting the significant challenge of converting unstructured clinical text into FHIR-compliant resources. Human participants achieved accuracy scores ranging from 0.50 to 1.0 across the first 3 tasks.

Conclusions: This study highlights the competitive performance of both open-source models, such as Qwen and DeepSeek, and commercial models, such as GPT-4o and Gemini, in FHIR-related tasks. While open-source models are advancing rapidly, commercial models still have an advantage on specific, complex tasks. The FHIR Workbench offers a valuable platform for evaluating the capabilities of these models and promoting improvements in health data interoperability.
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,646 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,554 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,071 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,851 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations