This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Benchmarking agreement between large language models and published clinical trial conclusions across four artificial intelligence platforms
Citations: 0 · Authors: 9 · Year: 2026
Abstract
Advanced large language models (LLMs) such as ChatGPT, Gemini, Grok3, and Claude offer new possibilities for medical research interpretation and clinical decision support. While these models demonstrate remarkable natural language processing capabilities, their ability to independently reason through clinical trial data and produce conclusions consistent with published trial interpretations remains underexplored. The objective was to evaluate the reliability of LLMs in interpreting numerical and statistical healthcare data. For this study, landmark randomized controlled trials (RCTs) were selected as a standardized domain to minimize bias from poor-quality research designs. Twenty landmark RCTs from the New England Journal of Medicine were analyzed in neurosurgical and cardiovascular intervention domains. Four AI platforms were evaluated using a structured prompt covering five domains: evidence interpretation, statistical understanding, clinical relevance, limitation recognition, and practical applicability. Two independent raters scored all AI outputs on a 0–5 scale per domain, and interobserver reliability was assessed. Primary outcomes included concordance with published trial conclusions, accuracy of primary outcome identification, and appropriateness of recommendations. Secondary outcomes included output pattern analysis, recognition of limitations, and handling of confounding factors. ChatGPT demonstrated the highest concordance with published conclusions at 100.0%, followed by Gemini at 84%, Grok3 at 72%, and Claude at 68%. However, these concordance scores should be interpreted cautiously, as the LLMs may have been trained on these published trials, potentially inflating alignment with published conclusions. ChatGPT and Gemini accurately identified limitations and confounding factors, while Grok3 and Claude struggled in these secondary outcomes. Interobserver reliability between raters was good (Cronbach’s α = 0.868), supporting the consistency of scoring. 
Certain LLMs, particularly ChatGPT and Gemini, can reliably interpret clinical trial data and align closely with human conclusions, suggesting a potential role in data summarization and evidence synthesis. These findings underscore the importance of selecting AI tools carefully and point to potential applications in research and clinical workflows.
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,436 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,311 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,753 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,523 citations