OpenAlex · Updated hourly · Last updated: 13.04.2026, 01:39

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Benchmarking agreement between large language models and published clinical trial conclusions across four artificial intelligence platforms

2026 · 0 citations · Scientific Reports · Open Access

Citations: 0

Authors: 9

Year: 2026

Abstract

Advanced large language models (LLMs) such as ChatGPT, Gemini, Grok3, and Claude offer new possibilities for medical research interpretation and clinical decision support. While these models demonstrate remarkable natural language processing capabilities, their ability to independently reason through clinical trial data and produce conclusions consistent with published trial interpretations remains underexplored. The objective was to evaluate the reliability of LLMs in interpreting numerical and statistical healthcare data. For this study, landmark randomized controlled trials (RCTs) were selected as a standardized domain to minimize bias from poor-quality research designs. Twenty landmark RCTs from the New England Journal of Medicine were analyzed in the neurosurgical and cardiovascular intervention domains. Four AI platforms were evaluated using a structured prompt covering five domains: evidence interpretation, statistical understanding, clinical relevance, limitation recognition, and practical applicability. Two independent raters scored all AI outputs on a 0–5 scale per domain, and interobserver reliability was assessed. Primary outcomes included concordance with published trial conclusions, accuracy of primary outcome identification, and appropriateness of recommendations. Secondary outcomes included output pattern analysis, recognition of limitations, and handling of confounding factors. ChatGPT demonstrated the highest concordance with published conclusions at 100%, followed by Gemini at 84%, Grok3 at 72%, and Claude at 68%. However, these concordance scores should be interpreted cautiously, as the LLMs may have been trained on these published trials, potentially inflating alignment with published conclusions. ChatGPT and Gemini accurately identified limitations and confounding factors, while Grok3 and Claude struggled on these secondary outcomes. Interobserver reliability between raters was good (Cronbach's α = 0.868), supporting the consistency of scoring. Certain LLMs, particularly ChatGPT and Gemini, can reliably interpret clinical trial data and align closely with human conclusions, suggesting a potential role in data summarization and evidence synthesis. These findings highlight the importance of selecting AI tools carefully and point to potential applications in research and clinical workflows.
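The interobserver reliability reported above (Cronbach's α = 0.868) is a standard internal-consistency statistic. As a minimal sketch, assuming two raters are treated as the "items" in the usual Cronbach formula, it can be computed as follows; the scores below are illustrative placeholders, not the study's actual data:

```python
def cronbach_alpha(ratings):
    """Cronbach's alpha for k raters ("items"), each scoring the same n outputs.

    ratings: list of per-rater score lists, all of equal length.
    """
    k = len(ratings)                     # number of raters
    n = len(ratings[0])                  # number of scored outputs

    def var(xs):                         # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var(r) for r in ratings)               # sum of per-rater variances
    totals = [sum(r[i] for r in ratings) for i in range(n)]  # per-output total scores
    return k / (k - 1) * (1 - item_vars / var(totals))

# Hypothetical example: two raters scoring five AI outputs on the 0-5 scale.
rater_a = [5, 4, 3, 5, 2]
rater_b = [5, 3, 3, 4, 2]
print(round(cronbach_alpha([rater_a, rater_b]), 3))
```

Values of α above roughly 0.8 are conventionally read as "good" agreement, which is how the study characterizes its 0.868.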

Similar works