This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence

2026 · 0 citations · Health Informatics Journal · Open Access

Abstract

Background: Large language models (LLMs) have demonstrated promising capabilities in medical diagnostic reasoning, yet their performance in specialized clinical domains such as rheumatology remains incompletely characterized. While diagnostic accuracy has been evaluated, critical dimensions including calibration, reasoning quality, and temporal stability have not been systematically assessed across contemporary models.

Objectives: This study aimed to comprehensively evaluate and compare the diagnostic accuracy, certainty expression, reasoning quality, and hallucination rates of four state-of-the-art LLMs (ChatGPT-4, Claude 3.5, DeepSeek-V3, and Gemini 1.5 Pro) in complex rheumatologic case scenarios.

Design: A cross-sectional, analytical, and comparative study was conducted following STARD and TRIPOD guidelines, adapted for LLM evaluation. Nine complex rheumatologic cases from published case reports were evaluated at three time points (Days 1, 5, and 10) between July 1 and September 18, 2025.

Methods: Standardized clinical vignettes were submitted to each LLM under controlled experimental conditions. Two blinded senior rheumatologists independently assessed diagnostic accuracy, reasoning quality across five analytical dimensions using Likert scales, and hallucination frequency. Certainty expression and temporal stability were quantified using intraclass correlation coefficients. Correlation analyses examined relationships between reasoning quality and confidence expression.

Results: All models achieved near-perfect diagnostic accuracy, with ChatGPT, Claude, and Gemini correctly identifying the primary diagnosis in 100% of cases and DeepSeek in 88.9%. However, Spearman correlation analysis revealed uniformly weak and non-significant associations between reasoning quality and expressed certainty across all models (ρ range: -0.156 to 0.215, all p > 0.05), indicating fundamental miscalibration. ChatGPT demonstrated the highest reasoning score (3.89 ± 0.23) and lowest hallucination rate (7.4%), while Gemini showed the highest hallucination frequency (18.5%). Temporal stability was excellent for ChatGPT (ICC = 0.84) and good for DeepSeek (ICC = 0.79).

Conclusion: Despite exceptional diagnostic accuracy, current LLMs exhibit critical limitations in confidence calibration and variable hallucination rates, representing significant barriers to safe clinical deployment in rheumatology.
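The abstract reports Spearman correlations between reasoning quality and expressed certainty but does not show the computation. The following is a minimal sketch of how such a rank correlation can be computed, not the authors' analysis code. The function names `average_ranks` and `spearman_rho` and the two score lists are hypothetical and exist only for illustration; the data are invented and do not come from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical Likert-style scores for nine cases (NOT data from the study).
reasoning_scores = [4, 3, 5, 4, 3, 4, 5, 3, 4]
certainty_scores = [5, 5, 4, 5, 4, 5, 5, 5, 4]

rho = spearman_rho(reasoning_scores, certainty_scores)
print(f"Spearman rho = {rho:.3f}")
```

A rho near zero, as the abstract reports for all four models, would mean that higher-quality reasoning is not accompanied by systematically higher (or lower) expressed certainty. Ties are handled with average ranks because Likert-scale ratings repeat often. A significance test (the p-values in the abstract) would need an additional step that this sketch omits.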

Institutions

Inoue Hospital (JP)

Topics

Artificial Intelligence in Healthcare and Education
Clinical Reasoning and Diagnostic Skills
Rheumatoid Arthritis Research and Therapies
