This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Diagnostic Performance of a Large Language Model (ChatGPT-4o) in Chronic Rhinosinusitis CT Scan Interpretation

2026 · 0 citations · Laryngoscope Investigative Otolaryngology · Open Access

Abstract

Background: Large language models (LLMs), such as ChatGPT, are increasingly utilized by physicians for clinical decision support due to their ease of use and versatility. However, their performance in diagnostic imaging remains largely untested. This study prospectively evaluates ChatGPT's ability to interpret sinus computed tomography (CT) scans for chronic rhinosinusitis (CRS), using radiologist assessment as the reference standard.

Methods: In this prospective cohort study, 102 coronal sinus CT scans were evaluated by both a board-certified radiologist and ChatGPT-4o. Each scan was screen recorded and uploaded twice to ChatGPT to assess repeatability, resulting in 306 total interpretations. The radiologist reviewed the same screen recordings provided to ChatGPT. Both raters assessed 11 predefined binary anatomical features and generated Lund-Mackay scores. Diagnostic performance was assessed using standard accuracy metrics, and inter-rater agreement was evaluated using established reliability coefficients.

Results: ChatGPT demonstrated variable performance across anatomical features. Sensitivity ranged from 0.00 to 0.89, and specificity from 0.26 to 0.95. The model demonstrated relatively high sensitivity for mucosal thickening (0.84) and sinus expansion (0.73), as well as strong agreement with the radiologist for the lamina papyracea (AC1 = 0.92) and anterior ethmoid artery (AC1 = 0.77). However, performance was poor for air-fluid levels and bone thinning. Agreement with the radiologist was low across most features (AC1 < 0.4 in 82% of variables), and repeatability between ChatGPT versions was limited (mean AC1 = 0.29). Correlation between runs for Lund-Mackay scores was weak (r = 0.11), and agreement with the radiologist was poor (ICC < 0.07).

Conclusion: ChatGPT demonstrates partial capability in identifying specific sinus CT findings; however, it lacks overall diagnostic consistency. Human radiologists remain essential, and the clinical use of LLMs in imaging should be approached with caution.
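
The abstract reports sensitivity, specificity, and AC1 values for binary features but does not say how they were computed. The following is a minimal sketch, assuming the AC1 values refer to Gwet's first-order agreement coefficient for two raters and two categories. The function names and the toy ratings are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(reference: Sequence[int], test: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity and specificity of `test` against `reference` (1 = present, 0 = absent)."""
    if len(reference) != len(test) or not reference:
        raise ValueError("rating lists must be non-empty and of equal length")
    pairs = list(zip(reference, test))
    tp = sum(1 for r, t in pairs if r == 1 and t == 1)
    fn = sum(1 for r, t in pairs if r == 1 and t == 0)
    tn = sum(1 for r, t in pairs if r == 0 and t == 0)
    fp = sum(1 for r, t in pairs if r == 0 and t == 1)
    # Undefined (NaN) when the reference contains no positives or no negatives.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def gwet_ac1(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Gwet's AC1 for two raters and a binary (0/1) feature."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("rating lists must be non-empty and of equal length")
    # Observed agreement: share of cases where both raters gave the same label.
    p_observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Mean prevalence of the "present" label across both raters.
    pi = (sum(rater_a) + sum(rater_b)) / (2 * n)
    # Chance agreement for two categories; it never exceeds 0.5.
    p_chance = 2 * pi * (1 - pi)
    return (p_observed - p_chance) / (1 - p_chance)


# Invented toy ratings for one binary feature (not study data).
radiologist = [1, 1, 1, 0, 0, 0, 0, 1]
model = [1, 0, 1, 0, 1, 0, 0, 1]

sens, spec = sensitivity_specificity(radiologist, model)
print(sens, spec)                    # 0.75 0.75
print(gwet_ac1(radiologist, model))  # 0.5
```

In this sketch the radiologist's ratings serve as the reference for sensitivity and specificity, while AC1 treats both raters symmetrically, so the same function could in principle be applied to two ChatGPT runs to gauge repeatability.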

Topics

Sinusitis and nasal conditions · Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills
