This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Retrieval‐augmented ChatGPT‐4o improves accuracy but reduces readability in hip arthroscopy patient education
2 citations · 8 authors · 2025
Abstract
PURPOSE: To compare the accuracy, readability and patient-centredness of responses generated by standard ChatGPT-4o and its retrieval-augmented 'Deep Research' mode for hip arthroscopy education, addressing the current uncertainty about the reliability of large language models in orthopaedic patient information.

METHODS: Thirty standardised patient questions were derived through structured searches of reputable orthopaedic health information websites. Both ChatGPT configurations independently generated responses. Two fellowship-trained orthopaedic surgeons assessed each response independently, using 5-point Likert scales (1 = poor, 5 = excellent) for accuracy, clarity, comprehensiveness and readability. Intra- and interrater reliabilities were calculated, and readability was also evaluated objectively using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES).

RESULTS: Deep Research outperformed the standard model in accuracy (4.7 ± 0.4 vs. 4.0 ± 0.5; p = 0.012) and comprehensiveness (4.8 ± 0.3 vs. 3.9 ± 0.6; p < 0.001). The standard model performed better in clarity (4.6 ± 0.4 vs. 4.4 ± 0.5; p = 0.048). Readability Likert scores were comparable (p = 0.729), but FKGL and FRES favoured the standard model (both p < 0.001). Interrater intraclass correlation coefficients (ICCs) ranged from 0.57 to 0.83; intrarater ICCs from 0.63 to 0.79.

CONCLUSION: Deep Research provides superior scientific rigour, whereas the standard model offers better readability. A hybrid approach combining the strengths of both models may maximise educational effectiveness, though clinical oversight remains essential to mitigate misinformation risks. The observed differences were modest in magnitude, aligning with previously reported accuracy-readability trade-offs in LLMs. These results should be interpreted as exploratory and hypothesis-generating.

LEVEL OF EVIDENCE: Level IV, cross-sectional, comparative simulation study.
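For context, the two objective readability indices reported above are computed from word, sentence and syllable counts; the abstract does not restate them, but the standard published formulas are:

$$\text{FRES} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$

$$\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$

A higher FRES indicates easier text, whereas a higher FKGL corresponds to a higher US school grade level; both metrics favouring the standard model therefore means its answers were easier to read.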
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,646 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,554 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,071 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,851 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations