This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Retrieval‐augmented ChatGPT‐4o improves accuracy but reduces readability in hip arthroscopy patient education
2 citations · 8 authors · 2025
Abstract
PURPOSE: To compare the accuracy, readability and patient-centredness of responses generated by standard ChatGPT-4o and its retrieval-augmented 'Deep Research' mode for hip arthroscopy education, addressing the current uncertainty about the reliability of large language models in orthopaedic patient information.

METHODS: Thirty standardised patient questions were derived through structured searches of reputable orthopaedic health information websites. Both ChatGPT configurations independently generated responses. Two fellowship-trained orthopaedic surgeons assessed each response independently, using 5-point Likert scales (1 = poor, 5 = excellent) for accuracy, clarity, comprehensiveness and readability. Intra- and interrater reliabilities were calculated, and readability was also evaluated objectively using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES).

RESULTS: Deep Research outperformed the standard model in accuracy (4.7 ± 0.4 vs. 4.0 ± 0.5; p = 0.012) and comprehensiveness (4.8 ± 0.3 vs. 3.9 ± 0.6; p < 0.001). The standard model performed better in clarity (4.6 ± 0.4 vs. 4.4 ± 0.5; p = 0.048). Readability Likert scores were comparable (p = 0.729), but FKGL and FRES favoured the standard model (both p < 0.001). Interrater intraclass correlation coefficients (ICCs) ranged from 0.57 to 0.83; intrarater ICCs from 0.63 to 0.79.

CONCLUSION: Deep Research provides superior scientific rigour, whereas the standard model offers better readability. A hybrid approach combining the strengths of both models may maximise educational effectiveness, though clinical oversight remains essential to mitigate misinformation risks. The observed differences were modest in magnitude, aligning with previously reported accuracy-readability trade-offs in LLMs. These results should be interpreted as exploratory and hypothesis-generating.

LEVEL OF EVIDENCE: Level IV, cross-sectional, comparative simulation study.
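For context, the two objective readability indices reported above are computed from word, sentence and syllable counts; the abstract does not restate them, but the standard published formulas are:

$$\text{FRES} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$

$$\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$

A higher FRES indicates easier text, whereas a higher FKGL corresponds to a higher US school grade level; both metrics favouring the standard model therefore means its answers were easier to read.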
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,646 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,554 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,071 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,851 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations