OpenAlex · Updated hourly · Last updated: 07.04.2026, 09:02

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Can large language models follow guidelines? A comparative study of ChatGPT-4o and DeepSeek AI in clavicle fracture management based on AAOS recommendations

2025 · 1 citation · BMC Medical Informatics and Decision Making · Open Access
Open full text at publisher

Citations: 1 · Authors: 2 · Year: 2025

Abstract

Artificial intelligence (AI)-based large language models (LLMs) are increasingly used in healthcare education. However, the accuracy, readability, and reliability of their medical outputs remain a concern. This study compared the quality of responses generated by ChatGPT-4o and DeepSeek AI on the diagnosis and treatment of clavicle fractures, based on the 2022 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guideline (CPG). Fourteen clinical questions were formulated from the AAOS CPG for clavicle fractures. Each question was independently submitted to ChatGPT-4o and DeepSeek AI. Responses were evaluated using standardized scoring tools, including DISCERN, PEMAT-P, CLEAR, Flesch-Kincaid Grade Level, Flesch Reading Ease, and Gunning-Fog Index. Two orthopedic surgeons independently rated the responses, and their scores were averaged. Statistical comparison between the two AI models was conducted using the Mann–Whitney U test. DeepSeek AI generated responses with a significantly higher word count (median: 572, IQR: 258.25 vs. 438.5, IQR: 229; p = 0.016) and a significantly higher CLEAR score (median: 18, IQR: 0.75 vs. 16, IQR: 0.75; p < 0.001). No statistically significant differences were found in PEMAT understandability (median: 77.7 vs. 77.7; p = 0.519), PEMAT actionability (median: 0 vs. 0; p = 1.000), or PEMAT total score (median: 57.2 vs. 58; p > 0.05). Similarly, no statistically significant differences were observed in DISCERN (52.1 vs. 51.6; p > 0.05), readability indices, binary accuracy (ChatGPT: 0.93, DeepSeek: 0.89; p > 0.05), or weighted accuracy (ChatGPT: 0.83, DeepSeek: 0.79; p > 0.05). Both models demonstrated generally high accuracy. Both ChatGPT-4o and DeepSeek AI generated coherent and clinically relevant responses to guideline-based questions on clavicle fracture management. However, neither model achieved meaningful PEMAT actionability scores, and occasional inaccuracies and hallucinations were observed. Although DeepSeek produced longer responses, verbosity did not translate into superior quality. These findings suggest that LLMs may serve as supplementary tools for medical education and reference, but they cannot replace evidence-based clinical judgment, underscoring the need for supervised integration and ongoing validation. Finally, the number of prompts analyzed (14) was limited, reflecting the scope of a single guideline; this sample size restricts statistical power and generalizability, and larger multi-guideline datasets will be needed in future studies.
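To make the analysis described above concrete, the following is a minimal Python sketch, not the authors' code, of the kind of comparison the abstract reports: per-question scores for the two models compared with the Mann–Whitney U test (here via SciPy), alongside the standard Flesch-Kincaid Grade Level formula used as one of the readability indices. All score and word counts below are illustrative placeholders, not data from the study.

# Minimal sketch of the comparison pipeline described in the abstract.
# Score values are illustrative placeholders, not data from the study.
from scipy.stats import mannwhitneyu


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical averaged CLEAR scores for the 14 guideline questions.
clear_chatgpt = [16.0, 15.5, 16.0, 16.5, 16.0, 15.5, 16.0,
                 16.0, 16.5, 16.0, 15.5, 16.0, 16.0, 16.5]
clear_deepseek = [18.0, 17.5, 18.0, 18.0, 18.5, 18.0, 17.5,
                  18.0, 18.0, 18.5, 18.0, 17.5, 18.0, 18.0]

# Two-sided Mann-Whitney U test, as used for all model comparisons.
stat, p = mannwhitneyu(clear_chatgpt, clear_deepseek, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4g}")

# Example readability computation for one response (counts are made up).
print(f"FKGL = {flesch_kincaid_grade(words=450, sentences=25, syllables=700):.1f}")

The Mann–Whitney U test is a reasonable default here because the per-question scores are ordinal and the sample (14 questions per model) is too small to assume normality.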

Topics

Shoulder and Clavicle Injuries · Hip and Femur Fractures · Artificial Intelligence in Healthcare and Education