This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Multidisciplinary expert evaluation of large language models on questions regarding bariatric surgery: a comparative analysis of ERNIE Bot 4.0, ChatGPT-4, Claude 3 Opus, and Gemini Pro
0
Citations
10
Authors
2026
Year
Abstract
Large language models (LLMs) have potential in bariatric surgery consultations, but current evaluations are limited to bariatric specialists, contradicting guidelines that call for multidisciplinary assessment. This study uses a multidisciplinary framework to evaluate LLM performance on bariatric surgery queries. Four LLMs (ERNIE Bot 4.0, ChatGPT-4, Claude 3 Opus, and Gemini Pro) were tested on 50 common bariatric surgery questions, generating 200 responses. A panel of seven experts (4 bariatric surgeons, 1 obesity physician, and 2 dietitians) assessed accuracy and comprehensiveness. Three prompt approaches were used to evaluate self-correction: basic review, web-enabled review, and evidence-based review. Qualitative analysis identified poorly rated responses. The study adheres to the TRIPOD-LLM reporting guideline. Rater agreement was fair: for accuracy, Fleiss’ kappa = 0.210 (95% CI: 0.208–0.212; Z = 7.815; P < 0.001); for comprehensiveness, Fleiss’ kappa = 0.464 (95% CI: 0.453–0.476; Z = 2.543; P < 0.011). Claude 3 Opus provided the longest answers, while ERNIE Bot 4.0 had the highest accuracy (19.46 ± 2.07). ChatGPT-4 had 90.0% “good” responses, compared to 84.0% for ERNIE Bot 4.0, 80.0% for Claude 3 Opus, and 48.0% for Gemini Pro (P ≤ 0.05). All LLMs performed suboptimally in comprehensiveness (scores: 2.86–3.13 out of 5). However, they showed significant self-correction capabilities, especially when using internet searches and evidence-based resources. LLMs show potential for bariatric surgery education, but their direct clinical use is challenging due to variable accuracy and suboptimal comprehensiveness. Future efforts should focus on developing specialized LLMs with robust evidence and multidisciplinary input to ensure patient safety and optimal outcomes.
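The inter-rater agreement statistic reported in the abstract is Fleiss' kappa, which compares observed per-subject agreement against chance agreement derived from category marginals. A minimal sketch in plain Python; the rating matrix in the example is illustrative only, not the study's expert ratings:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a rating matrix.

    counts[i][j] = number of raters assigning subject i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(counts)            # number of subjects
    n = sum(counts[0])         # raters per subject
    total = N * n              # total number of ratings

    # Mean observed per-subject agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N

    # Chance agreement P_e from marginal category proportions
    k = len(counts[0])
    p = [sum(row[j] for row in counts) / total for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)


# Illustrative data: 3 subjects, 4 raters, 2 categories, perfect agreement
print(fleiss_kappa([[4, 0], [0, 4], [4, 0]]))  # → 1.0
```

Values near 0 indicate chance-level agreement and 1 indicates perfect agreement, so the study's kappa of 0.210 for accuracy sits in the conventional "fair" band.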
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations