This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Multidisciplinary expert evaluation of large language models on questions regarding bariatric surgery: a comparative analysis of ERNIE Bot 4.0, ChatGPT-4, Claude 3 Opus, and Gemini Pro
0
Citations
10
Authors
2026
Year
Abstract
Large language models (LLMs) have potential in bariatric surgery consultations, but current evaluations are limited to bariatric specialists, contradicting guidelines that call for multidisciplinary assessment. This study uses a multidisciplinary framework to evaluate LLM performance on bariatric surgery queries. Four LLMs (ERNIE Bot 4.0, ChatGPT-4, Claude 3 Opus, and Gemini Pro) were tested on 50 common bariatric surgery questions, generating 200 responses. A panel of seven experts (4 bariatric surgeons, 1 obesity physician, and 2 dietitians) assessed accuracy and comprehensiveness. Three prompt approaches were used to evaluate self-correction: basic review, web-enabled review, and evidence-based review. Qualitative analysis identified poorly rated responses. The study adheres to the TRIPOD-LLM reporting guideline. Rater agreement was fair: for accuracy, Fleiss’ kappa = 0.210 (95% CI: 0.208–0.212; Z = 7.815; P < 0.001); for comprehensiveness, Fleiss’ kappa = 0.464 (95% CI: 0.453–0.476; Z = 2.543; P < 0.011). Claude 3 Opus provided the longest answers, while ERNIE Bot 4.0 had the highest accuracy (19.46 ± 2.07). ChatGPT-4 had 90.0% “good” responses, compared to 84.0% for ERNIE Bot 4.0, 80.0% for Claude 3 Opus, and 48.0% for Gemini Pro (P ≤ 0.05). All LLMs performed suboptimally in comprehensiveness (scores: 2.86–3.13 out of 5). However, they showed significant self-correction capabilities, especially when using internet searches and evidence-based resources. LLMs show potential for bariatric surgery education, but their direct clinical use is challenging due to variable accuracy and suboptimal comprehensiveness. Future efforts should focus on developing specialized LLMs with robust evidence and multidisciplinary input to ensure patient safety and optimal outcomes.
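The inter-rater agreement statistic reported in the abstract is Fleiss' kappa, which compares observed per-subject agreement against chance agreement derived from category marginals. A minimal sketch in plain Python; the rating matrix in the example is illustrative only, not the study's expert ratings:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a rating matrix.

    counts[i][j] = number of raters assigning subject i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(counts)            # number of subjects
    n = sum(counts[0])         # raters per subject
    total = N * n              # total number of ratings

    # Mean observed per-subject agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N

    # Chance agreement P_e from marginal category proportions
    k = len(counts[0])
    p = [sum(row[j] for row in counts) / total for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)


# Illustrative data: 3 subjects, 4 raters, 2 categories, perfect agreement
print(fleiss_kappa([[4, 0], [0, 4], [4, 0]]))  # → 1.0
```

Values near 0 indicate chance-level agreement and 1 indicates perfect agreement, so the study's kappa of 0.210 for accuracy sits in the conventional "fair" band.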
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations