This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluating the progression of artificial intelligence and large language models in medicine through comparative analysis of ChatGPT-3.5 and ChatGPT-4 in generating vascular surgery recommendations
Citations: 10
Authors: 4
Year: 2023
Abstract
Objective
Artificial intelligence (AI) continues to become increasingly integrated with clinical medicine. Generative AI, and particularly large language models (LLMs) such as ChatGPT-3.5 and ChatGPT-4, have shown promise in generating human-like text, providing a potential tool for augmenting clinical care. These online AI chatbots have already demonstrated remarkable clinical potential, having passed the USMLE, for example. Evaluation of these LLMs in the surgical literature, especially as it applies to judgement and decision-making, is sparse. This study aimed to 1) evaluate the efficacy of ChatGPT-4 in providing clinician-level vascular surgery recommendations and 2) compare its performance with that of its predecessor, ChatGPT-3.5, to gauge the progression of the clinical competencies of LLMs.

Methods
A set of forty clinician-level questions spanning four domains of vascular surgery (carotid artery disease, visceral artery aneurysms, abdominal aortic aneurysms, and chronic limb-threatening ischemia) was generated by clinical experts. These domains were chosen based on the availability of updated guidelines published before September 2021, the cut-off date for the LLMs' training dataset. The questions, without additional context or prompts, were entered into ChatGPT-3.5 and ChatGPT-4 between March 20 and March 25, 2023. Responses were independently evaluated by two blinded reviewers using a 5-point Likert scale assessing comprehensiveness, accuracy, and consistency with guidelines. The Flesch-Kincaid Grade Level of each response was also determined. The independent-samples t-test and Fisher's exact test were employed for comparative analysis.

Results
ChatGPT-4 significantly outperformed ChatGPT-3.5, providing appropriate recommendations for 38 of 40 questions (95%) compared with 13 of 40 (32.5%) for ChatGPT-3.5 (Fisher's exact test, p < 0.001). Despite longer responses (ChatGPT-4 mean 317 ± 58 words vs. ChatGPT-3.5 mean 265 ± 74 words, p < 0.001), the reading ease of both models remained similar, corresponding to college-graduate-level texts.

Conclusion
ChatGPT-4 can consistently respond accurately to complex clinician-level vascular surgery questions. This also represents a substantial advancement in performance compared with its predecessor, which was released only a few months prior, highlighting the rapid progress of LLMs in clinical medicine. Several limitations persist with the use of LLMs, including hallucinations, data privacy issues, and the black-box problem. However, these findings suggest that with further refinement, LLMs like ChatGPT-4 have the potential to become indispensable tools in clinical decision-making, marking an exciting frontier in the fusion of AI with clinical medicine and vascular surgery.
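The headline comparison (38/40 vs. 13/40 appropriate recommendations) can be reproduced with Fisher's exact test on the corresponding 2×2 contingency table. A minimal standard-library sketch is shown below; the function name and the two-sided p-value convention (summing hypergeometric probabilities no larger than that of the observed table) are illustrative assumptions, not taken from the paper:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    total = comb(n, col1)

    def prob(x):
        # Probability of x "successes" in row 1 under fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo = max(0, col1 - row2)   # smallest feasible count in cell a
    hi = min(row1, col1)       # largest feasible count in cell a
    # Small tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-12))

# Reported contingency: ChatGPT-4 appropriate in 38/40, ChatGPT-3.5 in 13/40
p_value = fisher_exact_two_sided(38, 2, 13, 27)
print(p_value < 0.001)  # consistent with the reported p < 0.001
```

For tables this lopsided the exact p-value is far below the 0.001 threshold reported in the abstract; in practice `scipy.stats.fisher_exact` would give the same result without hand-rolling the hypergeometric sum.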
Similar works
Classification of Surgical Complications
2004 · 30,360 citations
2013 ESH/ESC Guidelines for the management of arterial hypertension
2013 · 13,658 citations
CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials
2010 · 13,467 citations
Seventh Report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure
2003 · 13,245 citations
2013 ACCF/AHA Guideline for the Management of Heart Failure
2013 · 12,597 citations