
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Exploring the pitfalls of large language models: Inconsistency and inaccuracy in answering pathology board examination‐style questions

2023 · 26 citations · 1 author · Pathology International


Abstract

Over the past decade, artificial intelligence (AI) has made significant progress, particularly in the development of large language models (LLMs). ChatGPT (OpenAI), including its advanced versions GPT-3.5 and GPT-4, and Google Bard (Google) are two LLMs that have been widely used to generate human-like text, understand context, respond to queries, and facilitate language translation.[1] Importantly, Google Bard can access real-world, current information through Google Search, a capability that makes it especially useful for queries requiring the latest data. These LLMs have been tested for various applications, including in the medical field. For example, ChatGPT has shown promise not only in general medical licensing examinations in the United States[2] but also in specialized areas such as neurosurgery, where GPT-4 surpassed GPT-3.5 and Google Bard.[3] However, despite advances in AI and machine learning in pathology, there has been limited research on the applicability of LLMs in this specific area.[4] The present study aimed to fill that gap by evaluating the performance of ChatGPT and Google Bard in the field of pathology.

We compared the performance of ChatGPT (GPT-4) and Google Bard using questions from the PathologyOutlines.com Question Bank (https://www.pathologyoutlines.com/review-questions), a resource for pathology examination preparation. The question bank contained 3365 questions across pathology subspecialties; for this study, we selected 150 multiple-choice questions, 10 from each of 15 subspecialties, to ensure a balanced dataset: autopsy and forensics; bone, joints and soft tissues; breast; dermatopathology; gastrointestinal and liver; genitourinary and adrenal; gynecological; head and neck; hematopathology; informatics and digital pathology; medical renal; neuropathology; stains and CD markers/immunohistochemistry; thoracic; and clinical pathology. Each question was presented in a single-best-answer, multiple-choice format. Both LLMs received the same set of questions, with no additional context or hints beyond the questions themselves, to simulate real-world application. Questions containing images were excluded because ChatGPT was not capable of processing image data. Additionally, to evaluate the consistency of both LLMs, the same set of 150 questions was posed to the models on two separate occasions, with a 2-week interval.

Overall, ChatGPT significantly outperformed Google Bard across all subspecialties, achieving a total score of 122 out of 150 compared with Google Bard's 70 (p < 0.001; χ² test). Detailed performance outcomes of each LLM across all subspecialties are presented in Table 1.
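As a sanity check, the reported significance can be reproduced from the two total scores alone with a 2 × 2 contingency test. The following Python sketch uses scipy; the library choice and the Yates continuity correction it applies by default are our assumptions, not details from the study:

```python
# Reproduce the chi-squared comparison of overall scores:
# 122/150 correct for ChatGPT (GPT-4) vs. 70/150 for Google Bard.
from scipy.stats import chi2_contingency

# Rows: model; columns: [correct, incorrect] out of 150 questions each.
table = [[122, 150 - 122],  # ChatGPT
         [70, 150 - 70]]    # Google Bard

chi2, p, dof, _ = chi2_contingency(table)  # Yates correction by default
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.1e}")
# chi2 ≈ 37.6, dof = 1, p ≈ 9e-10 — far below 0.001, consistent
# with the significance level reported in the letter.
```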
In the assessment of consistency, total test scores were largely stable between the first and second sessions: ChatGPT scored 122 and 126 out of 150 in the first and second tests, respectively, while Google Bard scored 70 and 69. Despite this relative stability, a detailed inspection revealed substantial changes; answers identical across both sessions were present in only 85% (127/150) of ChatGPT's responses and a lower 61% (92/150) of Google Bard's (Table 1). ChatGPT initially provided 28 incorrect answers; in the retest, it corrected 11 of these but also altered seven correct answers to incorrect ones, and among its initial errors it repeated five. Google Bard exhibited a similar but more pronounced pattern: starting with 80 incorrect responses, it corrected 19 in the retest, made new errors in 20 previously correct answers, and repeated 19 of its original mistakes.
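The per-question bookkeeping behind these agreement and transition counts is straightforward to compute. The sketch below is a hypothetical illustration (the variable names and data layout are our assumptions), taking each model's chosen option per question in the two sessions together with the answer key:

```python
# Tally retest agreement and answer transitions across two sessions.
# `run1`, `run2`, and `key` are hypothetical parallel lists holding, for
# each question, the option chosen in session 1, the option chosen in
# session 2, and the correct option (e.g., "A"-"E").

def retest_summary(run1, run2, key):
    n = len(key)
    identical = sum(a == b for a, b in zip(run1, run2))
    corrected = sum(a != k and b == k for a, b, k in zip(run1, run2, key))
    new_errors = sum(a == k and b != k for a, b, k in zip(run1, run2, key))
    return {
        "identical answers": f"{identical}/{n}",
        "errors corrected on retest": corrected,
        "new errors on retest": new_errors,
    }

# Applied to the study's data, such a tally would report 127/150
# identical answers, 11 corrected errors, and 7 new errors for ChatGPT,
# and 92/150, 19, and 20, respectively, for Google Bard.
```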
We identified incorrect answers from both LLMs in our study. One example was a question about Lynch syndrome: "Lynch syndrome usually arises from a germline mutation in a gene coding for a mismatch repair protein. A germline mutation in which of the following genes could also cause Lynch syndrome?", with options A. BRAF, B. CDH1, C. EPCAM, and D. MUTYH. The correct answer was C. EPCAM. However, Google Bard incorrectly chose option D. MUTYH and justified its answer by associating MUTYH mutations with Lynch syndrome, a factual inaccuracy: MUTYH mutations cause a different type of hereditary colorectal cancer known as MUTYH-associated polyposis. In contrast, ChatGPT correctly selected C. EPCAM. Another example was a question about nemaline myopathy: "In which gene are de novo mutations most commonly associated with nemaline myopathy?", with options A. NEB, B. KLHL40, C. TPM3, D. ACTA1, and E. TNNT1. The correct answer was D. ACTA1, but both models selected A. NEB. Specifically, ChatGPT justified its answer by stating that NEB is the most commonly involved gene in nemaline myopathy, while Google Bard indicated that de novo mutations in the NEB gene are the most common cause of the condition. These responses exhibit factual and interpretative inaccuracies: NEB mutations, while common in nemaline myopathy, are typically inherited in an autosomal recessive manner rather than arising de novo.

Our study further underscores the strengths and weaknesses of LLMs in medicine. While ChatGPT consistently surpassed Google Bard in accuracy and consistency, neither model answered all questions correctly, suggesting gaps in knowledge or comprehension. Additionally, retesting after 2 weeks revealed inconsistencies in both LLMs' responses to the same questions, highlighting potential reliability issues. In assessing the LLMs' comprehension of medical queries, our study identified two error types. First, Google Bard displayed factual inaccuracies, incorrectly linking MUTYH to Lynch syndrome despite its vast data access. Second, ChatGPT exhibited interpretation errors: when answering the question about "de novo" mutations in nemaline myopathy, it correctly identified NEB as a common cause but overlooked the specific "de novo" context, highlighting LLMs' potential for nuanced misunderstandings.

Another important consideration in the application of LLMs in medical fields is their consistency, or reliability, defined as a model's ability to provide the same answer to identical prompts posed on different occasions. Our assessment revealed a suboptimal consistency rate for both LLMs (85% for ChatGPT and 61% for Google Bard), which is consistent with the results of another study that evaluated ChatGPT's responses to surgical case questions.[5] Such inconsistencies underline the current limitations of LLMs and highlight the necessity for further development and refinement to improve their consistency for effective use in the medical field.

There are some limitations to our study. First, there was no direct comparison with human performance; while our results shed light on the capabilities of LLMs in answering complex medical questions, understanding how their performance compares with that of medical students or professionals remains crucial. Additionally, our focus was largely on pathology questions in the English language; to generalize our findings, future studies should encompass different medical specialties and languages. Lastly, the inability to incorporate image-based questions into our evaluation is a further limitation.

In conclusion, our study indicates that LLMs have the potential to assist in clinical decision-making in the future, although both models demonstrated inconsistencies and inaccuracies, emphasizing the need for further development and rigorous validation. While the potential of these AI models is promising, human oversight and expertise remain crucial in the medical field.

Author contributions: Shunsuke Koga conceived and designed the study, acquired and analyzed the data, and drafted the manuscript. This manuscript was edited and proofread by ChatGPT (GPT-4, OpenAI), and the author verified the final content. Conflict of interest: none declared.


Topics

Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging · Autopsy Techniques and Outcomes