This is an overview page with metadata for this scientific article. The full article is available from the publisher.
6ER-037 Performance and concordance of artificial intelligence in the Board of Pharmacy Specialties
Citations: 0
Authors: 11
Year: 2025
Abstract
<h3>Background and Importance</h3> Artificial Intelligence (AI) is increasingly assuming a pivotal role in modern society. Its diverse applications are transforming numerous tasks, including those within hospital pharmacy. However, the development of robust AI evaluation tools is essential to ensure their effective integration into professional workflows. <h3>Aim and Objectives</h3> To assess the performance and concordance of three AI systems (ChatGPT 3.5, ChatGPT 4.0, and Gemini) in answering Board of Pharmacy Specialties (BPS) examination questions. <h3>Material and Methods</h3> Observational, cross-sectional study conducted in August 2024. All sample questions and answers provided on the BPS website, designed to familiarise candidates with the structure and format of BPS certification exams, were extracted. A protocol was developed to guide the AIs in responding to the questions, instructing them to rely on high-quality references and to refrain from generating answers not based on data. A total of three tests were conducted for each AI, with each test administered by a different researcher. In cases of insufficient information or uncertainty, the AIs were encouraged to opt for ‘DK/NR’ (<i>Doesn’t Know/No Response</i>). Six researchers independently administered the test to each AI. The Chi-Squared test was used to compare the total proportions of correct answers across the different AIs. The Kappa index, interpreted with Altman’s criteria, was applied to assess the concordance of each AI’s responses across the different researchers. <h3>Results</h3> A total of 137 questions were asked. The proportion of correct answers for each test administered by the researchers was as follows: ChatGPT 3.5: 83.2%, 76.6%, and 83.9% (mean 81.3%). ChatGPT 4.0: 86.1%, 83.9%, and 73.7% (mean 81.3%). Gemini: 65.0%, 59.1%, and 65.0% (mean 63.0%). 
Statistically significant differences were found between ChatGPT 4.0 and ChatGPT 3.5 (both 81.3%) and Gemini (63.0%) (p < 0.01). No statistically significant differences were found between ChatGPT 3.5 and 4.0. The Kappa indices and their mean for each AI were: ChatGPT 3.5: 0.773, 0.862, and 0.792 (mean 0.809; excellent agreement). ChatGPT 4.0: 0.686, 0.941, and 0.676 (mean 0.809; excellent agreement). Gemini: 0.548, 0.621, and 0.584 (mean 0.572; moderate agreement). <h3>Conclusion and Relevance</h3> ChatGPT 3.5 and 4.0 show comparable performance with excellent agreement, while Gemini has significantly lower accuracy and consistency. <h3>References and/or Acknowledgements</h3> <h3>Conflict of Interest</h3> No conflict of interest
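The statistical analysis described in the Methods (a Chi-Squared comparison of correct-answer proportions and Cohen's Kappa for rater concordance) can be sketched in pure Python. The correct-answer counts below are rounded reconstructions from the reported percentages (81.3% and 63.0% of 137 questions), not the study's raw data, and the toy rating vectors in the kappa demo are entirely hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical labels of equal length."""
    assert len(r1) == len(r2)
    n = len(r1)
    p_observed = sum(x == y for x, y in zip(r1, r2)) / n
    categories = set(r1) | set(r2)
    p_expected = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

n_questions = 137
gpt35_correct = round(0.813 * n_questions)   # ~111, rounded from the reported 81.3%
gemini_correct = round(0.630 * n_questions)  # ~86, rounded from the reported 63.0%

stat = chi_square_2x2(gpt35_correct, n_questions - gpt35_correct,
                      gemini_correct, n_questions - gemini_correct)
# With these rounded counts the statistic is roughly 11.3, above the
# 1% critical value of 6.63 for 1 df, consistent with the reported p < 0.01.
print(f"chi-square = {stat:.2f}")

# Hypothetical per-question labels from two test runs, just to show the kappa call.
run_a = ["correct", "correct", "wrong", "correct", "DK/NR", "correct"]
run_b = ["correct", "wrong",   "wrong", "correct", "DK/NR", "correct"]
print(f"kappa = {cohens_kappa(run_a, run_b):.3f}")
```

Under Altman's criteria, a kappa above 0.80 is read as excellent agreement and 0.41 to 0.60 as moderate, which matches how the abstract classifies the ChatGPT and Gemini means.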
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,635 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,543 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,051 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,844 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations