OpenAlex · Updated hourly · Last updated: 10.05.2026, 01:56

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

6ER-037 Performance and concordance of artificial intelligence in the Board of Pharmacy Specialties

2025 · 0 citations · Open Access

Citations: 0

Authors: 11

Year: 2025

Abstract

Background and Importance

Artificial Intelligence (AI) is increasingly assuming a pivotal role in modern society. Its diverse applications are transforming numerous tasks, including those within hospital pharmacy. However, the development of robust AI evaluation tools is essential to ensure their effective integration into professional workflows.

Aim and Objectives

To assess the performance and concordance of three AI systems (ChatGPT 3.5, ChatGPT 4.0, and Gemini) in answering Board of Pharmacy Specialties (BPS) examination questions.

Material and Methods

Observational, cross-sectional study conducted in August 2024. All sample questions and answers provided on the BPS website, designed to familiarise candidates with the structure and format of the BPS certification exams, were extracted. A protocol was developed to guide the AIs in responding to the questions, instructing them to rely on high-quality references and to refrain from generating answers not based on data. A total of three tests were conducted for each AI, with each test administered by a different researcher. In cases of insufficient information or uncertainty, the AIs were encouraged to answer 'DK/NR' (Doesn't Know/No Response). Six researchers independently administered the test to each AI. The Chi-squared test was used to compare the total proportions of correct answers across the different AIs. The Kappa index, interpreted according to Altman's criteria, was applied to assess the concordance of responses from each AI across the different researchers.

Results

A total of 137 questions were asked. The proportions of correct answers for the tests administered by the researchers were as follows: ChatGPT 3.5: 83.2%, 76.6%, and 83.9% (mean 81.3%); ChatGPT 4.0: 86.1%, 83.9%, and 73.7% (mean 81.3%); Gemini: 65.0%, 59.1%, and 65.0% (mean 63.0%). Statistically significant differences were found for ChatGPT 4.0 and ChatGPT 3.5 (81.3%) compared with Gemini (63.0%) (p < 0.01). No statistically significant differences were found between ChatGPT 3.5 and 4.0. The Kappa indices and their mean for each AI were: ChatGPT 3.5: 0.773, 0.862, and 0.792 (mean 0.809; excellent agreement); ChatGPT 4.0: 0.686, 0.941, and 0.676 (mean 0.809; excellent agreement); Gemini: 0.548, 0.621, and 0.584 (mean 0.572; moderate agreement).

Conclusion and Relevance

ChatGPT 3.5 and 4.0 show comparable performance with excellent agreement, while Gemini has significantly lower accuracy and consistency.

References and/or Acknowledgements

Conflict of Interest

No conflict of interest.
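The abstract does not include the underlying analysis code. As an illustration only, the following Python sketch shows how the two reported statistics could be computed: a Chi-squared test comparing proportions of correct answers between two AIs, and Cohen's Kappa for agreement between two test runs of the same AI. The counts are reconstructed from the reported mean proportions, and the response labels are hypothetical placeholders, not study data.

```python
# Minimal sketch of the statistical comparisons described in the abstract.
# Counts are approximated from the reported means; the answer-level data
# from the study are not part of this page.
import numpy as np
from scipy.stats import chi2_contingency

N_QUESTIONS = 137  # total number of questions reported in the abstract

# Mean proportions of correct answers reported for each AI
mean_correct = {"ChatGPT 3.5": 0.813, "ChatGPT 4.0": 0.813, "Gemini": 0.630}


def compare_proportions(p_a, p_b, n=N_QUESTIONS):
    """Chi-squared test on a 2x2 table of correct/incorrect counts."""
    correct_a, correct_b = round(p_a * n), round(p_b * n)
    table = np.array([[correct_a, n - correct_a],
                      [correct_b, n - correct_b]])
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value


def cohens_kappa(run_a, run_b):
    """Cohen's Kappa between two runs (e.g. the same AI administered by
    two researchers); 'DK/NR' is simply treated as its own category."""
    categories = sorted(set(run_a) | set(run_b))
    idx = {c: i for i, c in enumerate(categories)}
    confusion = np.zeros((len(categories), len(categories)))
    for a, b in zip(run_a, run_b):
        confusion[idx[a], idx[b]] += 1
    n = confusion.sum()
    p_observed = np.trace(confusion) / n
    # Expected agreement from the row and column marginals
    p_expected = (confusion.sum(axis=0) @ confusion.sum(axis=1)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


chi2, p = compare_proportions(mean_correct["ChatGPT 4.0"], mean_correct["Gemini"])
print(f"ChatGPT 4.0 vs. Gemini: chi2 = {chi2:.2f}, p = {p:.4f}")

# Hypothetical answer labels for two runs of one AI (placeholders only)
run_1 = ["A", "C", "B", "DK/NR", "A", "D"]
run_2 = ["A", "C", "B", "B", "A", "D"]
print(f"kappa = {cohens_kappa(run_1, run_2):.3f}")
```

For reference, Altman's criteria interpret Kappa values of 0.41–0.60 as moderate, 0.61–0.80 as good, and 0.81–1.00 as very good agreement.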
