This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparative Evaluation of Four Large Language Models in Turkish Dentistry Specialization Exam
Citations: 1
Authors: 1
Year: 2025
Abstract
Background: This study evaluates the performance of four leading large language models (LLMs) on the 2021 Dentistry Specialization Training Exam (DSE).
Methods: A total of 112 questions from the 2021 DSE were used, 39 in basic sciences and 73 in clinical sciences; questions that included figures or graphs were excluded. Four LLMs were evaluated: Claude-3.5 Haiku, GPT-3.5, Co-pilot, and Gemini-1.5.
Results: In basic sciences, Claude-3.5 Haiku and GPT-3.5 answered all questions correctly (100%), while Gemini-1.5 answered 94.9% and Co-pilot 92.3% correctly. In clinical sciences, the correct answer rate was 89% for Claude-3.5 Haiku, 80.9% for Co-pilot, 79.7% for GPT-3.5, and 65.7% for Gemini-1.5. Across all questions, the correct answer rate was 92.85% for Claude-3.5 Haiku, 86.6% for GPT-3.5, 84.8% for Co-pilot, and 75.9% for Gemini-1.5. The LLMs did not differ significantly in basic sciences (p=0.134), but they differed significantly in clinical sciences (p=0.007) and across all questions (p=0.005).
Conclusion: In clinical sciences and across all questions, Claude-3.5 Haiku performed best, Gemini-1.5 performed worst, and GPT-3.5 and Co-pilot performed similarly. All four LLMs achieved a higher success rate in basic sciences than in clinical sciences. The results suggest that LLMs perform well on knowledge-based questions, such as those in basic sciences, but perform worse on questions that also require clinical reasoning, discussion, and interpretation, such as those in clinical sciences.
Keywords: Artificial intelligence, Dentistry, Dentistry specialization training, Large language model
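The reported percentages can be cross-checked with a little arithmetic. The sketch below is not taken from the paper. It assumes that each percentage is a rounded share of the stated question count (39, 73, or 112), infers the number of correct answers from that, and then runs a Pearson chi-square test on the resulting 4 x 2 table (model by correct/incorrect). The abstract does not name the statistical test, so the choice of chi-square is an assumption, as are the helper names `inferred_correct` and `chi_square_p`.

```python
import math

# Question counts per section, as stated in the abstract.
N_QUESTIONS = {"basic": 39, "clinical": 73, "all": 112}

# Correct-answer rates in percent, as reported in the abstract.
REPORTED_PCT = {
    "basic": {"Claude-3.5 Haiku": 100.0, "GPT-3.5": 100.0,
              "Co-pilot": 92.3, "Gemini-1.5": 94.9},
    "clinical": {"Claude-3.5 Haiku": 89.0, "GPT-3.5": 79.7,
                 "Co-pilot": 80.9, "Gemini-1.5": 65.7},
    "all": {"Claude-3.5 Haiku": 92.85, "GPT-3.5": 86.6,
            "Co-pilot": 84.8, "Gemini-1.5": 75.9},
}


def inferred_correct(section):
    """Infer correct-answer counts from the percentages (assumption: simple rounding)."""
    n = N_QUESTIONS[section]
    return {model: round(pct / 100 * n)
            for model, pct in REPORTED_PCT[section].items()}


def chi_square_p(section):
    """Pearson chi-square on the model x (correct, incorrect) table.

    Assumption: the paper's test is not named in the abstract; this is one
    plausible choice. With four models the test has 3 degrees of freedom,
    for which the survival function has a closed form.
    """
    n = N_QUESTIONS[section]
    correct = list(inferred_correct(section).values())
    if len(correct) != 4:
        raise ValueError("closed-form p-value below is only valid for 4 models")
    # Every model saw the same n questions, so expected counts are equal per model.
    exp_correct = sum(correct) / len(correct)
    exp_wrong = n - exp_correct
    stat = sum((c - exp_correct) ** 2 / exp_correct
               + ((n - c) - exp_wrong) ** 2 / exp_wrong
               for c in correct)
    # Chi-square survival function for df = 3.
    p = (math.erfc(math.sqrt(stat / 2))
         + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))
    return stat, p


if __name__ == "__main__":
    for section in N_QUESTIONS:
        stat, p = chi_square_p(section)
        print(section, inferred_correct(section), f"chi2={stat:.2f}", f"p={p:.4f}")
```

Under these assumptions the inferred basic and clinical counts add up to the inferred overall counts for every model, and the three p-values come out close to the reported 0.134, 0.007, and 0.005. That is consistent with a chi-square test having been used, but it does not confirm it. Note also that this test treats each model's answers as independent samples, although all four models answered the same questions.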