OpenAlex · Updated hourly · Last updated: 12.05.2026, 07:07

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Accuracy, consistency, and contextual understanding of large language models in restorative dentistry and endodontics

2025 · 5 citations · Journal of Dentistry · Open Access

5 citations · 4 authors · Year: 2025

Abstract

OBJECTIVE: This study aimed to evaluate and compare the performance of several large language models (LLMs) in the context of restorative dentistry and endodontics, focusing on their accuracy, consistency, and contextual understanding.

METHODS: The dataset was extracted from the national educational archives of the Collège National des Enseignants en Odontologie Conservatrice (CNEOC) and includes all chapters from the reference manual for dental residency applicants. Multiple-choice questions (MCQs) were selected following a review by three independent academic experts. Four LLMs were assessed: ChatGPT-3.5, ChatGPT-4 (OpenAI), Claude-3 (Anthropic), and Mistral 7B (Mistral AI). Model accuracy was determined by comparing responses with expert-provided answers. Consistency was measured through robustness (the ability to provide identical responses to paraphrased questions) and repeatability (the ability to provide identical responses to the same question). Contextual understanding was evaluated based on the model's ability to categorise questions correctly and infer terms from definitions. Additionally, accuracy was reassessed after providing the LLMs with the relevant full course chapter.

RESULTS: A total of 517 MCQs and 539 definitions were included. ChatGPT-4 and Claude-3 demonstrated significantly higher accuracy and repeatability than Mistral 7B, with ChatGPT-4 showing the greatest robustness. Advanced LLMs displayed high accuracy in presenting dental content, although performance varied on closely related concepts. Supplying course chapters generally improved response accuracy, though inconsistently across topics.

CONCLUSION: Even the most advanced LLMs, such as ChatGPT-4 and Claude-3, achieve moderate performance and require cautious use due to inconsistencies in robustness. Future studies should focus on integrating validated content and refining prompt engineering to enhance the educational and clinical utility of LLMs.
CLINICAL SIGNIFICANCE: The findings underscore the potential of advanced LLMs and context-based prompting in restorative dentistry and endodontics.
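The three metrics named in the METHODS section can be expressed as simple proportions. The sketch below is an illustration only, not the study's actual scoring code; all function names and the toy data are hypothetical, and "consistency" here covers both repeatability (repeated runs of the same question) and robustness (paraphrased runs), depending on what is passed in.

```python
# Hypothetical sketch of the metrics described in the abstract:
# accuracy (agreement with expert answers) and consistency
# (identical responses across repeated or paraphrased runs).
# Names and data are illustrative, not from the study.

def accuracy(responses, expert_answers):
    """Fraction of questions where the model matches the expert key."""
    correct = sum(r == a for r, a in zip(responses, expert_answers))
    return correct / len(expert_answers)

def consistency(runs):
    """Fraction of questions answered identically across all runs.

    `runs` is a list of response lists, one per run: repeated
    presentations (repeatability) or paraphrases (robustness).
    """
    per_question = zip(*runs)  # group each question's answers across runs
    stable = sum(len(set(answers)) == 1 for answers in per_question)
    return stable / len(runs[0])

# Toy data: 4 MCQs, an expert answer key, and two runs.
expert = ["A", "C", "B", "D"]
run1 = ["A", "C", "D", "D"]
run2 = ["A", "B", "D", "D"]

print(accuracy(run1, expert))     # 0.75
print(consistency([run1, run2]))  # 0.75
```

Under this framing, comparing models (as the study does for ChatGPT-4, Claude-3, and Mistral 7B) reduces to computing these proportions per model and testing the differences statistically.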

Related works