This is an overview page with metadata for this scientific article. The full article is available from the publisher.

Examiner stratification reveals clinically relevant variability in large language model answers to endodontic patient questions

2026 · 0 citations · Frontiers in Medicine · Open Access

Abstract

Introduction: Large language models (LLMs) are increasingly used by patients seeking endodontic information, yet their clinical reliability and safety in patient-centred communication remain uncertain.

Methods: This study evaluated the clinical reliability and safety of three contemporary LLMs (ChatGPT GPT-4o, Claude Sonnet 4.5, and Gemini 3 Flash) using 50 patient-centred endodontic questions (35 frequently asked questions and 15 scenario-based prompts). Each question was submitted six times per model in independent sessions. Responses were anonymised and independently assessed by four examiners using a structured Clinical Reliability and Safety Framework. Due to poor inter-examiner agreement, analyses were conducted using examiner stratification. Reproducibility was assessed using word count variability, embedding-based semantic similarity, and lexical distance metrics.

Results: Statistically significant differences in clinical reliability were observed across all examiners. ChatGPT consistently received the lowest scores, whereas Gemini most frequently achieved the highest ratings. Model differentiation was clearer for structured frequently asked questions and selected clinical domains than for scenario-based prompts. All models demonstrated stable response lengths across repeated runs. Gemini showed the highest semantic consistency despite greater surface-level rewording.

Discussion: Contemporary LLMs demonstrate clinically meaningful variability beyond factual accuracy, particularly in safety framing and clinical actionability. Reliability is influenced by question structure and clinical context. Multidimensional, examiner-aware evaluation frameworks are necessary to meaningfully assess safety and support responsible integration of LLMs into endodontic patient communication.
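The three reproducibility measures named in the Methods can be illustrated with a minimal Python sketch. The abstract does not state which exact formulas were used, so the choices here are assumptions: coefficient of variation for word count variability, normalised token-level Levenshtein distance for lexical distance, and cosine similarity over precomputed embedding vectors for semantic similarity. All function names are hypothetical, and the embedding model itself is left out.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def word_count_cv(responses):
    """Coefficient of variation of word counts across repeated runs (assumed metric)."""
    counts = [len(r.split()) for r in responses]
    return stdev(counts) / mean(counts)


def levenshtein(a, b):
    """Edit distance between two token sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mean_lexical_distance(responses):
    """Mean pairwise token-level edit distance, normalised to [0, 1] (assumed metric)."""
    distances = []
    for r1, r2 in combinations(responses, 2):
        t1, t2 = r1.split(), r2.split()
        longest = max(len(t1), len(t2)) or 1  # avoid division by zero for empty text
        distances.append(levenshtein(t1, t2) / longest)
    return mean(distances)


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors supplied by the caller."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm
```

With six runs per question, as in the study design, each function would be applied to the six responses a model gave to one question and the results then aggregated across questions. Low lexical distance together with high cosine similarity would indicate near-verbatim repetition, while high lexical distance with high cosine similarity corresponds to the pattern reported for Gemini: consistent meaning despite surface-level rewording.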

Topics

Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare
