This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance of large language models on family medicine licensing exams
Citations: 5
Authors: 5
Year: 2025
Abstract
BACKGROUND AND AIM: Large language models (LLMs) have shown promise in specialized medical exams but remain less explored in family medicine and primary care. This study evaluated eight state-of-the-art LLMs on the official Israeli primary care licensing exam, focusing on prompt design and explanation quality.

METHODS: Two hundred multiple-choice questions were tested using simple prompts and few-shot Chain-of-Thought prompts (prompts that include worked examples illustrating the reasoning). Performance differences were assessed with Cochran's Q test and pairwise McNemar tests. A stress test of the top performer (OpenAI's o1-preview) examined 30 selected questions, with two physicians scoring the explanations for accuracy, logic, and hallucinations (extra or fabricated information not supported by the question).

RESULTS: Five models exceeded the 65% passing threshold under simple prompts; seven did so with few-shot prompts. o1-preview reached 85.5%. In the stress test, explanations were generally coherent and accurate, with 5 of 120 flagged for hallucinations. Inter-rater agreement on explanation scoring was high (weighted kappa 0.773; intraclass correlation coefficient (ICC) 0.776).

CONCLUSIONS: Most tested models performed well on an official family medicine exam, especially with structured prompts. Nonetheless, multiple-choice formats cannot address broader clinical competencies such as physical exams and patient rapport. Future efforts should refine these models to eliminate hallucinations, test for socio-demographic biases, and ensure alignment with real-world demands.
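The statistical comparison named in the methods can be illustrated with a minimal Python sketch. It computes the Cochran's Q statistic across several models and an exact two-sided McNemar p-value for one pair of models. Everything here is an assumption for illustration: the function names, the tiny 0/1 correctness table, and the choice of the exact binomial form of McNemar's test. The abstract does not say which variant or software the authors used, and the numbers below are not the study's data.

```python
from math import comb


def cochran_q(results):
    """Cochran's Q statistic for a table with one row per question
    and one 0/1 (wrong/correct) column per model."""
    k = len(results[0])
    col_totals = [sum(row[j] for row in results) for j in range(k)]
    row_totals = [sum(row) for row in results]
    n_total = sum(row_totals)
    denominator = k * n_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # Every question was answered identically by all models.
        raise ValueError("Q is undefined when no question separates the models")
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n_total ** 2)
    return numerator / denominator


def mcnemar_exact(a, b):
    """Exact two-sided McNemar p-value for two paired 0/1 answer vectors."""
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = only_a + only_b  # discordant pairs are the only informative ones
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data: 6 questions (rows) x 3 models (columns).
rows = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
]
model_a = [row[0] for row in rows]
model_c = [row[2] for row in rows]

q_stat = cochran_q(rows)                 # compare against chi-square, df = k - 1
p_pair = mcnemar_exact(model_a, model_c)
print(f"Cochran's Q = {q_stat:.2f}, McNemar p (A vs C) = {p_pair:.3f}")
```

Cochran's Q acts as the omnibus test (do any of the models differ on the same questions?), and the pairwise McNemar tests then locate which pairs differ. With eight models there are 28 pairs, so a multiple-comparison correction would normally be considered; the abstract does not state whether one was applied.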