This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties
0
Citations
10
Authors
2026
Year
Abstract
Large language models (LLMs) have the potential to scale assessment in medical education, but their psychometric equivalence to expert-written items and the detectability of their origin remain uncertain. In a preregistered, single-center, blinded, observational, within-subject comparison, we evaluated 24 GPT-4o-generated versus 24 human-authored, topic-matched multiple-choice questions (MCQs) across radiation oncology, radiology, and nuclear medicine. Medical students (n = 82) and physicians (n = 46) completed an identical 48-item formative mock examination with item origin masked. Item difficulty (human: mean 0.65 [SD 0.22] vs LLM: 0.67 [0.20]) and discrimination (0.27 [0.12] vs 0.29 [0.12]) did not differ significantly, and participants did not identify item origin above chance (0.50). Expert ratings of appropriateness and didactic quality showed low interrater agreement (ICC = 0.07–0.18). In this expert-reviewed, human-in-the-loop workflow, the difficulty and discriminatory power of MCQs generated with GPT-4o did not differ significantly from those of expert-authored items, and the items were not reliably recognized as AI-generated by examinees. These findings delineate a feasible pathway for responsibly scaling formative assessment content in imaging-focused medical education, while underscoring the need for explicit educational policies regarding oversight, transparency, and fairness.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,490 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,376 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,832 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,553 citations