OpenAlex · Updated hourly · Last updated: 20.04.2026, 10:37

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties

2026 · 0 citations · npj Digital Medicine · Open Access

Citations: 0 · Authors: 10 · Year: 2026

Abstract

Large language models (LLMs) have the potential to scale assessment in medical education, but their psychometric equivalence to expert-written items and the detectability of their origin remain uncertain. In a preregistered, single-center, blinded, observational, within-subject comparison, we evaluated 24 GPT-4o-generated versus 24 human-authored, topic-matched multiple-choice questions (MCQs) across radiation oncology, radiology, and nuclear medicine. Medical students (n = 82) and physicians (n = 46) completed an identical 48-item formative mock examination with item origin masked. Item difficulty (human: mean 0.65 [SD 0.22] vs LLM: 0.67 [0.20]) and discrimination (0.27 [0.12] vs 0.29 [0.12]) did not differ significantly, and participants did not identify item origin above chance (0.50). Expert ratings of appropriateness and didactic quality showed low interrater agreement (ICC = 0.07-0.18). In this expert-reviewed, human-in-the-loop workflow, the difficulty and discriminatory power of GPT-4o-generated MCQs did not differ significantly from those of expert-authored items, and examinees did not reliably recognize the items as AI-generated. These findings delineate a feasible pathway for responsibly scaling formative assessment content in imaging-focused medical education, while underscoring the need for explicit educational policies regarding oversight, transparency, and fairness.
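
For readers unfamiliar with the classical test theory indices reported in the abstract, the sketch below shows how item difficulty (proportion of correct responses) and corrected item-total discrimination are conventionally computed from a response matrix, and how detection of item origin can be compared against chance with an exact binomial test. This is a minimal illustration under assumed inputs: the synthetic 128 × 48 response matrix and the detection count are invented placeholders, not the study's actual data or analysis code.

```python
import numpy as np
from scipy.stats import binomtest

# Hypothetical response matrix: rows = examinees, columns = items,
# entries = 1 (correct) / 0 (incorrect). The 128 x 48 shape mirrors
# the study design (82 students + 46 physicians, 48 items), but the
# values here are random placeholders.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(128, 48))

# Item difficulty in classical test theory: the proportion of
# examinees answering each item correctly (higher = easier).
difficulty = responses.mean(axis=0)

# Corrected item-total discrimination: point-biserial correlation
# between each item and the total score excluding that item.
total = responses.sum(axis=1)
discrimination = np.empty(responses.shape[1])
for j in range(responses.shape[1]):
    rest_score = total - responses[:, j]  # total score without item j
    discrimination[j] = np.corrcoef(responses[:, j], rest_score)[0, 1]

print(f"mean difficulty: {difficulty.mean():.2f}")
print(f"mean discrimination: {np.nanmean(discrimination):.2f}")

# Detection above chance (0.50): an exact binomial test on origin
# judgments; 70 correct out of 128 is an invented example count.
result = binomtest(70, n=128, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.3f}")
```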

Similar works