OpenAlex · Updated hourly · Last updated: 20.04.2026, 10:37

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties

2026 · 0 citations · npj Digital Medicine · Open Access

Citations: 0 · Authors: 10 · Year: 2026

Abstract

Large language models (LLMs) have the potential to scale assessment in medical education, but their psychometric equivalence to expert-written items and the detectability of their origin remain uncertain. In a preregistered, single-center, blinded, observational, within-subject comparison, we evaluated 24 GPT-4o-generated versus 24 human-authored, topic-matched multiple-choice questions (MCQs) across radiation oncology, radiology, and nuclear medicine. Medical students (n = 82) and physicians (n = 46) completed an identical 48-item formative mock examination with item origin masked. Item difficulty (human: mean 0.65 [SD 0.22] vs LLM: 0.67 [0.20]) and discrimination (0.27 [0.12] vs 0.29 [0.12]) did not differ significantly, and participants did not identify item origin above chance (0.50). Expert ratings of appropriateness and didactic quality showed low interrater agreement (ICC = 0.07-0.18). In this expert-reviewed, human-in-the-loop workflow, the difficulty and discriminatory power of GPT-4o-generated MCQs did not differ significantly from those of expert-authored items, and examinees did not reliably recognize the items as AI-generated. These findings delineate a feasible pathway for responsibly scaling formative assessment content in imaging-focused medical education, while underscoring the need for explicit educational policies regarding oversight, transparency, and fairness.
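
For readers unfamiliar with the classical test theory indices reported in the abstract, the sketch below shows how item difficulty (proportion of correct responses) and corrected item-total discrimination are conventionally computed from a response matrix, and how detection of item origin can be compared against chance with an exact binomial test. This is a minimal illustration under assumed inputs: the synthetic 128 × 48 response matrix and the detection count are invented placeholders, not the study's actual data or analysis code.

```python
import numpy as np
from scipy.stats import binomtest

# Hypothetical response matrix: rows = examinees, columns = items,
# entries = 1 (correct) / 0 (incorrect). The 128 x 48 shape mirrors
# the study design (82 students + 46 physicians, 48 items), but the
# values here are random placeholders.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(128, 48))

# Item difficulty in classical test theory: the proportion of
# examinees answering each item correctly (higher = easier).
difficulty = responses.mean(axis=0)

# Corrected item-total discrimination: point-biserial correlation
# between each item and the total score excluding that item.
total = responses.sum(axis=1)
discrimination = np.empty(responses.shape[1])
for j in range(responses.shape[1]):
    rest_score = total - responses[:, j]  # total score without item j
    discrimination[j] = np.corrcoef(responses[:, j], rest_score)[0, 1]

print(f"mean difficulty: {difficulty.mean():.2f}")
print(f"mean discrimination: {np.nanmean(discrimination):.2f}")

# Detection above chance (0.50): an exact binomial test on origin
# judgments; 70 correct out of 128 is an invented example count.
result = binomtest(70, n=128, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.3f}")
```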

Similar works