OpenAlex · Updated hourly · Last updated: 27.05.2026, 07:28

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

GPT-4 versus human authors in clinically complex MCQ creation: A blinded analysis of item quality

2025 · 8 citations · Medical Teacher

Citations: 8 · Authors: 6 · Year: 2025

Abstract

PURPOSE: To compare the structural quality of multiple-choice questions (MCQs) generated by GPT-4, a large language model and a type of artificial intelligence (AI), against human-authored items at both novice and expert levels.

METHODS: We conducted a blinded analysis of 124 MCQs: 40 generated by GPT-4, 39 from human item-writers at Novice level, and 45 from human item-writers at Expert level. A generic prompt for GPT-4 was engineered, which included item-writing guidance, example MCQs, and key learning points. A standardized scoring system was developed covering content validity, scope, item anatomy, cognitive skill level, item-writing flaws, feedback comprehensiveness, veracity and adequacy of clinical reasoning, and global impression of fitness for use. A consensus panel, blinded to the author, evaluated each item using the scoring system.

RESULTS: All groups (Novice, Expert, and AI) were able to generate items within scope. Expert items performed better than Novice items in all categories. There was no difference in the global impressions of Expert and AI items, which suggests overall comparability. Expert items performed better than AI items by a statistically significant, albeit small, margin in the specific domains of content validity, feedback veracity and clinical reasoning, and testing at higher-order cognitive skill levels. However, both groups met acceptable standards in these domains. AI items were more likely than Expert items to be deemed unfit for use and requiring major revision, to indicate an erroneous correct answer, and to display biased answer positioning.

CONCLUSIONS: GPT-4 can produce MCQs testing clinically complex concepts for medical assessment. While the structural quality of AI-generated MCQs is comparable overall to that of expert-written items, human oversight is necessary to ensure content validity and optimize item quality.

Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Radiomics and Machine Learning in Medical Imaging