This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education
Citations: 0
Authors: 3
Year: 2026
Abstract
Aims: To systematically evaluate the performance of large language model-based generative Artificial Intelligence (Gen-AI) tools, Gemini and Copilot, in the generation and assessment of multiple-choice questions (MCQs) for use in medical education.

Methods: A total of 335 MCQs were generated from two virtual patient cases using standardized prompts. The Gen-AI tools selected the 56 best-quality items based on criteria encompassing the intended distributions of acceptable level of performance (ALP), Miller's competency pyramid (Miller) and Bloom's revised taxonomy (Bloom) levels, as well as alignment with learning objectives (LOs). Expert medical educators and current Gen-AI tools assessed these items by identifying misleading or confusing distractor(s) for borderline candidates (minimally competent examinees) to calculate ALP values, identifying the key(s), and rating Miller and Bloom levels, LO alignment, stem appropriateness, and technical item flaws. An "AI-extended consensus" served as the intersubjective consensus model (the gold standard). Generation performance was quantified by alignment with this consensus, and assessment performance by the degree to which the Gen-AIs shifted or preserved Expert assessments. Analyses included ICC for reliability, Po/Cohen's/Fleiss' Kappa for categorical agreement, and inferential tests (exact McNemar and Wilcoxon signed-rank) for detecting systematic bias and directional shifts.

Results: The Gen-AIs demonstrated markedly different performance patterns in assigning cognitive levels. For Miller, Gemini-generated MCQs exhibited superior consistency with the intersubjective consensus (ICC(2,k) = 0.82), whereas for Bloom, Copilot-generated MCQs demonstrated this superiority (ICC(2,k) = 0.97). Both tools performed well in LO alignment and key identification, but their approaches to stem structure diverged substantially. Experts perceived the MCQs to be easier than the Gen-AIs claimed, and the current Gen-AI versions found them even easier than both the generating versions and the Experts did. In terms of assessment behaviour, the Gen-AIs showed a systematic stringency tendency in Miller classifications, shifting the Expert consensus from 'knows' to 'knows how' at a statistically significant level (p
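As a rough illustration of the agreement and bias statistics named in the abstract (Po, Cohen's Kappa, exact McNemar, Wilcoxon signed-rank), the following Python sketch uses hypothetical rating vectors and counts; it is not the study's analysis code, and the data values are invented for demonstration only. ICC(2,k) would typically be computed separately, for example with pingouin.intraclass_corr.

```python
# Illustrative sketch of the agreement/bias statistics named in the abstract.
# All rating vectors and counts below are hypothetical, not the study's data.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical ordinal Miller-level ratings (1 = knows ... 4 = does) for 10 MCQs,
# one vector for the Expert consensus and one for a Gen-AI assessor.
expert = np.array([1, 2, 2, 3, 1, 2, 4, 3, 2, 1])
genai  = np.array([2, 2, 3, 3, 2, 3, 4, 4, 2, 2])

# Proportion of observed agreement (Po) and Cohen's kappa for categorical agreement.
po = np.mean(expert == genai)
kappa = cohen_kappa_score(expert, genai)

# Wilcoxon signed-rank test for a systematic directional shift between paired ratings
# (e.g. Gen-AI assigning higher Miller levels than the Experts).
w_stat, w_p = wilcoxon(expert, genai)

# Exact McNemar test on a 2x2 table of paired binary judgements
# (e.g. "LO-aligned: yes/no" by Experts vs. Gen-AI) for systematic bias.
table = np.array([[30, 9],
                  [3, 14]])  # hypothetical paired counts
mcnemar_res = mcnemar(table, exact=True)

print(f"Po={po:.2f}, kappa={kappa:.2f}, "
      f"Wilcoxon p={w_p:.3f}, McNemar p={mcnemar_res.pvalue:.3f}")
```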
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations