This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
AI-generated biochemistry test item parameters in MST test conditions
Citations: 0
Authors: 2
Year: 2025
Abstract
BACKGROUND: This study investigated whether ChatGPT 4o could accurately estimate the difficulty of medical assessment items by comparing its predictions with empirically derived parameters from multistage testing (MST) simulations.

METHODS: Using a hybrid simulation-validation design, the researchers had ChatGPT 4o generate 80 multiple-choice biochemistry questions with difficulty estimates (b-parameters), which were then administered via simulated multistage testing to 5,000 virtual examinees.

RESULTS: The analysis revealed moderate agreement between AI-generated and simulation-derived difficulty parameters (r = 0.612, 95% CI [0.472, 0.725]), though ChatGPT systematically overestimated item difficulty, with a mean bias of 0.240 (SD = 0.503). The mean absolute error was relatively modest at 0.447, and 91% of items showed errors below 1.0 logits. However, the AI's estimates were particularly inaccurate for very easy items: 83% of them exhibited absolute errors exceeding 0.5 logits, compared to only 29% of medium-difficulty items. These findings suggest that while ChatGPT 4o shows promise as a tool for preliminary item generation in medical education assessment, it requires empirical calibration and expert oversight before operational implementation, as the systematic bias indicates the AI lacks access to real-world performance feedback.

CONCLUSIONS: The study's conclusions are tempered by important limitations, including its reliance on simulation-based validation rather than actual student performance data and its single-institution sample, underscoring the need for rigorous psychometric validation when integrating artificial intelligence into medical education assessment.
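The agreement statistics named in the abstract (mean bias, mean absolute error, Pearson r with a confidence interval, share of items with error below 1.0 logits) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the function name `agreement_metrics`, the sign convention (error = AI estimate minus simulation-derived value, so a positive bias means overestimated difficulty), and the Fisher z interval, which may not be the interval method the authors used. The toy values are made up and are not data from the study.

```python
import math

def agreement_metrics(predicted, empirical):
    """Compare predicted and empirical b-parameters (both in logits).

    Hypothetical helper, not the authors' code. Errors are defined as
    predicted minus empirical, so positive bias = overestimated difficulty.
    """
    n = len(predicted)
    if n != len(empirical) or n < 4:
        raise ValueError("need two equal-length sequences with at least 4 items")

    errors = [p - e for p, e in zip(predicted, empirical)]
    bias = sum(errors) / n                      # mean signed error
    mae = sum(abs(err) for err in errors) / n   # mean absolute error

    # Pearson correlation computed from sums of squares and cross-products.
    mean_p = sum(predicted) / n
    mean_e = sum(empirical) / n
    sxy = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, empirical))
    sxx = sum((p - mean_p) ** 2 for p in predicted)
    syy = sum((e - mean_e) ** 2 for e in empirical)
    r = sxy / math.sqrt(sxx * syy)

    # Approximate 95% CI via the Fisher z transform (assumes |r| < 1).
    z = math.atanh(r)
    half_width = 1.96 / math.sqrt(n - 3)
    ci = (math.tanh(z - half_width), math.tanh(z + half_width))

    # Share of items whose absolute error is below 1.0 logits.
    share_below_1 = sum(abs(err) < 1.0 for err in errors) / n

    return {"bias": bias, "mae": mae, "r": r, "ci95": ci,
            "share_below_1": share_below_1}

# Made-up toy values for illustration only (not study data).
toy_predicted = [-1.0, 0.0, 0.5, 1.0, 2.0]
toy_empirical = [-1.5, -0.2, 0.6, 0.4, 1.5]
toy_result = agreement_metrics(toy_predicted, toy_empirical)
```

With real data, `predicted` would hold the AI-supplied b-parameters and `empirical` the values recovered from the simulated administration, one pair per item.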