OpenAlex · Updated hourly · Last updated: 22.05.2026, 01:51

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Using Generative AI to Appraise the Quality of Medical Education Research Studies: Agreement Between AI-Generated and Human MERSQI Scores

2026 · 0 citations · AEM Education and Training · Open Access

Citations: 0 · Authors: 8 · Year: 2026

Abstract

Objectives: The increasing volume of medical education research necessitates efficient, reliable, and scalable methods for conducting quality appraisals. The Medical Education Research Study Quality Instrument (MERSQI) is a widely used tool, although its manual scoring process remains resource-intensive. This study evaluated how well large language models (LLMs) appraise medical education research using the MERSQI tool in comparison with human judges.

Methods: Three LLMs (GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro) assigned MERSQI domain scores to 1423 medical education research articles. The authors compared AI-generated scores with human-generated scores using intraclass correlation coefficients (ICCs) across the six MERSQI domains. They evaluated the agreement between AI- and human-generated MERSQI composite scores using Bland-Altman plots.

Results: Domain-level ICC values ranged from fair (0.24) to near perfect (0.81), with the lowest agreement observed in the 'sampling,' 'validity evidence,' and 'data analysis' domains. No single LLM consistently outperformed the others across all domains. Composite score agreement with human ratings was substantial and similar across LLMs (ICC range: 0.65-0.69). GPT-5 produced slightly lower composite scores than humans, while Claude Sonnet 4 and Gemini 2.5 Pro produced higher scores, with Gemini showing the largest deviation. The Bland-Altman plots for Gemini 2.5 Pro suggested proportional bias, indicating its agreement with human scores varied across the range of study quality.

Conclusions: These LLMs demonstrated substantial agreement with human raters for MERSQI composite scores, but domain-level agreement varied. Systematic differences in scoring patterns highlight the need for human oversight and additional calibrations before integrating LLMs into systematic review appraisal workflows.
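
For readers unfamiliar with the Bland-Altman approach named in the abstract, the following is a minimal sketch of the quantities such a plot summarizes: the mean difference (bias), the 95% limits of agreement, and the slope of differences against pairwise means as a rough indicator of proportional bias. The function name `bland_altman` and the score lists are hypothetical and invented purely for illustration; they are not data from the study, and the abstract does not specify how the authors implemented their analysis.

```python
from statistics import mean, stdev


def bland_altman(ai_scores, human_scores):
    """Return (bias, lower_loa, upper_loa, slope) for paired scores.

    bias      : mean of (AI - human) differences
    lower/upper_loa : bias -/+ 1.96 * SD of the differences
    slope     : least-squares slope of differences on pairwise means;
                a slope far from 0 hints at proportional bias
    """
    if len(ai_scores) != len(human_scores) or len(ai_scores) < 3:
        raise ValueError("need at least three paired scores")

    diffs = [a - h for a, h in zip(ai_scores, human_scores)]
    means = [(a + h) / 2 for a, h in zip(ai_scores, human_scores)]

    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd

    # Ordinary least-squares slope of differences regressed on means.
    m_bar = mean(means)
    sxx = sum((m - m_bar) ** 2 for m in means)
    sxy = sum((m - m_bar) * (d - bias) for m, d in zip(means, diffs))
    slope = sxy / sxx if sxx else 0.0

    return bias, lower, upper, slope


# Toy composite scores, invented for illustration only (not study data).
human = [9.0, 10.5, 12.0, 13.5, 15.0]
ai = [9.5, 11.5, 13.5, 15.5, 17.5]

bias, lower, upper, slope = bland_altman(ai, human)
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f}), slope={slope:.3f}")
```

In this toy case the AI scores exceed the human scores by a growing margin as study quality rises, so the slope is positive; that is the kind of pattern the abstract describes as proportional bias.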

Similar works