This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning (Preprint)
Citations: 0
Authors: 8
Year: 2026
Abstract
BACKGROUND: Large language models (LLMs) demonstrate strong performance on medical specialty board multiple-choice question (MCQ) answering; however, they underperform in more complex medical reasoning scenarios. This gap indicates a need to improve both LLM medical reasoning and evaluation paradigms.
OBJECTIVE: To develop an automated framework to evaluate LLM capabilities in medical reasoning.
METHODS: MedEvalArena is an automated framework in which LLMs engage in a symmetric round-robin format: each model generates challenging board-style medical MCQs, then serves on an ensemble LLM-as-judge bench to adjudicate the validity of generated questions, and finally completes the validated exam as an examinee. We compared leading LLMs across the OpenAI, Grok, Gemini, Claude, Kimi, and DeepSeek families on both question-generation validity and exam-taking performance.
RESULTS: Across frontier models, we observe no statistically significant differences in exam-taking performance, with mean accuracies of 85.7%-91.7%, suggesting convergence in medical reasoning ability across frontier LLMs for question-answering tasks. LLM accuracy was comparable to mean human physician accuracy of 85.6% (95% CI: 79.4%-91.7%), and the differences were not statistically significant. We found significant differences between models in question validity rate, with higher validity rates for questions generated by OpenAI, Gemini, and Claude frontier models (83.3%-94.8%) than by Kimi, Grok, and DeepSeek models (46.0%-63.8%). When jointly considering accuracy and inference cost, multiple frontier models lie on the Pareto frontier, with no single model dominating across both dimensions.
CONCLUSIONS: MedEvalArena provides an automated framework for benchmarking LLM medical reasoning, identifying valid question generation as a more discriminative task than question answering.
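The round-robin protocol summarized in the METHODS section can be pictured with a short sketch. The following Python snippet is a minimal, hypothetical illustration based only on the abstract: the model names, the majority-vote validity rule, and the stubbed model calls are assumptions for readability, not the authors' implementation.

```python
# Hypothetical sketch of a symmetric round-robin evaluation: each model
# generates a question, the remaining models vote on its validity as an
# ensemble judge, and validated questions are answered by the other models.
# All names and the stubbed calls below are illustrative placeholders.
import random

MODELS = ["model_a", "model_b", "model_c"]  # placeholder model identifiers


def generate_mcq(model: str) -> dict:
    # Placeholder for an API call asking `model` for a board-style MCQ.
    return {"author": model, "stem": "...", "options": ["A", "B", "C", "D"], "answer": "A"}


def judge_valid(judge: str, question: dict) -> bool:
    # Placeholder for an LLM-as-judge validity check on one question.
    return random.random() < 0.8


def answer_mcq(examinee: str, question: dict) -> bool:
    # Placeholder: returns True when the examinee answers correctly.
    return random.random() < 0.9


def run_round_robin(models):
    stats = {m: {"generated": 0, "valid": 0, "attempted": 0, "correct": 0} for m in models}
    for generator in models:
        question = generate_mcq(generator)
        stats[generator]["generated"] += 1
        judges = [m for m in models if m != generator]
        # Ensemble adjudication: simple majority vote on validity (assumed rule).
        votes = sum(judge_valid(j, question) for j in judges)
        if votes <= len(judges) / 2:
            continue  # question rejected by the judge bench
        stats[generator]["valid"] += 1
        for examinee in models:
            if examinee == generator:
                continue  # a model never answers its own question
            stats[examinee]["attempted"] += 1
            if answer_mcq(examinee, question):
                stats[examinee]["correct"] += 1
    return stats


if __name__ == "__main__":
    print(run_round_robin(MODELS))
```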
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,652 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,567 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,083 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,856 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations