This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
MAIA: A Multidimensional Benchmark for Assessing Medical AI Agents
Citations: 0
Authors: 9
Year: 2026
Abstract
Large language models show remarkable potential in medical scenarios, especially as autonomous agents for complex clinical reasoning. Rigorous evaluation is essential to ensure their reliability in real-world healthcare applications. However, existing medical benchmarks suffer from narrow task scopes, dependence on public datasets prone to data leakage, and limited coverage of diverse agent capabilities. To address these gaps, we introduce Medical AI Assessment (MAIA), a comprehensive benchmark evaluating medical agents along three dimensions: retrieval-based medical questions generated through biomedical APIs, multi-hop reasoning tasks derived from curated biomedical knowledge graphs, and clinical-pathway reasoning questions constructed from authoritative guidelines. MAIA leverages large language models for automatic question generation, reducing manual effort while maintaining clinical fidelity and reasoning depth. Experiments across base and reasoning models reveal both strengths and gaps, underscoring MAIA’s value for advancing medical agent evaluation. MAIA is publicly available at https://huggingface.co/datasets/DiligentDing/MAIA.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,764 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,674 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,234 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,898 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations