This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
The Crisis of Ground Truth in Medical AI: Evaluating the Tools Used to Detect LLM Hallucinations
Citations: 0
Authors: 1
Year: 2026
Abstract
The growing use of Large Language Models (LLMs) in healthcare has produced meaningful advances in diagnostic assistance and clinical documentation. However, the persistent risk of medical hallucinations remains a serious barrier to broader adoption. Detecting fabricated or harmful clinical outputs requires a reliable foundation of factual correctness, and establishing that foundation in medicine is far more difficult than it first appears. This paper examines what we call the “crisis of ground truth” in medical AI evaluation. We review the tools and methods used to verify AI outputs, organizing the literature around four interconnected themes: the limitations of traditional lexical metrics, the circular reasoning problem in LLM-as-a-judge setups, the challenges of building useful domain-specific benchmarks, and the need to rethink what clinical truth actually means for evaluation purposes. Static benchmarks are highly susceptible to data contamination and struggle to capture multi-turn clinical reasoning. Scalable automated alternatives that use models to judge other models risk validating outputs against themselves rather than against verified medical knowledge. Through thematic analysis of current work, including frameworks such as CLEVER, MedHallBench, and risk-sensitive evaluation methods, we show that automated evaluators can catch obvious factual errors but consistently miss the subtle reasoning failures and safety-critical gaps that clinical environments require. We argue that resolving the ground truth problem requires hybrid evaluation architectures that combine high-throughput automated checks with structured, expert-led human review at key decision points.
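The first theme, the inadequacy of lexical metrics, can be made concrete with a minimal sketch (the clinical sentences and the simple unigram-F1 scorer below are illustrative examples, not from the paper). Token-overlap metrics such as ROUGE reward surface similarity, so a dangerous negation flip can outscore a correct paraphrase:

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Unigram F1 overlap: the core idea behind lexical metrics like ROUGE-1."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared token count (multiset intersection)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the patient should not receive aspirin due to bleeding risk"
negation_error = "the patient should receive aspirin due to bleeding risk"        # opposite meaning
safe_paraphrase = "aspirin is contraindicated for this patient because of bleeding risk"

print(round(token_f1(reference, negation_error), 2))   # near-perfect score for a harmful output
print(round(token_f1(reference, safe_paraphrase), 2))  # low score for a correct answer
```

Dropping a single token ("not") leaves the overlap almost intact, so the harmful candidate scores roughly 0.95 while the clinically correct paraphrase scores about 0.40, which is exactly the failure mode that motivates semantic and expert-in-the-loop evaluation.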
Related Work
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,652 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,567 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,083 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,856 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations