This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Beyond Accuracy: Assessing LLMs' Ability to Recognize Their Limits in Medical Decision-Making
Citations: 0 · Authors: 4 · Year: 2025
Abstract
While Large Language Models (LLMs) demonstrate impressive medical capabilities through Retrieval-Augmented Generation (RAG) and domain optimization, a critical question remains: can LLMs autonomously recognize when to seek external help rather than provide independent medical recommendations? This metacognitive capability is essential for safe healthcare deployment. To address this gap, we introduce a novel evaluation framework assessing LLMs' autonomous help-seeking behavior through three workflows: Force-RAG (mandated external retrieval), No-RAG (internal knowledge only), and Auto-RAG (autonomous decision-making). Our comprehensive evaluation of 13 LLM configurations across six clinical departments using 954 real-world cases reveals three key insights: (1) larger models do not necessarily exhibit superior help-seeking calibration; (2) reasoning strategies significantly impact metacognitive performance across medical domains; (3) proprietary models demonstrate superior autonomy in balancing self-reliance with appropriate help-seeking. These findings challenge conventional scaling assumptions and establish help-seeking behavior as fundamental to medical AI reliability.
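The three workflows named in the abstract suggest a simple routing harness. The Python sketch below is purely illustrative of how the Force-RAG, No-RAG, and Auto-RAG conditions could be wired together; the `StubLLM` class, the `retrieve` function, and the uncertainty heuristic in `wants_retrieval` are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass


def retrieve(case: str) -> str:
    """Placeholder retriever; a real system would query an external medical corpus."""
    return f"[retrieved evidence for: {case}]"


@dataclass
class StubLLM:
    """Placeholder model interface standing in for any evaluated configuration."""
    name: str

    def wants_retrieval(self, case: str) -> bool:
        # In a real Auto-RAG setup the model itself would be prompted to
        # judge its uncertainty; this keyword check is a stand-in.
        return "rare" in case.lower()

    def answer(self, case: str, evidence: str | None) -> str:
        source = "with retrieved evidence" if evidence else "from internal knowledge"
        return f"{self.name} answers {source}."


def run_case(llm: StubLLM, case: str, workflow: str) -> str:
    if workflow == "force_rag":
        # Force-RAG: external retrieval is mandated for every case.
        return llm.answer(case, retrieve(case))
    if workflow == "no_rag":
        # No-RAG: the model must rely on internal knowledge only.
        return llm.answer(case, None)
    if workflow == "auto_rag":
        # Auto-RAG: the model first decides whether it needs help,
        # and retrieval happens only on request.
        evidence = retrieve(case) if llm.wants_retrieval(case) else None
        return llm.answer(case, evidence)
    raise ValueError(f"unknown workflow: {workflow}")


if __name__ == "__main__":
    llm = StubLLM(name="model-A")
    for wf in ("force_rag", "no_rag", "auto_rag"):
        print(wf, "->", run_case(llm, "Patient with a rare arrhythmia", wf))
```

Under this reading, comparing a model's accuracy and retrieval requests across the three conditions is what exposes its help-seeking calibration.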
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations