This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Medical Context Distorts Decisions in Clinical Vision Language Models
Citations: 0
Authors: 5
Year: 2026
Abstract
Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) over-reliance on text at the expense of images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically tuned VLMs, both open and closed, on chest X-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulation, we find that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, VLMs are heavily influenced by irrelevant reports, and minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before these models are considered for use in clinical practice.
Similar works
MizAR 60 for Mizar 50
2023 · 75,815 citations
ImageNet: A large-scale hierarchical image database
2009 · 61,482 citations
Microsoft COCO: Common Objects in Context
2014 · 41,934 citations
Fully convolutional networks for semantic segmentation
2015 · 36,757 citations
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
2017 · 21,167 citations