This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Medical Image Spatial Grounding with Semantic Sampling
Citations: 0
Authors: 6
Year: 2026
Abstract
Vision language models (VLMs) have shown significant promise in visual grounding for both images and videos. In medical imaging research, VLMs bridge object detection and segmentation on one side and report understanding and generation on the other. However, spatial grounding of anatomical structures in the three-dimensional space of medical images poses many unique challenges. In this study, we examine image modalities, slice directions, and coordinate systems as differentiating factors for the vision components of VLMs, and the use of anatomical, directional, and relational terminology as factors for the language components. We then demonstrate that visual and textual prompting systems such as labels, bounding boxes, and mask overlays have varying effects on the spatial grounding ability of VLMs. To enable measurement and reproducibility, we introduce MIS-Ground, a benchmark that comprehensively tests a VLM for vulnerabilities in specific modes of Medical Image Spatial Grounding. We release MIS-Ground to the public at https://anonymous.4open.science/r/mis-ground. In addition, we present MIS-SemSam, a low-cost, inference-time, model-agnostic optimization for VLMs that improves their spatial grounding ability through Semantic Sampling. We find that MIS-SemSam improves the accuracy of Qwen3-VL-32B on MIS-Ground by 13.06%.