This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Detecting prohibited items in x-ray imagery with multimodal large language models
Citations: 0 · Authors: 6 · Year: 2024
Abstract
Recent Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks such as image captioning and question answering. However, they lack an essential perception capability: object detection. In this work, we focus on detecting prohibited items and discuss the possibility of integrating multimodal LLMs into the detection process. Our method first performs image captioning on the x-ray prohibited-item image, then constructs instructions that prompt the multimodal LLM to identify the prohibited item. This approach leverages the contextual understanding and language-processing strengths of MLLMs. While current real-time object-detection methods achieve high accuracy, they often require extensive training on large datasets specific to the prohibited items. In contrast, MLLMs can understand and generate detailed descriptions, which is advantageous in scenarios where prohibited items are not well represented in training data or exhibit significant variability in appearance. Our results suggest that MLLMs can complement traditional methods by providing a more nuanced understanding of prohibited items through their ability to interpret and respond to complex queries, potentially improving detection rates in challenging environments.
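The two-stage pipeline described in the abstract (caption the x-ray image, then prompt an MLLM with an instruction built from that caption) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the prohibited-item label set, the prompt wording, and the helper functions `build_detection_prompt` and `parse_response` are all assumptions, and the captioning model and MLLM calls are left abstract.

```python
# Hypothetical sketch of the pipeline from the abstract:
# 1) an image-captioning step produces a text description of the x-ray scan,
# 2) the caption is wrapped in an instruction and sent to an MLLM,
# 3) the model's free-text answer is parsed for prohibited-item mentions.
# Label set and prompt wording are illustrative assumptions.

PROHIBITED_ITEMS = ["knife", "gun", "scissors", "lighter"]  # example label set


def build_detection_prompt(caption: str) -> str:
    """Turn an image caption into an instruction for the MLLM."""
    items = ", ".join(PROHIBITED_ITEMS)
    return (
        "The following is a description of an x-ray baggage scan:\n"
        f'"{caption}"\n'
        f"Which of these prohibited items, if any, are present: {items}? "
        "Answer with the item names, or 'none'."
    )


def parse_response(response: str) -> list[str]:
    """Extract prohibited-item mentions from the model's free-text answer."""
    lowered = response.lower()
    return [item for item in PROHIBITED_ITEMS if item in lowered]


if __name__ == "__main__":
    # In the real pipeline this caption would come from a captioning model
    # and the response from an MLLM; both are stubbed here.
    caption = "A dark suitcase containing a folding knife next to a laptop."
    prompt = build_detection_prompt(caption)
    mllm_response = "I can see a knife in the bag."  # stubbed model output
    print(parse_response(mllm_response))  # ['knife']
```

In a real deployment, `parse_response` would likely need to be more robust (synonyms, negations such as "no knife"), which is one reason the abstract emphasizes the MLLM's ability to handle nuanced, free-form queries.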