MTRAG: Multi-Target Referring and Grounding via Hybrid Semantic-Spatial Integration

Abstract

Fine-grained visual referring and grounding are critical for enhancing scene understanding and enabling various real-world vision-language applications. Although recent studies have extended multimodal large language models (MLLMs) to these tasks, they still face significant challenges in fine-grained multi-target scenarios. To address this, we propose MTRAG, a pixel-level multi-target referring and grounding framework that leverages semantic-spatial collaboration. Specifically, we introduce a Channel Extension Mechanism (CEM) that enables a global image encoder to extract global semantics and multi-region representations while retaining background context, without extra region feature extractors. Moreover, we introduce a grounding branch for pixel-level grounding and design a Hybrid Adapter (HA) to fuse semantic features from the MLLM branch with spatial information from the grounding branch, thereby enhancing the semantic-spatial alignment. For training, we meticulously curate MTRAG-D, a dataset comprising single- and multi-target referring and grounding samples derived from existing datasets and newly synthesized free-form multi-target referring instruction-following data. We also present MTR-Bench, a benchmark for systematic evaluation of multi-target referring. Extensive experiments across five core tasks, including single- and multi-target referring and grounding as well as image-level captioning, show that MTRAG consistently outperforms strong baselines on both multi- and single-target tasks, while maintaining competitive image-level understanding. The code is available at https://github.com/deng-ai-lab/MTRAG.
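To make the semantic-spatial fusion idea concrete, the following is a minimal, illustrative sketch of how a Hybrid Adapter-style module could combine semantic tokens from an MLLM branch with spatial features from a grounding branch. The abstract does not specify the fusion operator, so the cross-attention design, the class name `HybridAdapterSketch`, and the dimensions used here are assumptions for illustration only, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class HybridAdapterSketch(nn.Module):
    """Illustrative fusion of MLLM semantic tokens with grounding-branch
    spatial features. Cross-attention is an assumed mechanism; the actual
    MTRAG Hybrid Adapter is not described in detail in the abstract."""

    def __init__(self, sem_dim: int = 4096, spa_dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Project high-dimensional semantic tokens down to the spatial feature width.
        self.sem_proj = nn.Linear(sem_dim, spa_dim)
        self.cross_attn = nn.MultiheadAttention(spa_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(spa_dim)

    def forward(self, sem_tokens: torch.Tensor, spa_feats: torch.Tensor) -> torch.Tensor:
        # sem_tokens: (B, N_targets, sem_dim) -- one semantic embedding per referred target
        # spa_feats:  (B, H*W, spa_dim)       -- flattened grounding-branch feature map
        q = self.sem_proj(sem_tokens)                        # semantic queries
        fused, _ = self.cross_attn(q, spa_feats, spa_feats)  # attend over spatial features
        return self.norm(fused + q)                          # residual + norm, per target

# Toy usage: two referred targets over a 32x32 feature map
adapter = HybridAdapterSketch()
sem = torch.randn(1, 2, 4096)
spa = torch.randn(1, 32 * 32, 256)
print(adapter(sem, spa).shape)  # torch.Size([1, 2, 256])
```

In such a design, each fused target embedding could then condition a mask decoder for pixel-level grounding, which is consistent with (but not confirmed by) the abstract's description of the grounding branch.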

Original language: English
Pages (from-to): 2167-2181
Number of pages: 15
Journal: IEEE Transactions on Image Processing
Volume: 35
DOIs
State: Published - 2026

Keywords

  • Multimodal large language models (MLLMs)
  • visual grounding
  • visual referring
  • visual-language learning
