Abstract
Current vision-language trackers often struggle to fuse multimodal information comprehensively and effectively, which leads to suboptimal performance in multimodal tasks. This study introduces LGTrack, a novel language-guided visual tracking framework designed to fuse vision and language information more comprehensively and efficiently. In the encoding stage, an Enhanced Multimodal Interaction Module is proposed to achieve full multimodal fusion; it is used to construct Early Language Multilevel-guided Multimodal Encoding, which leverages deep semantic information to guide vision encoding early and at multiple levels. In the decoding stage, a multimodal decoding scheme based on a Joint Query is proposed; the query draws on global features from both the vision and language modalities to guide the decoding layers efficiently. Together, these designs yield a more comprehensive fusion of multimodal information. Additionally, a contrastive learning strategy is introduced to align vision-language features in the semantic space, further enhancing fusion effectiveness. Extensive experiments on multiple benchmarks, such as LaSOT, TNL2K, and OTB99-Lang, demonstrate that our approach outperforms existing state-of-the-art trackers.
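The abstract does not state which contrastive objective is used for vision-language alignment. As an illustration only, the sketch below shows a symmetric InfoNCE loss, a common choice for aligning paired features from two modalities. It is an assumption, not LGTrack's actual loss: the function names, the `temperature` value, and the use of plain Python lists in place of tensors are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(vision_feats, language_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired feature vectors.

    vision_feats[i] and language_feats[i] are treated as a positive pair;
    every other combination in the batch is treated as a negative.
    The temperature default is an illustrative assumption.
    """
    v = [l2_normalize(x) for x in vision_feats]
    t = [l2_normalize(x) for x in language_feats]
    # Cosine similarity matrix scaled by temperature: sim[i][j] = <v_i, t_j> / T
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the vision-to-language and language-to-vision directions.
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_transposed))

# Toy batch: two vision features and their matching language features.
vision = [[1.0, 0.0], [0.0, 1.0]]
language_aligned = [[1.0, 0.0], [0.0, 1.0]]
language_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_info_nce(vision, language_aligned)
loss_swapped = symmetric_info_nce(vision, language_swapped)
```

In this toy batch, the correctly paired features give a much lower loss than the swapped ones, which is the property a contrastive alignment objective relies on: minimizing it pulls matching vision and language features together and pushes mismatched ones apart.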
| Original language | English |
|---|---|
| Article number | 290 |
| Journal | ACM Transactions on Multimedia Computing, Communications and Applications |
| Volume | 21 |
| Issue number | 10 |
| DOIs | |
| State | Published - 15 Oct 2025 |
Keywords
- Joint Query
- Vision-language tracking
- Early language multilevel guidance
- Multimodal alignment
Fingerprint
Dive into the research topics of 'Language-guided Visual Tracking: Comprehensive and Effective Multimodal Information Fusion'. Together they form a unique fingerprint.