
Language-guided Visual Tracking: Comprehensive and Effective Multimodal Information Fusion

  • Beihang University

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Current vision-language trackers often struggle to fuse multimodal information comprehensively and effectively, leading to suboptimal performance in multimodal tasks. This study introduces LGTrack, a novel language-guided visual tracking framework designed to achieve a more comprehensive and efficient fusion of vision and language information. In the encoding stage, an Enhanced Multimodal Interaction Module is proposed to achieve full multimodal fusion, and it is used to construct Early Language Multilevel-guided Multimodal Encoding, which leverages deep semantic information for early and multilevel guidance of vision encoding. In the decoding stage, a multimodal decoding based on Joint Query is proposed, utilizing global features from both vision and language modalities, guiding the efficient operation of the decoding layers. These innovations achieve a more comprehensive fusion of multimodal information. Additionally, a contrastive learning strategy is introduced to align vision-language features in the semantic space, further enhancing the fusion effectiveness. Extensive experiments on multiple benchmarks such as LaSOT, TNL2K, and OTB99-Lang demonstrate that our approach outperforms existing state-of-the-art trackers.
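
The abstract does not specify the form of the contrastive objective, so the following is only a minimal sketch of one common choice, a symmetric InfoNCE loss over paired vision and language feature vectors. It is not confirmed to be the loss used in LGTrack. The function names, the `temperature` value, and the toy feature vectors are all assumptions made for illustration, and the code uses plain Python lists rather than a deep-learning framework so that it runs without dependencies.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(vision_feats, language_feats, temperature=0.07):
    """Hypothetical alignment loss: pair i of vision/language features is the
    positive, and every other pairing in the batch is treated as a negative."""
    v = [l2_normalize(x) for x in vision_feats]
    t = [l2_normalize(x) for x in language_feats]
    # Cosine similarity between every vision and language vector, scaled.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the vision-to-language and language-to-vision directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_transposed))


# Toy batch: matched pairs should give a much lower loss than swapped pairs.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing a loss of this kind pulls each language description toward the visual features of its own target and away from the other targets in the batch, which is one way to read "align vision-language features in the semantic space".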

Original language: English
Article number: 290
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 21
Issue number: 10
DOI
Publication status: Published - 15 Oct 2025
