TY - JOUR
T1 - Language-guided Visual Tracking
T2 - Comprehensive and Effective Multimodal Information Fusion
AU - Song, Jianbo
AU - Zhang, Hong
AU - Feng, Yachun
AU - Liu, Hanyang
AU - Yang, Yifan
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/10/15
Y1 - 2025/10/15
N2 - Current vision-language trackers often struggle to fuse multimodal information comprehensively and effectively, leading to suboptimal performance in multimodal tasks. This study introduces LGTrack, a novel language-guided visual tracking framework designed to achieve a more comprehensive and efficient fusion of vision and language information. In the encoding stage, an Enhanced Multimodal Interaction Module is proposed to achieve full multimodal fusion, and it is used to construct Early Language Multilevel-guided Multimodal Encoding, which leverages deep semantic information for early and multilevel guidance of vision encoding. In the decoding stage, a multimodal decoding based on Joint Query is proposed, utilizing global features from both vision and language modalities, guiding the efficient operation of the decoding layers. These innovations achieve a more comprehensive fusion of multimodal information. Additionally, a contrastive learning strategy is introduced to align vision-language features in the semantic space, further enhancing the fusion effectiveness. Extensive experiments on multiple benchmarks such as LaSOT, TNL2K, and OTB99-Lang demonstrate that our approach outperforms existing state-of-the-art trackers.
AB - Current vision-language trackers often struggle to fuse multimodal information comprehensively and effectively, leading to suboptimal performance in multimodal tasks. This study introduces LGTrack, a novel language-guided visual tracking framework designed to achieve a more comprehensive and efficient fusion of vision and language information. In the encoding stage, an Enhanced Multimodal Interaction Module is proposed to achieve full multimodal fusion, and it is used to construct Early Language Multilevel-guided Multimodal Encoding, which leverages deep semantic information for early and multilevel guidance of vision encoding. In the decoding stage, a multimodal decoding based on Joint Query is proposed, utilizing global features from both vision and language modalities, guiding the efficient operation of the decoding layers. These innovations achieve a more comprehensive fusion of multimodal information. Additionally, a contrastive learning strategy is introduced to align vision-language features in the semantic space, further enhancing the fusion effectiveness. Extensive experiments on multiple benchmarks such as LaSOT, TNL2K, and OTB99-Lang demonstrate that our approach outperforms existing state-of-the-art trackers.
KW - Joint Query
KW - Vision-language tracking
KW - early language multilevel guidance
KW - multimodal alignment
UR - https://www.scopus.com/pages/publications/105019640462
U2 - 10.1145/3757322
DO - 10.1145/3757322
M3 - Article
AN - SCOPUS:105019640462
SN - 1551-6857
VL - 21
JO - ACM Transactions on Multimedia Computing, Communications and Applications
JF - ACM Transactions on Multimedia Computing, Communications and Applications
IS - 10
M1 - 290
ER -