Abstract
Referring multi-object tracking (RMOT) aims to identify specific targets based on sentence descriptions. To enhance multi-modal learning, previous works typically rely on a simple fusion module at early or late stages. However, those methods frequently underutilize textual semantics and struggle to model the relationships between region-level features and word-level features. To address these limitations, we propose CGATracker, a correlation-aware graph alignment method for RMOT, which facilitates precise relationship modeling through relational scoring. Specifically, we design a Language-driven Relational Alignment (LRA) module, which establishes two connection graphs to generate positive and negative samples for the visual-textual alignment. Additionally, to effectively leverage referring information, we introduce a Semantic Clarify Booster (SCBooster) module based on a semantic infusion mechanism and a bias-aware verification mechanism for interactions with different modalities. Moreover, by designing a Multi-level Cross-modal Fusion (MCF) module, our method aggregates contextual features at multiple depths to enable the creation of the enriched correlation-aware graph. Extensive experiments conducted on the Refer-KITTI and Refer-KITTI-V2 datasets demonstrate the effectiveness of CGATracker.
| Original language | English |
|---|---|
| Pages (from-to) | 11337-11349 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 35 |
| Issue number | 11 |
| State | Published - 2025 |
Keywords
- Referring multi-object tracking
- end-to-end tracking
- multi-modal learning
Fingerprint
Dive into the research topics of 'CGATracker: Correlation-Aware Graph Alignment for Referring Multi-Object Tracking'. Together they form a unique fingerprint.