CGATracker: Correlation-Aware Graph Alignment for Referring Multi-Object Tracking

  • Siping Zhuang
  • Guangyao Li
  • Qiangqiang Wu
  • Yang Lu
  • Hai Miao Hu
  • Hanzi Wang*
  • *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Referring multi-object tracking (RMOT) aims to identify specific targets based on sentence descriptions. To enhance multi-modal learning, previous works typically rely on a simple fusion module at early or late stages. However, those methods frequently underutilize textual semantics and struggle to model the relationships between region-level features and word-level features. To address these limitations, we propose CGATracker, a correlation-aware graph alignment method for RMOT, which facilitates precise relationship modeling through relational scoring. Specifically, we design a Language-driven Relational Alignment (LRA) module, which establishes two connection graphs to generate positive and negative samples for the visual-textual alignment. Additionally, to effectively leverage referring information, we introduce a Semantic Clarify Booster (SCBooster) module based on a semantic infusion mechanism and a bias-aware verification mechanism for interactions with different modalities. Moreover, by designing a Multi-level Cross-modal Fusion (MCF) module, our method aggregates contextual features at multiple depths to enable the creation of the enriched correlation-aware graph. Extensive experiments conducted on the Refer-KITTI and Refer-KITTI-V2 datasets demonstrate the effectiveness of CGATracker.
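The abstract does not give formulas for relational scoring or for how the positive and negative samples are used, so the following is only a minimal sketch of the general idea, not the CGATracker implementation. It assumes cosine similarity as the region-word score and an InfoNCE-style contrastive loss over the resulting score matrix; the names `relation_scores`, `alignment_loss`, `positive_edges`, and `temperature` are hypothetical and chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def relation_scores(regions, words):
    """Score matrix S[i][j]: similarity of region feature i to word feature j.

    Assumption: cosine similarity stands in for the paper's relational scoring.
    """
    return [[cosine(r, w) for w in words] for r in regions]

def alignment_loss(scores, positive_edges, temperature=0.1):
    """InfoNCE-style loss over region-word edges (an assumed objective).

    Each (region, word) pair in positive_edges is treated as a positive edge;
    every other word in that region's row acts as a negative sample.
    """
    total = 0.0
    for i, j in positive_edges:
        logits = [s / temperature for s in scores[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[j]
    return total / len(positive_edges)

# Toy features: two regions and two words in a shared 2-D embedding space.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[0.9, 0.1], [0.1, 0.9]]

scores = relation_scores(regions, words)
matched = alignment_loss(scores, [(0, 0), (1, 1)])      # correct pairing
mismatched = alignment_loss(scores, [(0, 1), (1, 0)])   # swapped pairing
```

In this toy setup the correctly paired edges give a much lower loss than the swapped ones, which is the behaviour a visual-textual alignment objective of this kind is meant to encourage.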

Original language: English
Pages (from-to): 11337-11349
Number of pages: 13
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 35
Issue number: 11
DOIs
State: Published - 2025

Keywords

  • Referring multi-object tracking
  • end-to-end tracking
  • multi-modal learning
