
CFMMC-Align: Coarse-Fine Multi-Modal Contrastive Alignment Network for Traffic Event Video Question Answering

  • Beihang University
  • Beijing University of Technology
  • The University of Sydney

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Traffic video question answering (TrafficVQA) constitutes a specialized VideoQA task designed to enhance the basic comprehension and intricate reasoning capacities of videos, specifically focusing on traffic events. Recent VideoQA models employ pretrained visual and textual encoder models to bridge the feature space gap between visual and textual data. However, in addressing the unique challenges inherent to the TrafficVQA task, three pivotal issues must be addressed: (i) Dimension Gap: Between the pretrained image (appearance feature) and video (motion feature) models, there exists a conspicuous dimension difference in static and dynamic visual data; (ii) Scene Gap: The common real-world datasets and the traffic event datasets differ in visual scene content; (iii) Modality Gap: A pronounced feature distribution discrepancy emerges between traffic video and text data. To alleviate these challenges, we introduce the coarse-fine multimodal contrastive alignment network (CFMMC-Align). This model leverages sequence-level and token-level multimodal features, grounded in an unsupervised visual multimodal contrastive loss to mitigate dimension and scene gaps and a supervised visual-textual contrastive loss to alleviate modality discrepancies. Finally, the model is validated on the challenging public TrafficVQA dataset SUTD-TrafficQA and outperforms the state-of-the-art method by a substantial margin (50.2% compared to 46.0%). The code is available at https://github.com/guokan987/CFMMC-Align.
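
The abstract does not spell out the exact form of its contrastive losses. As a rough illustration of the general idea only, the sketch below implements a symmetric InfoNCE-style loss between paired video and text embeddings at the sequence level (one vector per sample). It is not the authors' implementation: the function name `info_nce`, the temperature value of 0.07, and the embedding shapes are all assumptions chosen for the example, and the token-level and unsupervised visual losses mentioned in the abstract are not covered. The linked repository remains the reference for the actual method.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not from the paper).

    Row i of `a` and row i of `b` are treated as the positive pair;
    every other row in the batch serves as a negative.
    """
    a = l2_normalize(a)
    b = l2_normalize(b)
    logits = a @ b.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


# Toy data: 4 samples with 16-dimensional embeddings (sizes are arbitrary).
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
text_aligned = video + 0.01 * rng.normal(size=(4, 16))  # close to its video
text_shuffled = text_aligned[[1, 2, 3, 0]]               # pairs deliberately mismatched

loss_aligned = info_nce(video, text_aligned)
loss_shuffled = info_nce(video, text_shuffled)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

In this toy setup the loss is small when each text embedding sits close to its own video embedding and large when the pairing is shuffled, which is the behaviour a modality-alignment objective of this kind relies on.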

Original language: English
Pages (from-to): 10538-10550
Number of pages: 13
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 11
DOI
Publication status: Published - 2024
