
Hierarchical bi-directional conceptual interaction for text-video retrieval

  • Wenpeng Han
  • Guanglin Niu
  • Mingliang Zhou
  • Xiaowei Zhang*
  • *Corresponding author for this work
  • Qingdao University
  • Chongqing University

Research output: Contribution to journal, Article, peer-review

Abstract

Large pre-trained vision-language models (VLMs) used in text-video retrieval have demonstrated strong cross-modal image-text understanding. Existing works use VLMs to extract features and design fine-grained uni-directional interaction from text to video to enhance the model's visual understanding. However, the large cross-modal gap makes it difficult to fully match video-text mutual information through uni-directional cross-modal interaction alone. To this end, we propose a novel hierarchical bi-directional conceptual interaction (HBCI) method, which applies mutual attention over multi-granularity decoupled video and text features to enhance cross-modal alignment. First, we introduce text-guided attention to extract visual representations among hierarchical concepts, and decouple the multi-granularity features from video and text to find representation subspaces with maximal relevance to each other. Furthermore, we construct an iterative bi-directional conceptual interaction (BCI) module to reason about semantic associations across the text and video modalities; it generates attention weights adaptively based on the decoupled video-text concepts and projects them into the other modality to facilitate deep cross-modal interaction. Finally, we implement cross-level similarity distillation to progressively propagate the knowledge-aware similarity. Extensive experiments show that the proposed HBCI consistently delivers exceptional performance on the MSR-VTT, DiDeMo and ActivityNet datasets.
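The bi-directional interaction described in the abstract can be illustrated with a toy sketch. This is not the authors' HBCI implementation: it omits text-guided attention, feature decoupling and similarity distillation, and shows only the general idea of two modalities attending to each other over a few iterations. All function names (`cross_attend`, `bidirectional_interaction`, `mean_pool`, `cosine`), the residual update and the scaled dot-product scoring are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Replace each query with an attention-weighted mix of the context vectors."""
    scale = math.sqrt(len(queries[0]))
    mixed = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        mixed.append([
            sum(w * c[d] for w, c in zip(weights, context))
            for d in range(len(q))
        ])
    return mixed


def bidirectional_interaction(text, video, steps=2):
    """Hypothetical iterative update: text attends to video and video attends to text."""
    for _ in range(steps):
        text_ctx = cross_attend(text, video)   # text -> video direction
        video_ctx = cross_attend(video, text)  # video -> text direction
        # Residual update (an assumption) keeps each modality's own features.
        text = [[t + c for t, c in zip(tv, cv)] for tv, cv in zip(text, text_ctx)]
        video = [[v + c for v, c in zip(vv, cv)] for vv, cv in zip(video, video_ctx)]
    return text, video


def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    # Toy features: 2 word vectors and 3 frame vectors in a 2-d space.
    words = [[1.0, 0.0], [0.0, 1.0]]
    frames = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    words_out, frames_out = bidirectional_interaction(words, frames)
    print(cosine(mean_pool(words_out), mean_pool(frames_out)))
```

In this sketch both directions are computed from the same pre-update features within each step, so neither modality is privileged; a uni-directional variant would simply drop the `video_ctx` update.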

Original language: English
Article number: 317
Journal: Multimedia Systems
Volume: 30
Issue number: 6
State: Published - Dec 2024

Keywords

  • Bi-directional interaction
  • CLIP
  • Cross-modal alignment
  • Text-video retrieval
