Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization

  • Chenhao Cui
  • , Xinnian Liang
  • , Shuangzhi Wu
  • , Zhoujun Li*
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Natural language video localization (NLVL) aims to locate the matching span relevant to a given query sentence from an untrimmed video. This task requires not only understanding video and text but also aligning the semantics between video and language. Existing methods obtain vision-language representations via separate encoders, cross-modal interactions are not fine-grained enough, and the semantics are not fully aligned. In this paper, we address the vision-language alignment via joint modeling and contrastive learning. We propose a unified Video-Language Representation Network (UniNet), employing a transformer encoder to learn vision-language representations aligned. Simultaneously taking video and text as input, the encoder jointly learns the representations of both and captures the inter-relations between video and text. Then the representations are used by the predictor to locate the grounding video span. Besides, we train our model with contrastive learning to enhance vision-language representations in the training stage. Experiments on three benchmark datasets show that UniNet outperforms the baseline methods and adopting unified representation and contrastive learning can improve vision-language semantic alignment.

Original languageEnglish
Title of host publicationIJCNN 2023 - International Joint Conference on Neural Networks, Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)9781665488679
DOIs
StatePublished - 2023
Event2023 International Joint Conference on Neural Networks, IJCNN 2023 - Gold Coast, Australia
Duration: 18 Jun 202323 Jun 2023

Publication series

NameProceedings of the International Joint Conference on Neural Networks
Volume2023-June

Conference

Conference2023 International Joint Conference on Neural Networks, IJCNN 2023
Country/TerritoryAustralia
CityGold Coast
Period18/06/2323/06/23

Fingerprint

Dive into the research topics of 'Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization'. Together they form a unique fingerprint.

Cite this