
Bidirectional Maximum Entropy Training with Word Co-Occurrence for Video Captioning

  • Beihang University

Research output: Contribution to journal › Article › peer-review

Abstract

Video captioning aims to generate natural language descriptions of a given video. It is more challenging than static image captioning because it requires more diverse and exhaustive output. The generated captions should also match human language habits at a fine granularity. In this work, unlike most recent works that improve performance with additional data modalities or complex model designs, we focus on optimizing the training process of video captioning models. First, to generate more diverse video captions, we propose bidirectional maximum entropy (BME) training, which directly optimizes the probability distribution of neighboring words under a reinforcement learning (RL) framework. Second, to search for more human-like captions in the larger search space created by BME, we introduce word co-occurrence (WCO) weighting, which adaptively guides RL algorithms with co-occurrence statistics from the training corpus. Our method can be deployed on existing captioning models in a plug-and-play manner without introducing any extra parameters. Experimental results show that our method improves the CIDEr score by up to 5.8% on MSVD and up to 7.0% on MSR-VTT.
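The abstract does not give the BME or WCO formulas, so the following is only a minimal sketch of the two generic ingredients it names: co-occurrence statistics of neighboring words in a training corpus, and the entropy of a next-word distribution. The function names, the toy captions, and the choice of adjacent-pair counting are assumptions made for illustration. They are not the authors' formulation.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_probs(captions):
    """Estimate P(next_word | word) from adjacent word pairs in the captions."""
    pair_counts = defaultdict(Counter)
    for caption in captions:
        tokens = caption.lower().split()
        # Count each (word, following word) pair once per occurrence.
        for left, right in zip(tokens, tokens[1:]):
            pair_counts[left][right] += 1
    probs = {}
    for left, counter in pair_counts.items():
        total = sum(counter.values())
        probs[left] = {right: n / total for right, n in counter.items()}
    return probs


def entropy(dist):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


# Toy corpus (hypothetical, for illustration only).
toy_captions = [
    "a man is playing guitar",
    "a man is singing",
    "a woman is playing piano",
]
probs = cooccurrence_probs(toy_captions)
```

In this toy corpus, "man" is always followed by "is", so that distribution has zero entropy, while "is" is followed by two different words and has higher entropy. How the paper turns such statistics into RL reward weights, or how it defines the bidirectional entropy term, is not stated in the abstract.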

Original language: English
Pages (from-to): 4494-4507
Number of pages: 14
Journal: IEEE Transactions on Multimedia
Volume: 25
State: Published - 2023

Keywords

  • Video captioning
  • bidirectional maximum entropy
  • word co-occurrence
