Abstract
Video captioning aims to generate natural language descriptions for a given video. It is more challenging than static image captioning because it requires more diverse and exhaustive descriptions. The generated captions should also be consistent with human language habits at a fine granularity. In this work, unlike most recent works that enhance performance with additional data modalities or complex model designs, we focus on optimizing the training process of video captioning models. First, to generate more diverse video captions, we propose bidirectional maximum entropy (BME) training, which directly optimizes the probability distribution of neighboring words under a reinforcement learning (RL) framework. Second, to search for more human-like captions in the larger search space created by BME, we introduce word co-occurrence (WCO) weighting, which adaptively guides RL algorithms with co-occurrence statistics from the training corpus. Our method can be deployed on existing captioning models in a plug-and-play manner without introducing any extra parameters. Experimental results show that our method improves the CIDEr score by up to 5.8% on MSVD and up to 7.0% on MSR-VTT.
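To make the two ingredients named in the abstract more concrete, the following is a minimal sketch of what "co-occurrence statistics in the training corpus" and the entropy of a neighboring-word distribution could look like on a toy corpus. It is an illustration under stated assumptions, not the paper's formulation: the function names (`cooccurrence_counts`, `neighbor_distribution`, `entropy`), the restriction to adjacent word pairs, and the toy captions are all hypothetical, and the abstract does not specify how BME or WCO are actually computed.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(captions, backward=False):
    """Count adjacent word pairs in a list of captions.

    Assumption for illustration: "co-occurrence" means direct adjacency.
    With backward=False the table maps word -> Counter of next words;
    with backward=True it maps word -> Counter of previous words.
    """
    table = defaultdict(Counter)
    for caption in captions:
        words = caption.lower().split()
        for left, right in zip(words, words[1:]):
            if backward:
                table[right][left] += 1
            else:
                table[left][right] += 1
    return table


def neighbor_distribution(table, word):
    """Normalize the neighbor counts of `word` into a probability distribution."""
    counts = table.get(word, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {neighbor: count / total for neighbor, count in counts.items()}


def entropy(distribution):
    """Shannon entropy (in nats) of a discrete distribution given as a dict."""
    return -sum(p * math.log(p) for p in distribution.values() if p > 0)


# Hypothetical toy corpus standing in for training captions.
toy_captions = [
    "a man is playing a guitar",
    "a man is singing",
    "a woman is playing a piano",
]

forward = cooccurrence_counts(toy_captions)
backward = cooccurrence_counts(toy_captions, backward=True)

# Distribution over words that follow "is", and over words that precede it.
after_is = neighbor_distribution(forward, "is")
before_is = neighbor_distribution(backward, "is")

print("next words after 'is':", after_is)
print("previous words before 'is':", before_is)
print("entropy of next-word distribution:", entropy(after_is))
```

In this sketch a higher entropy means the neighbors of a word are spread more evenly, which is the kind of diversity a maximum entropy objective rewards, while the count tables indicate which word pairs are common in the corpus. How the paper combines such quantities with the RL reward is not stated in the abstract.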
| Original language | English |
|---|---|
| Pages (from-to) | 4494-4507 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 25 |
| State | Published - 2023 |
Keywords
- Video captioning
- Bidirectional maximum entropy
- Word co-occurrence
Title
Bidirectional Maximum Entropy Training with Word Co-Occurrence for Video Captioning