TY - JOUR
T1 - Bidirectional Maximum Entropy Training with Word Co-Occurrence for Video Captioning
AU - Liu, Sheng
AU - Li, Annan
AU - Wang, Jiahao
AU - Wang, Yunhong
N1 - Publisher Copyright: © 1999-2012 IEEE.
PY - 2023
Y1 - 2023
N2 - Video captioning aims to generate natural language descriptions for a given video, which is a more challenging task than static image captioning since it requires a more diverse and exhaustive result. Meanwhile, it is also important that the generated captions should be consistent with the language habits of people at a fine granularity. In this work, unlike most recent works enhancing performance with additional data modalities or complex model designs, we focus on optimizing the training process of video captioning models. Firstly, to generate a more diverse video caption, we propose the bidirectional maximum entropy (BME) training, which directly optimizes the probability distribution of neighboring words under a reinforcement learning (RL) framework. Secondly, to search for more human-like captions in the larger search space created by BME, we introduce the word co-occurrence (WCO) weighting. It adaptively guides RL algorithms with co-occurrence statistics in the training corpus. Our method can be deployed on existing captioning models in a plug-and-play manner without introducing any extra parameters. Experimental results show that our method yields up to 5.8% and 7.0% improvements considering the CIDEr score on MSVD and MSR-VTT, respectively.
KW - Video captioning
KW - bidirectional maximum entropy
KW - word co-occurrence
UR - https://www.scopus.com/pages/publications/85130838130
U2 - 10.1109/TMM.2022.3177308
DO - 10.1109/TMM.2022.3177308
M3 - Article
AN - SCOPUS:85130838130
SN - 1520-9210
VL - 25
SP - 4494
EP - 4507
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -