Parallel Dense Video Caption Generation with Multi-Modal Features

  • Xuefei Huang
  • Ka Hou Chan
  • Wei Ke*
  • Hao Sheng
  • *Corresponding author for this work
  • Macao Polytechnic University

Research output: Contribution to journal, article, peer-reviewed

Abstract

The task of dense video captioning is to generate detailed natural-language descriptions for a given video, which requires in-depth analysis and mining of semantic captions to identify the events in the video. Existing methods typically localise events first and caption them afterwards within a given frame sequence, so caption generation depends heavily on which objects have been detected. This work proposes a parallel dense video captioning method that handles event proposals and captions simultaneously, addressing the mutual constraint between them. Additionally, a deformable Transformer framework is introduced to reduce or eliminate the manual thresholding of hyperparameters in such methods. An information transfer station is also added to organise representations: it receives the hidden features extracted from a frame and implicitly generates multiple event proposals. The proposed method also adopts long short-term memory (LSTM) with deformable attention as the main layer for caption generation. Experimental results on the ActivityNet Captions dataset show that the proposed method provides competitive results, outperforming other methods in this area to a certain degree.

Original language: English
Article number: 3685
Journal: Mathematics
Volume: 11
Issue: 17
DOI
Publication status: Published - Sep 2023