TY - GEN
T1 - TopicDVC
T2 - 10th IEEE International Conference on Edge Computing and Scalable Cloud, EdgeCom 2024
AU - Chen, Wei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Dense video captioning involves detecting and describing multiple events within a video coherently. Events within a video typically share a common topic, and incorporating this topic information into the model can enhance the quality and coherence of the generated captions. However, existing dense video captioning datasets lack explicit topic annotations. To address this, we design a topic generator that utilizes a diffusion model with strong generative capabilities to generate topic information from a given video. During the training stage, we use the features of ground-truth captions as pseudo-topic labels. The video topics diffuse from the features of ground-truth captions to a random distribution, and the model learns to reverse this noising process conditioned on video features. During inference, the model iteratively denoises Gaussian noise into topic features conditioned on video features. In this paper, we propose TopicDVC, a novel dense video captioning framework. TopicDVC applies the topic information generated by the topic generator to guide the model in generating more coherent captions. Experiments on the ActivityNet Captions dataset demonstrate that leveraging the topics generated by the diffusion model significantly improves the performance of dense video captioning, producing more accurate and coherent captions.
KW - coherent
KW - dense video captioning
KW - diffusion model
KW - pseudo label
KW - topic
UR - https://www.scopus.com/pages/publications/85202436765
U2 - 10.1109/EdgeCom62867.2024.00020
DO - 10.1109/EdgeCom62867.2024.00020
M3 - Conference contribution
AN - SCOPUS:85202436765
T3 - Proceedings - 2024 IEEE 10th International Conference on Edge Computing and Scalable Cloud, EdgeCom 2024
SP - 82
EP - 87
BT - Proceedings - 2024 IEEE 10th International Conference on Edge Computing and Scalable Cloud, EdgeCom 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 28 June 2024 through 30 June 2024
ER -