STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

  • Rui Su
  • Qian Yu
  • Dong Xu*
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a one-stage visual-linguistic transformer based framework called STVGBert for the STVG task, which can simultaneously localize the target object in both the spatial and temporal domains. Specifically, without resorting to pre-generated object proposals, our STVGBert directly takes a video and a query sentence as the input, and then produces the cross-modal features by using the newly introduced cross-modal feature learning module ST-ViLBert. Based on the cross-modal features, our method then generates bounding boxes and predicts the starting and ending frames to produce the predicted object tube. To the best of our knowledge, our STVGBert is the first one-stage method that can handle the STVG task without relying on any pre-trained object detectors. Comprehensive experiments demonstrate that our newly proposed framework outperforms the state-of-the-art multi-stage methods on two benchmark datasets, VidSTG and HC-STVG.
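
The final step described in the abstract, combining per-frame bounding boxes with predicted starting and ending frames into an object tube, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assemble_tube`, the `(x1, y1, x2, y2)` box format, and the choice to clamp out-of-range boundaries are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def assemble_tube(frame_boxes: List[Box], start: int, end: int) -> Dict[int, Box]:
    """Keep only the per-frame boxes inside the predicted temporal segment.

    frame_boxes holds one box per video frame; start and end are the
    predicted (inclusive) boundary frame indices. Returns a mapping from
    frame index to box, i.e. a simple spatio-temporal tube.
    """
    if not frame_boxes:
        return {}
    # Clamp the predicted boundaries to valid frame indices (an assumption).
    start = max(0, start)
    end = min(len(frame_boxes) - 1, end)
    if start > end:
        # An inverted segment yields an empty tube.
        return {}
    return {t: frame_boxes[t] for t in range(start, end + 1)}


# Toy example: six frames with a box drifting one pixel to the right per frame.
boxes = [(10.0 + t, 20.0, 50.0 + t, 80.0) for t in range(6)]
tube = assemble_tube(boxes, start=2, end=4)
print(sorted(tube))
```

Representing the tube as a mapping from frame index to box keeps the spatial and temporal predictions separable, which mirrors the abstract's description of boxes and temporal boundaries being produced from the same cross-modal features.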

Original language: English
Title of host publication: Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1513-1522
Number of pages: 10
ISBN (Electronic): 9781665428125
DOI
Publication status: Published - 2021
Event: 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada
Duration: 11 Oct 2021 to 17 Oct 2021

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print): 1550-5499

Conference

Conference: 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
Country/Territory: Canada
Location: Virtual, Online
Period: 11/10/21 to 17/10/21
