TASTA: Text-Assisted Spatial and Temporal Attention Network for Video Question Answering

  • Beihang University
  • Fujian Normal University
  • Université de technologie de Troyes

Research output: Contribution to journal › Article › peer-review

Abstract

Video question answering (VideoQA) is a typical task that integrates language and vision. The key to VideoQA is extracting the visual information that is relevant and effective for answering a specific question. Information selection is believed to be necessary for this task because videos contain a large amount of irrelevant information, and explicitly learning an attention model can be a reasonable and effective way to perform the selection. Herein, a novel VideoQA model called Text-Assisted Spatial and Temporal Attention Network (TASTA) is proposed, which shows the great potential of explicitly modeling attention. TASTA is designed to be simple, small, clean, and efficient, so that its performance can be clearly justified and the model can be easily extended. Its success stems mainly from two new strategies that make better use of the textual information. Experimental results on TGIF-QA, a large and highly representative dataset, show the significant superiority of TASTA over the state of the art, and ablation studies demonstrate the effectiveness of its key components.

Original language: English
Article number: 2200131
Journal: Advanced Intelligent Systems
Volume: 5
Issue number: 4
DOI
Publication status: Published - Apr 2023
