TY - JOUR
T1 - TASTA
T2 - Text-Assisted Spatial and Temporal Attention Network for Video Question Answering
AU - Wang, Tian
AU - Hou, Boyao
AU - Li, Jiakun
AU - Shi, Peng
AU - Zhang, Baochang
AU - Snoussi, Hichem
N1 - Publisher Copyright:
© 2023 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH.
PY - 2023/4
Y1 - 2023/4
AB - Video question answering (VideoQA) is a typical task that integrates language and vision. The key for VideoQA is to extract relevant and effective visual information for answering a specific question. Information selection is believed to be necessary for this task due to the large amount of irrelevant information in the video, and explicitly learning an attention model can be a reasonable and effective solution for the selection. Herein, a novel VideoQA model called Text-Assisted Spatial and Temporal Attention Network (TASTA) is proposed, which shows the great potential of explicitly modeling attention. TASTA is made to be simple, small, clean, and efficient for clear performance justification and possible easy extension. Its success is mainly from two new strategies of better using the textual information. Experimental results on a large and most representative dataset, TGIF-QA, show the significant superiority of TASTA w.r.t. the state-of-the-art and demonstrate the effectiveness of its key components via ablation studies.
KW - attention mechanism
KW - video question answering
KW - visual question answering
UR - https://www.scopus.com/pages/publications/85165798498
U2 - 10.1002/aisy.202200131
DO - 10.1002/aisy.202200131
M3 - Article
AN - SCOPUS:85165798498
SN - 2640-4567
VL - 5
JO - Advanced Intelligent Systems
JF - Advanced Intelligent Systems
IS - 4
M1 - 2200131
ER -