TY - GEN
T1 - Examine before you answer
T2 - 26th ACM Multimedia conference, MM 2018
AU - Gao, Lianli
AU - Zeng, Pengpeng
AU - Song, Jingkuan
AU - Liu, Xianglong
AU - Shen, Heng Tao
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/10/15
Y1 - 2018/10/15
N2 - Multiple-choice (MC) Visual Question Answering (VQA) is a similar but essentially different task to open-ended VQA because the answer options are provided. Most of existing works tackle them in a unified pipeline by solving a multi-class problem to infer the best answer from a predefined answer set. The option that matches the best answer is selected for MC VQA. Nevertheless, this violates human thinking logics. Normally, people examine the questions, answer options and the reference image before inferring a MC VQA. For MC VQA, human either rely on the question and answer options to directly deduce a correct answer if the question is not image-related, or read the question and answer options and then purposefully search for answers in a reference image. Therefore, we propose a novel approach, namely Multi-task Learning with Adaptive-attention (MTA), to simulate human logics for MC VQA. Specifically, we first fuse the answer options and question features, and then adaptively attend to the visual features for inferring a MC VQA. Furthermore, we design our model as a multi-task learning architecture by integrating the open-ended VQA task to further boost the performance of MC VQA. We evaluate our approach on two standard benchmark datasets: VQA and Visual7W and our approach sets new records on both datasets for MC VQA task, reaching 73.5% and 65.9% average accuracy respectively.
AB - Multiple-choice (MC) Visual Question Answering (VQA) is a similar but essentially different task to open-ended VQA because the answer options are provided. Most of existing works tackle them in a unified pipeline by solving a multi-class problem to infer the best answer from a predefined answer set. The option that matches the best answer is selected for MC VQA. Nevertheless, this violates human thinking logics. Normally, people examine the questions, answer options and the reference image before inferring a MC VQA. For MC VQA, human either rely on the question and answer options to directly deduce a correct answer if the question is not image-related, or read the question and answer options and then purposefully search for answers in a reference image. Therefore, we propose a novel approach, namely Multi-task Learning with Adaptive-attention (MTA), to simulate human logics for MC VQA. Specifically, we first fuse the answer options and question features, and then adaptively attend to the visual features for inferring a MC VQA. Furthermore, we design our model as a multi-task learning architecture by integrating the open-ended VQA task to further boost the performance of MC VQA. We evaluate our approach on two standard benchmark datasets: VQA and Visual7W and our approach sets new records on both datasets for MC VQA task, reaching 73.5% and 65.9% average accuracy respectively.
UR - https://www.scopus.com/pages/publications/85058210960
U2 - 10.1145/3240508.3240687
DO - 10.1145/3240508.3240687
M3 - 会议稿件
AN - SCOPUS:85058210960
T3 - MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
SP - 1742
EP - 1750
BT - MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
PB - Association for Computing Machinery, Inc
Y2 - 22 October 2018 through 26 October 2018
ER -