TY - GEN
T1 - Joint Visual Perception and Linguistic Commonsense for Daily Events Causality Reasoning
AU - Ma, Bole
AU - Tong, Chao
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Multimodal networks that juxtapose visual and linguistic modalities are currently widely adopted for solving vision-and-language tasks. They perform well in simple and intuitive tasks, but are prone to mistakes in tasks involving latent or implicit details, due to the difficulty of capturing crucial but imperceptible visual signals in the real world. Perception errors lead to nonsensical results, but can be corrected by commonsense knowledge. To this end, we combine visual perception and linguistic commonsense to solve the challenging daily events causality reasoning task. We propose a novel Object-Aware Reasoning Network to focus on object interaction while ignoring distracting information to refine visual perception. Further, a language branch with an independent prediction head is supervised to learn causality commonsense to help correct obvious perception errors, resulting in more plausible conclusions. Extensive experiments demonstrate that our method achieves new state-of-the-art results on the Vis-Causal dataset.
AB - Multimodal networks that juxtapose visual and linguistic modalities are currently widely adopted for solving vision-and-language tasks. They perform well in simple and intuitive tasks, but are prone to mistakes in tasks involving latent or implicit details, due to the difficulty of capturing crucial but imperceptible visual signals in the real world. Perception errors lead to nonsensical results, but can be corrected by commonsense knowledge. To this end, we combine visual perception and linguistic commonsense to solve the challenging daily events causality reasoning task. We propose a novel Object-Aware Reasoning Network to focus on object interaction while ignoring distracting information to refine visual perception. Further, a language branch with an independent prediction head is supervised to learn causality commonsense to help correct obvious perception errors, resulting in more plausible conclusions. Extensive experiments demonstrate that our method achieves new state-of-the-art results on the Vis-Causal dataset.
KW - Causality reasoning
KW - commonsense knowledge
KW - multimodal network
KW - visual perception
UR - https://www.scopus.com/pages/publications/85137730195
U2 - 10.1109/ICME52920.2022.9859679
DO - 10.1109/ICME52920.2022.9859679
M3 - Conference contribution
AN - SCOPUS:85137730195
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - ICME 2022 - IEEE International Conference on Multimedia and Expo 2022, Proceedings
PB - IEEE Computer Society
T2 - 2022 IEEE International Conference on Multimedia and Expo, ICME 2022
Y2 - 18 July 2022 through 22 July 2022
ER -