TY - GEN
T1 - Global Context Enhanced Multi-modal Fusion for Referring Image Segmentation
AU - Yang, Jianhua
AU - Huang, Yan
AU - Huang, Linjiang
AU - Wang, Yunbo
AU - Ma, Zhanyu
AU - Wang, Liang
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
AB - Referring image segmentation is a challenging task that aims to segment the object of interest in an image according to a natural language expression. Most existing works directly concatenate the global language representation with local visual features and then apply a convolutional operation to fuse the two modalities. These works ignore that global contextual information from vision is essential for vision-language fusion and for inferring the referred objects. The global context establishes a perception of the full image, so its fusion with the global language representation helps reduce mislabeled pixels of similar objects in an image. To address this issue, we propose a global fusion network (GFNet), which is composed of a visual guided global fusion module and a language guided global fusion module. By modeling the expression-region interactions, the two modules aggregate the expression-related visual contextual information and fuse it with the global representation of the language expression. Moreover, to alleviate the distribution differences between the two modalities, we introduce a channel-wise self-gate on the concatenated visual-language features. We validate the proposed network on four standard datasets; the experimental results show that our approach outperforms state-of-the-art methods.
KW - Attention mechanism
KW - Natural language expression
KW - Semantic segmentation
UR - https://www.scopus.com/pages/publications/85093835175
U2 - 10.1007/978-3-030-60633-6_36
DO - 10.1007/978-3-030-60633-6_36
M3 - Conference contribution
AN - SCOPUS:85093835175
SN - 9783030606329
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 434
EP - 446
BT - Pattern Recognition and Computer Vision - 3rd Chinese Conference, PRCV 2020, Proceedings
A2 - Peng, Yuxin
A2 - Zha, Hongbin
A2 - Liu, Qingshan
A2 - Lu, Huchuan
A2 - Sun, Zhenan
A2 - Liu, Chenglin
A2 - Chen, Xilin
A2 - Yang, Jian
PB - Springer Science and Business Media Deutschland GmbH
T2 - 3rd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2020
Y2 - 16 October 2020 through 18 October 2020
ER -