TY - GEN
T1 - Context and Apparent Features Aggregation Network for Semantic Segmentation
AU - Dong, Lusen
AU - Wang, Fei
AU - Zheng, Jin
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - A convolutional neural network (CNN) has a local receptive field and struggles to build long-range spatial dependencies, while a vision transformer (ViT) can capture long-range context dependencies but loses local detail features during the token embedding procedure. Aggregating apparent features and context information is helpful for semantic segmentation. In this paper, the roles of low-level apparent features and context information in semantic segmentation are carefully analyzed, and a layer attention module is proposed to finely aggregate low-level features and context information. First, we propose various CNN branches to extract shallow features, such as edges and texture, from an input image. Meanwhile, we use a ViT backbone to extract rich context information. Second, we integrate the CNN branches and the ViT in a unified network, and propose a layer attention module to fuse the context information and low-level detailed features. Based on the unified network, which amounts to a ViT enhanced with low-level convolution, accurate semantic segmentation is achieved. We test our method on the public Cityscapes dataset. Numerous experiments show that our method achieves competitive results. Code is available at: https://github.com/cocolord/Degraded_image_segmentation.
AB - A convolutional neural network (CNN) has a local receptive field and struggles to build long-range spatial dependencies, while a vision transformer (ViT) can capture long-range context dependencies but loses local detail features during the token embedding procedure. Aggregating apparent features and context information is helpful for semantic segmentation. In this paper, the roles of low-level apparent features and context information in semantic segmentation are carefully analyzed, and a layer attention module is proposed to finely aggregate low-level features and context information. First, we propose various CNN branches to extract shallow features, such as edges and texture, from an input image. Meanwhile, we use a ViT backbone to extract rich context information. Second, we integrate the CNN branches and the ViT in a unified network, and propose a layer attention module to fuse the context information and low-level detailed features. Based on the unified network, which amounts to a ViT enhanced with low-level convolution, accurate semantic segmentation is achieved. We test our method on the public Cityscapes dataset. Numerous experiments show that our method achieves competitive results. Code is available at: https://github.com/cocolord/Degraded_image_segmentation.
UR - https://www.scopus.com/pages/publications/85143620336
U2 - 10.1109/ICPR56361.2022.9956731
DO - 10.1109/ICPR56361.2022.9956731
M3 - Conference contribution
AN - SCOPUS:85143620336
T3 - Proceedings - International Conference on Pattern Recognition
SP - 3858
EP - 3864
BT - 2022 26th International Conference on Pattern Recognition, ICPR 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 26th International Conference on Pattern Recognition, ICPR 2022
Y2 - 21 August 2022 through 25 August 2022
ER -