TY - JOUR
T1 - RSBEV-Mamba
T2 - 3-D BEV Sequence Modeling for Multiview Remote Sensing Scene Segmentation
AU - Lin, Baihong
AU - Zou, Zhengxia
AU - Shi, Zhenwei
N1 - Publisher Copyright:
© 2025 IEEE. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Multiview collaborative perception has been demonstrated to be highly effective in extracting 3-D information from remote sensing scenes via remote sensing bird’s-eye-view (RSBEV). However, inherent depth uncertainty in purely visual methods limits view fusion accuracy, and high computational complexity makes it challenging to model long sequences efficiently. To address these issues, we reformulate the BEV segmentation problem as a 3-D sequence modeling task and propose RSBEV-Mamba, a novel framework comprising a 3-D BEV module, a 3-D VMamba module, and a dense BEV contrastive learning module. The 3-D BEV module projects multiview 2-D image features into 3-D world coordinates, thus establishing a foundation for accurate spatial representation. The 3-D VMamba module, based on state-space models (SSMs), optimizes the processing of densely projected features with linear computational complexity in global 3-D spatial modeling. It incorporates a 3-D selective scanning strategy (SS3D) block with 16 scanning strategies, transforming previously ignored projections at different heights into valid 3-D sequences and enriching the contextual depth and precision of BEV encoding. By employing a contrastive learning strategy with the CLIP model, we align BEV and ground truth (GT) features within the same dimensional framework, ensuring spatial integrity after side-view projection. Our approach achieves a 4% improvement in mIoU, reaching a score of 0.7368 on LEVIR-MDS and surpassing previous state-of-the-art methods. This establishes the 3-D VMamba module as a general model for 3-D perception tasks and sets a new benchmark in remote sensing technology.
AB - Multiview collaborative perception has been demonstrated to be highly effective in extracting 3-D information from remote sensing scenes via remote sensing bird’s-eye-view (RSBEV). However, inherent depth uncertainty in purely visual methods limits view fusion accuracy, and high computational complexity makes it challenging to model long sequences efficiently. To address these issues, we reformulate the BEV segmentation problem as a 3-D sequence modeling task and propose RSBEV-Mamba, a novel framework comprising a 3-D BEV module, a 3-D VMamba module, and a dense BEV contrastive learning module. The 3-D BEV module projects multiview 2-D image features into 3-D world coordinates, thus establishing a foundation for accurate spatial representation. The 3-D VMamba module, based on state-space models (SSMs), optimizes the processing of densely projected features with linear computational complexity in global 3-D spatial modeling. It incorporates a 3-D selective scanning strategy (SS3D) block with 16 scanning strategies, transforming previously ignored projections at different heights into valid 3-D sequences and enriching the contextual depth and precision of BEV encoding. By employing a contrastive learning strategy with the CLIP model, we align BEV and ground truth (GT) features within the same dimensional framework, ensuring spatial integrity after side-view projection. Our approach achieves a 4% improvement in mIoU, reaching a score of 0.7368 on LEVIR-MDS and surpassing previous state-of-the-art methods. This establishes the 3-D VMamba module as a general model for 3-D perception tasks and sets a new benchmark in remote sensing technology.
KW - Bird’s-eye-view (BEV) representation
KW - multiview collaborative segmentation
KW - remote sensing
KW - semantic segmentation
UR - https://www.scopus.com/pages/publications/105001082604
U2 - 10.1109/TGRS.2025.3543200
DO - 10.1109/TGRS.2025.3543200
M3 - Article
AN - SCOPUS:105001082604
SN - 0196-2892
VL - 63
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5613213
ER -