TY - GEN
T1 - STEM-DETR
T2 - AOPC 2025: Optical Sensing, Imaging, Communications, Display, and Biomedical Optics
AU - Zhou, Yukai
AU - Li, Na
AU - Zhao, Huijie
AU - Wang, Haining
AU - Ou, Wen
N1 - Publisher Copyright: © 2025 SPIE.
PY - 2025/10/28
Y1 - 2025/10/28
AB - Object detection remains a key task in remote sensing image processing. For images that are both multimodal and sequential, fusing reciprocal information across modalities and extracting temporal information across frames remain difficult yet rewarding problems. Most current object detection networks cannot process images that are both multimodal and sequential. Detection results also suffer from the varying object sizes found in remote sensing images, the misalignment and redundancy between modalities, and the difficulty of preserving long-range temporal information. To tackle these problems, this research proposes a multimodal remote sensing object detection method based on improved spatial-temporal feature enhancement. The proposed model, called Spatial-Temporal Enhanced Multimodal DETR (STEM-DETR), supports object detection on RGB-T multimodal sequential images. We extend the typical end-to-end object detection pipeline of DETR with two new modules: the RGB-T mixed attention merging module and the global spatial-temporal enhancement module. The RGB-T mixed attention merging module facilitates feature-level fusion between modalities, while the global spatial-temporal enhancement module builds on the concept of object queries by selecting high-confidence queries in the temporal sequence and using them to enhance the others. To validate the effectiveness of our method, thorough ablation studies and comparison experiments were conducted. In these experiments, STEM-DETR achieved a maximum of 75.3 AP50 on our custom dataset, surpassing YOLOV++, SuperYOLO and TransVOD. These statistics are also supported by visualizations of the model's output. The results show that our method is both effective and adaptable.
KW - Image processing
KW - Multimodal fusion
KW - Object detection
KW - Spatial-temporal feature enhancement
KW - Transformer architecture
UR - https://www.scopus.com/pages/publications/105025949666
U2 - 10.1117/12.3082917
DO - 10.1117/12.3082917
M3 - Conference contribution
AN - SCOPUS:105025949666
T3 - Proceedings of SPIE - The International Society for Optical Engineering
BT - AOPC 2025
A2 - Jiang, Yadong
PB - SPIE
Y2 - 24 June 2025 through 27 June 2025
ER -