TY - GEN
T1 - Multimodal Monocular Dense Depth Estimation with Event-Frame Fusion Using Transformer
AU - Xiao, Baihui
AU - Xu, Jingzehua
AU - Zhang, Zekai
AU - Xing, Tianyu
AU - Wang, Jingjing
AU - Ren, Yong
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024
Y1 - 2024
N2 - Frame cameras struggle to estimate depth maps accurately under abnormal lighting conditions. In contrast, event cameras, with their high temporal resolution and high dynamic range, can capture sparse, asynchronous event streams that record pixel brightness changes, addressing the limitations of frame cameras. However, the potential of asynchronous events remains underexploited, which hinders the ability of event cameras to predict dense depth maps effectively. Integrating event streams with frame data can significantly enhance the monocular depth estimation accuracy, especially in complex scenarios. In this study, we introduce a novel depth estimation framework that combines event and frame data using a transformer-based model. Our proposed framework contains two primary components: a multimodal encoder and a joint decoder. The multimodal encoder employs self-attention mechanisms to analyze the interactions between frame patches and event tensors, mapping out dependencies across local and global spatiotemporal events. This multi-scale fusion approach maximizes the benefits of both event and frame inputs. The joint decoder incorporates a dual-phase, triple-scale feature fusion module, which extracts contextual information and delivers detailed depth prediction results. Our experimental results on the EventScape and MVSEC datasets affirm that our method sets a new benchmark in performance.
KW - Event Camera
KW - Frame Camera
KW - Monocular depth estimation
KW - Multi-modal Fusion
KW - Transformer self-attention
UR - https://www.scopus.com/pages/publications/85205867697
U2 - 10.1007/978-3-031-72335-3_29
DO - 10.1007/978-3-031-72335-3_29
M3 - Conference contribution
AN - SCOPUS:85205867697
SN - 9783031723346
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 419
EP - 433
BT - Artificial Neural Networks and Machine Learning – ICANN 2024 - 33rd International Conference on Artificial Neural Networks, Proceedings
A2 - Wand, Michael
A2 - Schmidhuber, Jürgen
A2 - Malinovská, Kristína
A2 - Tetko, Igor V.
PB - Springer Science and Business Media Deutschland GmbH
T2 - 33rd International Conference on Artificial Neural Networks, ICANN 2024
Y2 - 17 September 2024 through 20 September 2024
ER -