Multimodal Monocular Dense Depth Estimation with Event-Frame Fusion Using Transformer

  • Baihui Xiao
  • Jingzehua Xu
  • Zekai Zhang
  • Tianyu Xing
  • Jingjing Wang*
  • Yong Ren

*Corresponding author for this work

Tsinghua University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Frame cameras struggle to estimate depth maps accurately under abnormal lighting conditions. In contrast, event cameras, with their high temporal resolution and high dynamic range, can capture sparse, asynchronous event streams that record pixel brightness changes, addressing the limitations of frame cameras. However, the potential of asynchronous events remains underexploited, which hinders the ability of event cameras to predict dense depth maps effectively. Integrating event streams with frame data can significantly enhance monocular depth estimation accuracy, especially in complex scenarios. In this study, we introduce a novel depth estimation framework that combines event and frame data using a transformer-based model. Our proposed framework contains two primary components: a multimodal encoder and a joint decoder. The multimodal encoder employs self-attention mechanisms to analyze the interactions between frame patches and event tensors, mapping out dependencies across local and global spatiotemporal events. This multi-scale fusion approach maximizes the benefits of both event and frame inputs. The joint decoder incorporates a dual-phase, triple-scale feature fusion module, which extracts contextual information and delivers detailed depth prediction results. Our experimental results on the EventScape and MVSEC datasets affirm that our method sets a new benchmark in performance.
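
The fusion idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the voxel-grid event representation, the patch size, the single attention head, and the random projection matrices (standing in for learned weights) are all assumptions made for illustration, and every function name below is hypothetical. The sketch bins an event stream into a tensor, splits the frame and the event tensor into patch tokens, and runs one self-attention pass over the concatenated tokens so that each frame patch can attend to event patches and vice versa.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate (t, x, y, polarity) events into a (num_bins, H, W) tensor.

    Assumed representation: polarities are summed per temporal bin and pixel.
    """
    grid = np.zeros((num_bins, height, width))
    if len(events) == 0:
        return grid
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3]
    # Scale timestamps to [0, num_bins] and floor; clamp the last event into the final bin.
    span = max(t.max() - t.min(), 1e-9)
    b = np.minimum(((t - t.min()) / span * num_bins).astype(int), num_bins - 1)
    np.add.at(grid, (b, y, x), p)  # unbuffered add handles repeated indices
    return grid

def patchify(tensor, patch):
    """Split a (C, H, W) tensor into (N, C*patch*patch) non-overlapping patch tokens.

    H and W must be divisible by `patch`.
    """
    c, h, w = tensor.shape
    t = tensor.reshape(c, h // patch, patch, w // patch, patch)
    return t.transpose(1, 3, 0, 2, 4).reshape(-1, c * patch * patch)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (N, D) token matrix."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def fuse_frame_and_events(frame, voxels, patch=4, dim=16, seed=0):
    """Embed both modalities to a shared width and attend over all tokens jointly.

    The random matrices are placeholders for learned parameters (assumption).
    """
    rng = np.random.default_rng(seed)
    f_tok = patchify(frame, patch)
    e_tok = patchify(voxels, patch)
    f_emb = f_tok @ rng.normal(scale=0.1, size=(f_tok.shape[1], dim))
    e_emb = e_tok @ rng.normal(scale=0.1, size=(e_tok.shape[1], dim))
    tokens = np.concatenate([f_emb, e_emb], axis=0)  # frame tokens first, then event tokens
    w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
    return self_attention(tokens, w_q, w_k, w_v)

# Toy inputs: an 8x8 RGB frame and three events (t, x, y, polarity).
frame = np.random.default_rng(1).random((3, 8, 8))
events = np.array([[0.0, 1, 2, 1.0], [0.5, 3, 4, -1.0], [1.0, 7, 7, 1.0]])
voxels = events_to_voxel_grid(events, num_bins=5, height=8, width=8)
fused, attn = fuse_frame_and_events(frame, voxels)
```

With these toy inputs each modality yields four patch tokens, so `attn` is an 8×8 matrix whose off-diagonal blocks hold the frame-to-event and event-to-frame attention weights. A multi-scale variant, as the abstract describes, would repeat this at several patch sizes or feature resolutions.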

Original language: English
Title of host publication: Artificial Neural Networks and Machine Learning – ICANN 2024 - 33rd International Conference on Artificial Neural Networks, Proceedings
Editors: Michael Wand, Kristína Malinovská, Jürgen Schmidhuber, Igor V. Tetko
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 419-433
Number of pages: 15
ISBN (Print): 9783031723346
State: Published - 2024
Event: 33rd International Conference on Artificial Neural Networks, ICANN 2024 - Lugano, Switzerland
Duration: 17 Sep 2024 to 20 Sep 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15017 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 33rd International Conference on Artificial Neural Networks, ICANN 2024
Country/Territory: Switzerland
City: Lugano
Period: 17/09/24 to 20/09/24

Keywords

  • Event Camera
  • Frame Camera
  • Monocular depth estimation
  • Multi-modal Fusion
  • Transformer self-attention
