Sparse agent transformer for unified voxel and image feature extraction and fusion

  • Hong Zhang
  • Jiaxu Wan
  • Ziqi He
  • Jianbo Song
  • Yifan Yang
  • Ding Yuan*

*Corresponding author for this work

Beihang University

Research output: Contribution to journal, Review article, peer-review

Abstract

Current 3D multi-modal perception methods struggle to efficiently summarize and simplify information when extracting features from large volumes of sparse 3D data, which makes it difficult to balance accuracy and speed. In this paper, we propose a novel multi-modal transformer backbone, the Sparse Agent Transformer (SAT), which takes an agent-based approach to information abstraction and interaction. For extracting sparse features from a single modality, we propose a sparse agent attention mechanism that does not rely on conventional grouped token attention. This method first compresses features from the tokens into agents, then lets the agents interact with one another, and finally feeds the result back to the tokens. To speed up cross-modal fusion between voxels and images, we fuse the two modalities through agents rather than through tokens directly. Extensive experiments on the nuScenes dataset show that our model achieves state-of-the-art performance in 3D detection and bird's-eye-view (BEV) segmentation.
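The compress, interact, and feed-back pattern described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how agents are initialised or how the attention is parameterised, so the mean-pooling initialisation, the absence of learned projections, the residual connections, and every function name below (`init_agents`, `sparse_agent_attention`, `agent_cross_modal_fusion`) are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    """Plain scaled dot-product attention (no learned projections)."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    return softmax(queries @ keys.T * scale) @ values


def init_agents(tokens, num_agents):
    """Hypothetical agent initialisation: mean-pool contiguous token chunks.

    Assumes num_agents <= number of tokens so that no chunk is empty.
    """
    chunks = np.array_split(tokens, num_agents, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])


def sparse_agent_attention(tokens, num_agents):
    """Token -> agent compression, agent interaction, agent -> token feedback."""
    agents = init_agents(tokens, num_agents)
    agents = attend(agents, tokens, tokens)            # compress tokens into agents
    agents = attend(agents, agents, agents)            # agents interact
    updated = tokens + attend(tokens, agents, agents)  # feed back to tokens
    return updated, agents


def agent_cross_modal_fusion(voxel_tokens, image_tokens, num_agents):
    """Fuse modalities through their agents instead of through raw tokens."""
    _, voxel_agents = sparse_agent_attention(voxel_tokens, num_agents)
    _, image_agents = sparse_agent_attention(image_tokens, num_agents)
    # Voxel agents query image agents: cost scales with num_agents ** 2.
    fused = voxel_agents + attend(voxel_agents, image_agents, image_agents)
    # Broadcast the fused agents back to every voxel token.
    return voxel_tokens + attend(voxel_tokens, fused, fused)


rng = np.random.default_rng(0)
voxels = rng.normal(size=(500, 32))   # 500 sparse voxel tokens, 32 channels
pixels = rng.normal(size=(300, 32))   # 300 image tokens, 32 channels
updated_voxels, voxel_agents = sparse_agent_attention(voxels, num_agents=16)
fused_voxels = agent_cross_modal_fusion(voxels, pixels, num_agents=16)
```

In this sketch, with n tokens and m agents, every attention step costs on the order of n·m or m² operations rather than the n² of full token-to-token attention, which is the kind of saving an agent-based design targets when m is much smaller than n.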

Original language: English
Article number: 102455
Journal: Information Fusion
Volume: 110
State: Published - Oct 2024

Keywords

  • 3D feature extraction
  • Multi-modal perception
  • Sparse agent
  • Transformer
