Sparse agent transformer for unified voxel and image feature extraction and fusion

  • Beihang University

Research output: Contribution to journal, review article, peer-reviewed

Abstract

Current 3D multi-modal perception methods lack the ability to efficiently summarize and simplify information when extracting features from large volumes of sparse 3D data, which makes it difficult to balance accuracy and speed. In this paper, we propose a novel multi-modal transformer backbone named Sparse Agent Transformer (SAT), which takes an agent-based approach to information abstraction and interaction. For extracting sparse features from a single modality, we propose a sparse agent attention approach that does not rely on conventional grouped token attention. This method first compresses features from the tokens into agents, then lets the agents interact, and finally feeds the result back to the tokens. To speed up cross-modal fusion, we investigate agent-based fusion between voxels and images, in which the agents, rather than the tokens directly, carry the cross-modal interaction. Extensive experiments on the nuScenes dataset show that our model achieves state-of-the-art performance in 3D detection and bird's-eye-view (BEV) segmentation.
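
The compress, interact, feed-back pattern described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`attend`, `sparse_agent_attention`, `agent_cross_modal_fusion`), the use of plain single-head dot-product attention without learned projections, the way agents are supplied by the caller, and the residual addition in the fusion step are all assumptions made for illustration. The point it shows is only that routing attention through M agents costs on the order of N·M operations for N tokens, rather than N² for direct token-to-token attention.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax((queries @ keys.T) * scale, axis=-1)
    return weights @ values


def sparse_agent_attention(tokens, agents):
    """Hypothetical sketch of the three steps named in the abstract.

    tokens: (N, C) features of the non-empty voxels (or image patches)
    agents: (M, C) agent features, with M much smaller than N (assumed given)
    """
    # 1) Compress: each agent gathers information from all tokens.
    agents = attend(agents, tokens, tokens)
    # 2) Interact: agents exchange information among themselves.
    agents = attend(agents, agents, agents)
    # 3) Feed back: each token reads the updated agents.
    updated_tokens = attend(tokens, agents, agents)
    return updated_tokens, agents


def agent_cross_modal_fusion(voxel_tokens, voxel_agents, image_tokens, image_agents):
    """Hypothetical sketch: fuse modalities through agents, not raw tokens."""
    # Summarize each modality into its own small set of agents.
    voxel_agents = attend(voxel_agents, voxel_tokens, voxel_tokens)
    image_agents = attend(image_agents, image_tokens, image_tokens)
    # Cross-modal step happens between agents only (M_v x M_i scores).
    fused_agents = attend(voxel_agents, image_agents, image_agents)
    # Feed the fused summary back to the voxel tokens (residual is an assumption).
    return voxel_tokens + attend(voxel_tokens, fused_agents, fused_agents)
```

For example, with 2,000 voxel tokens and 16 agents of width 32, `sparse_agent_attention` builds attention maps of size 16×2000, 16×16, and 2000×16, whereas direct token attention would need a 2000×2000 map.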

Original language: English
Article number: 102455
Journal: Information Fusion
Volume: 110
DOI
Publication status: Published - Oct 2024
