Abstract
Current 3D multi-modal perception methods lack the ability to efficiently summarize and simplify information when extracting features from large volumes of sparse 3D data, which makes it difficult to balance accuracy and speed. In this paper, we propose a novel multi-modal transformer backbone, the Sparse Agent Transformer (SAT), which takes an agent-based approach to information abstraction and interaction. For extracting sparse features from a single modality, we propose a sparse agent attention mechanism that does not rely on conventional grouped-token attention. This method first compresses features from the tokens into agents, then lets the agents interact, and finally feeds the result back to the tokens. To speed up cross-modal fusion, we investigate agent-based fusion between voxels and images, which exchanges information through agents instead of operating on tokens directly. Extensive experiments on the nuScenes dataset show that our model achieves state-of-the-art performance in 3D detection and bird's-eye-view (BEV) segmentation.
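
The compress, interact, and feed-back pattern described in the abstract can be illustrated with a minimal NumPy sketch. This is not the SAT implementation: the function names, the chunk-mean agent initialization, the residual connections, and the absence of learned projections are all assumptions made for illustration, and the cross-modal function is only one plausible reading of "agent-based fusion".

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    """Plain scaled dot-product attention (single head, no projections)."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T * scale)
    return weights @ values


def init_agents(tokens, num_agents):
    """Hypothetical agent initialization: the mean of contiguous token chunks.

    Assumes num_agents <= len(tokens) so that no chunk is empty.
    """
    chunks = np.array_split(tokens, num_agents, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])


def sparse_agent_attention(tokens, num_agents):
    """Compress tokens into agents, let agents interact, feed back to tokens."""
    agents = init_agents(tokens, num_agents)
    agents = attend(agents, tokens, tokens)   # 1. token -> agent compression
    agents = attend(agents, agents, agents)   # 2. agent <-> agent interaction
    # 3. agent -> token feedback, added residually (an assumption)
    return tokens + attend(tokens, agents, agents)


def agent_cross_modal_fusion(voxel_tokens, image_tokens, num_agents):
    """Assumed fusion scheme: voxel agents read image agents, then update voxels.

    Both modalities are assumed to share the same feature dimension.
    """
    voxel_agents = attend(init_agents(voxel_tokens, num_agents),
                          voxel_tokens, voxel_tokens)
    image_agents = attend(init_agents(image_tokens, num_agents),
                          image_tokens, image_tokens)
    # Cross-modal exchange happens only between the two small agent sets.
    fused_agents = voxel_agents + attend(voxel_agents, image_agents, image_agents)
    return voxel_tokens + attend(voxel_tokens, fused_agents, fused_agents)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    voxels = rng.normal(size=(1000, 16))   # N sparse voxel tokens
    pixels = rng.normal(size=(400, 16))    # image tokens, same feature size
    print(sparse_agent_attention(voxels, num_agents=8).shape)
    print(agent_cross_modal_fusion(voxels, pixels, num_agents=8).shape)
```

With N tokens, M agents, and feature size d, every attention call in this sketch costs on the order of N·M·d operations or fewer, whereas full token-to-token attention costs on the order of N²·d. That gap is the kind of saving an agent bottleneck targets when M is much smaller than N.
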
| Original language | English |
|---|---|
| Article number | 102455 |
| Journal | Information Fusion |
| Volume | 110 |
| DOI | |
| Publication status | Published - Oct 2024 |
| Title | Sparse agent transformer for unified voxel and image feature extraction and fusion |