TY - JOUR
T1 - Fast-BEV
T2 - A Fast and Strong Bird's-Eye View Perception Baseline
AU - Li, Yangguang
AU - Huang, Bin
AU - Chen, Zeren
AU - Cui, Yufeng
AU - Liang, Feng
AU - Shen, Mingzhu
AU - Liu, Fenggang
AU - Xie, Enze
AU - Sheng, Lu
AU - Ouyang, Wanli
AU - Shao, Jing
N1 - Publisher Copyright: © 1979-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Perception based on the Bird's-Eye View (BEV) representation has recently drawn increasing attention, and the BEV representation is a promising foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources for on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing faster BEV perception on on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive transformer-based transformation or depth representation. Fast-BEV consists of five parts. We propose (1) a lightweight, deployment-friendly view transformation that quickly transfers 2D image features to 3D voxel space, (2) a multi-scale image encoder that leverages multi-scale information for better performance, and (3) an efficient BEV encoder designed specifically to speed up on-vehicle inference. We further introduce (4) a strong data augmentation strategy for both image and BEV space to avoid over-fitting and (5) a multi-frame feature fusion mechanism to leverage temporal information. Among them, (1) and (3) make Fast-BEV fast at inference and deployment-friendly on on-vehicle chips, while (2), (4), and (5) ensure that Fast-BEV has competitive performance. Together, these make Fast-BEV a solution with high performance, fast inference speed, and deployment friendliness on the on-vehicle chips of autonomous driving. In experiments on a 2080Ti platform, our R50 model runs at 52.6 FPS with 47.3% NDS on the nuScenes validation set, compared with 41.3 FPS and 47.5% NDS for the BEVDepth-R50 model (Li et al. 2022) and 30.2 FPS and 45.7% NDS for the BEVDet4D-R50 model (J. Huang and G. Huang, 2022). Our largest model (R101@900×1600) establishes a competitive 53.5% NDS on the nuScenes validation set. We further develop a benchmark with considerable accuracy and efficiency on current popular on-vehicle chips.
KW - autonomous driving
KW - bird's-eye view (BEV) representation
KW - multi-camera
UR - https://www.scopus.com/pages/publications/85196054258
U2 - 10.1109/TPAMI.2024.3414835
DO - 10.1109/TPAMI.2024.3414835
M3 - Article
C2 - 38875097
AN - SCOPUS:85196054258
SN - 0162-8828
VL - 46
SP - 8665
EP - 8679
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 12
ER -