TY - GEN
T1 - Sharing Attention Mechanism in V-SLAM
T2 - 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024
AU - Dai, Dun
AU - Quan, Quan
AU - Cai, Kai Yuan
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - In V-SLAM, estimating the relative camera pose is crucial for determining the spatial relationship between consecutive camera images, which helps to accurately track the movement of the camera in its environment. In small indoor scenes the training set is often limited, which is very common in robot SLAM, and learning-based methods may then fail to converge. This is especially true of the Transformer architecture, which requires a more substantial dataset to match the performance of CNN-based models. This work addresses the problem with the sharing attention mechanism, building on recent improvements in training visual Transformer architectures on small datasets while incorporating messenger tokens. In addition, double-embedding is introduced to capture the spatial information of images and the order of images. In summary, we introduce an intuitive end-to-end relative pose estimation solution and demonstrate its accuracy on the two smallest sub-datasets of 7Scenes. The proposed method is evaluated in a set of comparison experiments against CNN-based and Transformer-based end-to-end relative pose estimation models, as well as a robust non-learning feature-matching method. Our model outperforms the baselines in all comparisons. Furthermore, ablation studies illustrate that these innovations are crucial for the accuracy of relative pose estimation on small datasets.
UR - https://www.scopus.com/pages/publications/85216495238
U2 - 10.1109/IROS58592.2024.10801926
DO - 10.1109/IROS58592.2024.10801926
M3 - Conference contribution
AN - SCOPUS:85216495238
T3 - IEEE International Conference on Intelligent Robots and Systems
SP - 7878
EP - 7884
BT - 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 October 2024 through 18 October 2024
ER -