TY - GEN
T1 - Target-Driven Structured Transformer Planner for Vision-Language Navigation
AU - Zhao, Yusheng
AU - Chen, Jinyu
AU - Gao, Chen
AU - Wang, Wenguan
AU - Yang, Lirong
AU - Ren, Haibing
AU - Xia, Huaxia
AU - Liu, Si
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/10/10
Y1 - 2022/10/10
N2 - Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, which, however, has rarely been studied before in literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that our TD-STP substantially improves previous best methods' success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.
AB - Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, which, however, has rarely been studied before in literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that our TD-STP substantially improves previous best methods' success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.
KW - imaginary scene tokenization
KW - structured transformer
KW - target-driven planner
KW - vision-language navigation
UR - https://www.scopus.com/pages/publications/85144332543
U2 - 10.1145/3503161.3548281
DO - 10.1145/3503161.3548281
M3 - 会议稿件
AN - SCOPUS:85144332543
T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
SP - 4194
EP - 4203
BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 30th ACM International Conference on Multimedia, MM 2022
Y2 - 10 October 2022 through 14 October 2022
ER -