
Target-Driven Structured Transformer Planner for Vision-Language Navigation

  • Yusheng Zhao
  • Jinyu Chen
  • Chen Gao
  • Wenguan Wang*
  • Lirong Yang
  • Haibing Ren
  • Huaxia Xia
  • Si Liu
  • *Corresponding author for this work
  • Beihang University
  • University of Technology Sydney
  • Meituan

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, which, however, has rarely been studied before in literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that our TD-STP substantially improves previous best methods' success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.

Original language: English
Title of host publication: MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 4194-4203
Number of pages: 10
ISBN (electronic): 9781450392037
DOI
Publication status: Published - 10 Oct 2022
Event: 30th ACM International Conference on Multimedia, MM 2022 - Lisboa, Portugal
Duration: 10 Oct 2022 to 14 Oct 2022

Publication series

Name: MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

Conference

Conference: 30th ACM International Conference on Multimedia, MM 2022
Country/Territory: Portugal
City: Lisboa
Period: 10/10/22 to 14/10/22
