AerialVLA: A Vision-Language-Action Model for Aerial Navigation with Online Dialogue

  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Visual Dialogue Navigation (VDN) aims to enable agents to reach target locations through dialogue with humans. The integration of VDN into Unmanned Aerial Vehicle (UAV) systems enhances human-machine interaction by enabling intuitive, hands-free operation, thereby unlocking a wide range of applications. However, existing VDN models for UAVs can only navigate based on dialogue history and lack the proactive interaction capabilities needed to correct trajectories. Moreover, their sequential mechanism for recording observation history struggles to accurately localize landmarks observed earlier, leading to ineffective use of referential information in new user instructions. To address these limitations, we present AerialVLA, an end-to-end UAV navigation framework integrating dialogue comprehension, action decision-making, and navigational question generation. AerialVLA comprises three core components: i) the Progress-Driven Navigation-Query Alternation mechanism, which autonomously determines when to ask questions by estimating navigation progress; ii) the History Spatial-Temporal Fusion module, which models long-horizon observation histories by extracting discriminative spatial-temporal representations from historical observations; and iii) the Online Task-Driven Augmentation strategy, which addresses training-data scarcity through action-conditioned data augmentation. Experimental results demonstrate that AerialVLA achieves state-of-the-art navigation performance while exhibiting effective dialogue capabilities. Moreover, to better evaluate the agent’s proactive dialogue and navigation abilities, our evaluation benchmark, UAV Navigation with Online Dialogue (UNOD), incorporates an online dialogue interaction module. UNOD assesses UAV agents’ real-time questioning capabilities by using an Air Commander Large Language Model to simulate human-UAV interactions during testing.

Original language: English
Title of host publication: Proceedings of the AAAI Conference on Artificial Intelligence
Editors: Sven Koenig, Chad Jenkins, Matthew E. Taylor
Publisher: Association for the Advancement of Artificial Intelligence
Pages: 18161-18169
Number of pages: 9
Edition: 22
ISBN (Print): 9781577359067
DOI
Publication status: Published - 2026
Event: 40th AAAI Conference on Artificial Intelligence, AAAI 2026 - Singapore, Singapore
Duration: 20 Jan 2026 to 27 Jan 2026

Publication series

Name: Proceedings of the AAAI Conference on Artificial Intelligence
Number: 22
Volume: 40
ISSN (Print): 2159-5399
ISSN (Electronic): 2374-3468

Conference

Conference: 40th AAAI Conference on Artificial Intelligence, AAAI 2026
Country/Territory: Singapore
City: Singapore
Period: 20/01/26 to 27/01/26
