TY - GEN
T1 - AerialVLA
T2 - 40th AAAI Conference on Artificial Intelligence, AAAI 2026
AU - Chen, Jinyu
AU - Li, Hongyu
AU - Tang, Zongheng
AU - Li, Xiaoduo
AU - Wu, Wenjun
AU - Liu, Si
N1 - Publisher Copyright:
© 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2026
Y1 - 2026
N2 - Visual Dialogue Navigation (VDN) aims to enable agents to reach target locations through dialogue with humans. The integration of VDN into Unmanned Aerial Vehicle (UAV) systems enhances human-machine interaction by enabling intuitive, hands-free operation, thereby unlocking vast applications. However, existing VDN models for UAVs can only perform navigation based on dialogue history, lacking proactive interaction capabilities to correct trajectories. Moreover, their sequential observation history recording mechanism struggles to accurately localize landmarks observed in the historical context, leading to ineffective utilization of referential information in new user instructions. To address these, we present AerialVLA, an end-to-end UAV navigation framework integrating dialogue comprehension, action decision-making, and navigational question generation. AerialVLA comprises three core components: i) we propose the Progress-Driven Navigation-Query Alternation mechanism to determine optimal questioning timing through navigation progress estimation autonomously. ii) To effectively model long-horizon history observation sequences, we develop the History Spatial-Temporal Fusion module that extracts discriminative spatial-temporal representations from historical observations. iii) Furthermore, to overcome data scarcity in training, we devise the Online Task-Driven Augmentation strategy that enhances learning through action-conditioned data augmentation. Experimental results demonstrate that AerialVLA achieves state-of-the-art navigation performance while exhibiting effective dialogue capabilities. Moreover, to better evaluate the agent’s proactive dialogue and navigation abilities, our evaluation benchmark, named UAV Navigation with Online Dialogue (UNOD), incorporates an online dialogue interaction module. The UNOD assesses UAV agents’ real-time questioning capabilities by leveraging an Air Commander Large Language Model to simulate human-UAV interactions during testing.
AB - Visual Dialogue Navigation (VDN) aims to enable agents to reach target locations through dialogue with humans. The integration of VDN into Unmanned Aerial Vehicle (UAV) systems enhances human-machine interaction by enabling intuitive, hands-free operation, thereby unlocking vast applications. However, existing VDN models for UAVs can only perform navigation based on dialogue history, lacking proactive interaction capabilities to correct trajectories. Moreover, their sequential observation history recording mechanism struggles to accurately localize landmarks observed in the historical context, leading to ineffective utilization of referential information in new user instructions. To address these, we present AerialVLA, an end-to-end UAV navigation framework integrating dialogue comprehension, action decision-making, and navigational question generation. AerialVLA comprises three core components: i) we propose the Progress-Driven Navigation-Query Alternation mechanism to determine optimal questioning timing through navigation progress estimation autonomously. ii) To effectively model long-horizon history observation sequences, we develop the History Spatial-Temporal Fusion module that extracts discriminative spatial-temporal representations from historical observations. iii) Furthermore, to overcome data scarcity in training, we devise the Online Task-Driven Augmentation strategy that enhances learning through action-conditioned data augmentation. Experimental results demonstrate that AerialVLA achieves state-of-the-art navigation performance while exhibiting effective dialogue capabilities. Moreover, to better evaluate the agent’s proactive dialogue and navigation abilities, our evaluation benchmark, named UAV Navigation with Online Dialogue (UNOD), incorporates an online dialogue interaction module. The UNOD assesses UAV agents’ real-time questioning capabilities by leveraging an Air Commander Large Language Model to simulate human-UAV interactions during testing.
UR - https://www.scopus.com/pages/publications/105035088022
U2 - 10.1609/aaai.v40i22.38878
DO - 10.1609/aaai.v40i22.38878
M3 - 会议稿件
AN - SCOPUS:105035088022
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 18161
EP - 18169
BT - Proceedings of the AAAI Conference on Artificial Intelligence
A2 - Koenig, Sven
A2 - Jenkins, Chad
A2 - Taylor, Matthew E.
PB - Association for the Advancement of Artificial Intelligence
Y2 - 20 January 2026 through 27 January 2026
ER -