TY - GEN
T1 - Structured Instruction Parsing and Scene Alignment for UAV Vision-Language Navigation
AU - Zhou, Liangyu
AU - Xue, Rui
AU - Luo, Xiaoyan
N1 - Publisher Copyright: ©2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Recent advances in aerial Vision-and-Language Navigation (VLN) have introduced a more meaningful and practical paradigm of VLN by considering significantly longer paths and more complex spatial reasoning compared to ground-based VLN. However, the larger scale and increased complexity of outdoor environments in aerial VLN present substantial challenges in establishing accurate correspondence between textual instructions and visual scenes. In this work, we propose to incorporate Large Language Models (LLMs) to extract key components from navigation instructions and construct the corresponding subtasks. This structured instruction parsing module ensures the appropriate granularity of navigation instructions, enabling more precise alignment between language and visual cues. To further enhance the integration of multi-modal information and cross-modal understanding, we introduce a scene-based subtask alignment policy that effectively associates each parsed subtask with corresponding visual observations along the navigation path. Combined, the proposed approach significantly outperforms current state-of-the-art methods on the AerialVLN dataset.
AB - Recent advances in aerial Vision-and-Language Navigation (VLN) have introduced a more meaningful and practical paradigm of VLN by considering significantly longer paths and more complex spatial reasoning compared to ground-based VLN. However, the larger scale and increased complexity of outdoor environments in aerial VLN present substantial challenges in establishing accurate correspondence between textual instructions and visual scenes. In this work, we propose to incorporate Large Language Models (LLMs) to extract key components from navigation instructions and construct the corresponding subtasks. This structured instruction parsing module ensures the appropriate granularity of navigation instructions, enabling more precise alignment between language and visual cues. To further enhance the integration of multi-modal information and cross-modal understanding, we introduce a scene-based subtask alignment policy that effectively associates each parsed subtask with corresponding visual observations along the navigation path. Combined, the proposed approach significantly outperforms current state-of-the-art methods on the AerialVLN dataset.
KW - Cross-Modal Attention (CMA)
KW - Large Language Models (LLMs)
KW - Vision-and-Language Navigation (VLN)
UR - https://www.scopus.com/pages/publications/105028621671
U2 - 10.1109/ICIP55913.2025.11084696
DO - 10.1109/ICIP55913.2025.11084696
M3 - Conference contribution
AN - SCOPUS:105028621671
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 2600
EP - 2605
BT - 2025 IEEE International Conference on Image Processing, ICIP 2025 - Proceedings
PB - IEEE Computer Society
T2 - 32nd IEEE International Conference on Image Processing, ICIP 2025
Y2 - 14 September 2025 through 17 September 2025
ER -