TY - JOUR
T1 - A Hierarchical Vision-Language and Reinforcement Learning Framework for Robotic Task and Motion Planning in Collaborative Manipulation
AU - Zhang, Junnan
AU - Mu, Chaoxu
AU - Xu, Xin
AU - Ren, Lei
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2026
Y1 - 2026
N2 - Vision-language-action models (VLAs) use an end-to-end learning architecture that integrates visual perception, semantic understanding, and motion control. However, when tackling dynamic or long-horizon tasks, VLAs exhibit poor robustness and limited real-time adjustment ability against changes in target objects, instructions, and environments. To address these limitations, we propose VL-RL, a hierarchical framework consisting of a vision-language (VL) planner with strong VL information understanding and high-level task planning abilities, and a reinforcement learning (RL)-based low-level motion planner with enhanced flexibility and broader applicability. If the environmental state changes during task execution, the RL planner in VL-RL directly makes dynamic adjustments at the subtask level based on visual feedback to achieve the task goals, without requiring time-consuming information processing by the VL planner. Experiments demonstrate that VL-RL completes dual-robot collaborative manipulation tasks more efficiently and stably. Finally, our work is verified on dynamic grasping tasks and long-horizon complex tasks.
AB - Vision-language-action models (VLAs) use an end-to-end learning architecture that integrates visual perception, semantic understanding, and motion control. However, when tackling dynamic or long-horizon tasks, VLAs exhibit poor robustness and limited real-time adjustment ability against changes in target objects, instructions, and environments. To address these limitations, we propose VL-RL, a hierarchical framework consisting of a vision-language (VL) planner with strong VL information understanding and high-level task planning abilities, and a reinforcement learning (RL)-based low-level motion planner with enhanced flexibility and broader applicability. If the environmental state changes during task execution, the RL planner in VL-RL directly makes dynamic adjustments at the subtask level based on visual feedback to achieve the task goals, without requiring time-consuming information processing by the VL planner. Experiments demonstrate that VL-RL completes dual-robot collaborative manipulation tasks more efficiently and stably. Finally, our work is verified on dynamic grasping tasks and long-horizon complex tasks.
KW - Large language models
KW - multi-robot systems
KW - reinforcement learning
KW - task and motion planning
UR - https://www.scopus.com/pages/publications/105021118316
U2 - 10.1109/LRA.2025.3629984
DO - 10.1109/LRA.2025.3629984
M3 - Article
AN - SCOPUS:105021118316
SN - 2377-3766
VL - 11
SP - 65
EP - 72
JO - IEEE Robotics and Automation Letters
JF - IEEE Robotics and Automation Letters
IS - 1
ER -