TY - JOUR
T1 - VLM-MSGraph
T2 - Vision Language Model-enabled Multi-hierarchical Scene Graph for robotic assembly
AU - Li, Shufei
AU - Yan, Zhijie
AU - Wang, Zuoxu
AU - Gao, Yiping
N1 - Publisher Copyright: © 2025
PY - 2025/8
Y1 - 2025/8
N2 - Intelligent robotic assembly is becoming a pivotal component of the manufacturing sector, driven by growing demands for flexibility, sustainability, and resilience. Robots in manufacturing environments need perception, decision-making, and manipulation skills to support the flexible production of diverse products. However, traditional robotic assembly systems typically rely on time-consuming training processes specific to fixed settings, lacking generalization and zero-shot learning capabilities. To address these challenges, this paper introduces a Vision Language Model-enabled Multi-hierarchical Scene Graph (VLM-MSGraph) approach for robotic assembly, featuring generalized assembly sequence learning and 3D manipulation in open scenarios. The MSGraph incorporates high-level task planning structured as triplets, organized by multiple VLM agents. At a low level, the MSGraph retains 3D spatial relationships between industrial parts, enabling the robot to perform assembly tasks while accounting for object geometry for effective manipulation. Assembly drawings, physics simulations, and assembly tasks in a laboratory setting are used to evaluate the proposed system, advancing flexible automation in robotics.
AB - Intelligent robotic assembly is becoming a pivotal component of the manufacturing sector, driven by growing demands for flexibility, sustainability, and resilience. Robots in manufacturing environments need perception, decision-making, and manipulation skills to support the flexible production of diverse products. However, traditional robotic assembly systems typically rely on time-consuming training processes specific to fixed settings, lacking generalization and zero-shot learning capabilities. To address these challenges, this paper introduces a Vision Language Model-enabled Multi-hierarchical Scene Graph (VLM-MSGraph) approach for robotic assembly, featuring generalized assembly sequence learning and 3D manipulation in open scenarios. The MSGraph incorporates high-level task planning structured as triplets, organized by multiple VLM agents. At a low level, the MSGraph retains 3D spatial relationships between industrial parts, enabling the robot to perform assembly tasks while accounting for object geometry for effective manipulation. Assembly drawings, physics simulations, and assembly tasks in a laboratory setting are used to evaluate the proposed system, advancing flexible automation in robotics.
KW - Flexible automation
KW - Robotic assembly
KW - Scene graph
KW - Smart manufacturing
KW - Vision language model
UR - https://www.scopus.com/pages/publications/85217773222
U2 - 10.1016/j.rcim.2025.102978
DO - 10.1016/j.rcim.2025.102978
M3 - Article
AN - SCOPUS:85217773222
SN - 0736-5845
VL - 94
JO - Robotics and Computer-Integrated Manufacturing
JF - Robotics and Computer-Integrated Manufacturing
M1 - 102978
ER -