TY - GEN
T1 - Imitation Learning Based on Visual-text Fusion for Robotic Sorting Tasks
AU - Shi, Meiyan
AU - Dai, Shuling
AU - Zhao, Yongjia
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In this paper, we propose an imitation learning method based on visual-text fusion for manipulation tasks. The manipulation is abstracted into text instructions; the method learns the semantic concepts in these instructions and combines them with spatial features for visual inference, so that actions are predicted from text instructions. The construction process and demonstration content of the expert demonstration dataset are described in detail, with a focus on how the operation task is decomposed through text. In addition, we present the learning process and the network structure of the functional modules to highlight the fusion of text features with visual features. The effectiveness of the method is verified by a simulated learning experiment on a multi-step manipulation task. The results show that the behavioral strategy achieved a 92.19% task completion rate on known objects and 80.03% on unknown objects. These results indicate that introducing text allows the operational task to be decomposed in terms of abstract semantics, which reduces the difficulty of learning. Meanwhile, the behavioral strategy can perform accurate spatial location inference based on text features, thereby achieving accurate action prediction.
KW - imitation learning
KW - language grounding for robotics
KW - vision-based manipulation
KW - visual-text fusion
UR - https://www.scopus.com/pages/publications/85144623708
U2 - 10.1109/FAIML57028.2022.00038
DO - 10.1109/FAIML57028.2022.00038
M3 - Conference contribution
AN - SCOPUS:85144623708
T3 - Proceedings - 2022 International Conference on Frontiers of Artificial Intelligence and Machine Learning, FAIML 2022
SP - 157
EP - 163
BT - Proceedings - 2022 International Conference on Frontiers of Artificial Intelligence and Machine Learning, FAIML 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 International Conference on Frontiers of Artificial Intelligence and Machine Learning, FAIML 2022
Y2 - 19 July 2022 through 21 July 2022
ER -