TY - JOUR
T1 - MASGC
T2 - Hybrid attention and synchronous graph learning for monocular 3D pose estimation
AU - Li, Shengjie
AU - Wang, Jin
AU - Niu, Jianwei
AU - Wang, Yuanhang
AU - Zhang, Haiyun
AU - Lu, Guodong
AU - Yang, Jingru
AU - Yu, Xiaolong
AU - Hou, Renluan
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2026/4
Y1 - 2026/4
N2 - Occlusion and depth ambiguity pose significant challenges to the accuracy of monocular 3D human pose estimation. To tackle these issues, this study presents a two-stage pose estimation method based on Multi-Attention and Synchronous-Graph-Convolution (MASGC). In the first stage (2D pose estimation), a feature pyramid convolutional attention (FPCA) module is designed based on a multiresolution feature pyramid (MFP) and a convolutional attention triplet (CAT), which integrates channel, coordinate, and spatial attention, enabling the model to focus on the most salient features and mitigate location information loss caused by global pooling, thereby improving estimation accuracy. In the second stage (lifting to 3D), a temporal synchronous graph convolutional network (TSGCN) is designed. By incorporating multi-head attention and expanding the receptive field of end keypoints through topological temporal convolutions, TSGCN effectively addresses the challenges of occlusion and depth ambiguity in monocular 3D human pose estimation. Experimental results show that MASGC outperforms the compared baseline methods on benchmark datasets, including Human3.6M and a custom dual-arm dataset, while reducing computational complexity compared to mainstream models. The code is available at https://github.com/JasonLi-30/MASGC.
AB - Occlusion and depth ambiguity pose significant challenges to the accuracy of monocular 3D human pose estimation. To tackle these issues, this study presents a two-stage pose estimation method based on Multi-Attention and Synchronous-Graph-Convolution (MASGC). In the first stage (2D pose estimation), a feature pyramid convolutional attention (FPCA) module is designed based on a multiresolution feature pyramid (MFP) and a convolutional attention triplet (CAT), which integrates channel, coordinate, and spatial attention, enabling the model to focus on the most salient features and mitigate location information loss caused by global pooling, thereby improving estimation accuracy. In the second stage (lifting to 3D), a temporal synchronous graph convolutional network (TSGCN) is designed. By incorporating multi-head attention and expanding the receptive field of end keypoints through topological temporal convolutions, TSGCN effectively addresses the challenges of occlusion and depth ambiguity in monocular 3D human pose estimation. Experimental results show that MASGC outperforms the compared baseline methods on benchmark datasets, including Human3.6M and a custom dual-arm dataset, while reducing computational complexity compared to mainstream models. The code is available at https://github.com/JasonLi-30/MASGC.
KW - 3D pose estimation
KW - Graph convolution network
KW - Monocular RGB vision
KW - Multiresolution feature pyramid
KW - Topological temporal convolution
UR - https://www.scopus.com/pages/publications/105022257096
U2 - 10.1016/j.displa.2025.103284
DO - 10.1016/j.displa.2025.103284
M3 - Article
AN - SCOPUS:105022257096
SN - 0141-9382
VL - 92
JO - Displays
JF - Displays
M1 - 103284
ER -