CLIP-Hand: CLIP-based regressor for hand pose estimation and mesh recovery

  • Feng Zhou
  • Shuang Ji
  • Pei Shen
  • Ju Dai*
  • Junjun Pan
  • Yu Kun Lai
  • Paul L. Rosin

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Despite significant advancements in 3D hand pose estimation, the task still faces challenges due to self-occlusion and complex backgrounds. To tackle these issues, we propose a CLIP-based Regressor for Hand Pose Estimation and Mesh Recovery (CLIP-Hand) from a single RGB image. Specifically, we propose an innovative method that combines high-resolution feature aggregation with the Contrastive Language-Image Pre-training (CLIP) model to enhance feature representations through language-guided visual prompts. Our approach utilizes a multi-layer Transformer encoder-decoder module to improve the prediction accuracy of hand meshes and joint positions. To further boost performance, a predefined 3D joint module and a text dataset are proposed to augment the training data and improve the model's generalization ability across different scenarios. Extensive experiments on the FreiHAND, RHD, and Dexter+Object datasets demonstrate the effectiveness of our approach, showing improved accuracy and robustness compared to existing methods. The source code and data will be released once the paper is accepted.
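The abstract's core idea of language-guided visual prompts can be sketched in miniature: an image feature is compared against CLIP-style text-prompt embeddings, the similarity-weighted text context is fused with the visual feature, and a regression head maps the fused vector to 21 hand joints. This is a minimal NumPy sketch under assumed shapes and names (`D`, `NUM_PROMPTS`, `language_guided_fusion` are all hypothetical), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64           # shared embedding dimension (assumed)
NUM_PROMPTS = 8  # number of text prompts describing hand states (assumed)
NUM_JOINTS = 21  # standard hand keypoint count

def l2norm(x, axis=-1):
    """Normalize vectors to unit length, as in CLIP's cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def language_guided_fusion(img_feat, text_feats, temperature=0.07):
    """Weight text embeddings by image-text cosine similarity, then fuse."""
    sim = l2norm(text_feats) @ l2norm(img_feat)   # (NUM_PROMPTS,)
    w = softmax(sim / temperature)                # attention over prompts
    text_ctx = w @ text_feats                     # (D,) weighted text context
    return np.concatenate([img_feat, text_ctx])   # fused feature, (2*D,)

# Toy tensors standing in for backbone / CLIP text-encoder outputs.
img_feat = rng.standard_normal(D)
text_feats = rng.standard_normal((NUM_PROMPTS, D))
W_head = rng.standard_normal((NUM_JOINTS * 3, 2 * D)) * 0.01  # linear head

fused = language_guided_fusion(img_feat, text_feats)
joints_3d = (W_head @ fused).reshape(NUM_JOINTS, 3)
print(joints_3d.shape)  # → (21, 3)
```

In the paper the fusion feeds a multi-layer Transformer encoder-decoder rather than a single linear head; the sketch only illustrates how language prompts can condition the visual feature before regression.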

Original language: English
Article number: 43
Journal: Visual Computer
Volume: 42
Issue number: 1
DOIs
State: Published - Jan 2026

Keywords

  • CLIP
  • Hand pose
  • Heatmap
  • Human–computer interaction
  • Mesh recovery
