Abstract
Despite significant advances in 3D hand pose estimation, the task remains challenging due to self-occlusion and complex backgrounds. To tackle these issues, we propose CLIP-Hand, a CLIP-based regressor for hand pose estimation and mesh recovery from a single RGB image. Specifically, we introduce a method that combines high-resolution feature aggregation with the Contrastive Language-Image Pre-training (CLIP) model to enhance feature representations through language-guided visual prompts. A multi-layer Transformer encoder-decoder module then improves the prediction accuracy of hand meshes and joint positions. To further boost performance, we introduce a predefined 3D joint module and a text dataset that augment the training data and improve the model's generalization across different scenarios. Extensive experiments on the FreiHAND, RHD, and Dexter+Object datasets demonstrate the effectiveness of our approach, which achieves improved accuracy and robustness compared with existing methods. The source code and data will be released once the paper is accepted.
| Original language | English |
|---|---|
| Article number | 43 |
| Journal | Visual Computer |
| Volume | 42 |
| Issue number | 1 |
| DOIs | |
| State | Published - Jan 2026 |
Keywords
- CLIP
- Hand pose
- Heatmap
- Human–computer interaction
- Mesh recovery