Abstract
3D human pose estimation (3D HPE) is an important computer vision task with many practical applications. Multi-person 3D pose estimation from monocular video (3DMPPE), however, remains particularly challenging. Recent transformer-based approaches focus on capturing spatial-temporal information from sequences of 2D poses, which discards visual features relevant to 3D pose estimation. In this paper, we propose an end-to-end framework called Event-Guided Video Transformer (EVT), which predicts 3D poses directly from video frames by learning spatial-temporal contextual information from visual features. In addition, our design is the first to incorporate event features to guide 3D pose estimation. EVT first decouples the persons in the video frames into separate instance-aware feature maps. These features, which carry cues about body structure, are fed together with event features into an attention-based Event-Aware Embedding Module. The fused features for each instance are then passed to an intra-human relation extraction module and subsequently to a temporal transformer that extracts inter-frame relationships. Finally, the extracted features are fed into a decoder for 3D pose estimation. Experiments on three widely used 3D pose estimation benchmarks show that EVT outperforms state-of-the-art models.
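The abstract does not say how the attention-based Event-Aware Embedding Module combines the two feature streams. As a rough illustration only, the sketch below uses generic scaled dot-product cross-attention, with instance tokens as queries and event tokens as keys and values, followed by a residual sum. The function names (`cross_attention`, `fuse_event_features`), the residual connection, and the toy token values are all assumptions made for this example. They are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(feature dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output per feature channel.
        attended.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return attended


def fuse_event_features(instance_tokens, event_tokens):
    """Hypothetical fusion step: add event context to each instance token."""
    context = cross_attention(instance_tokens, event_tokens, event_tokens)
    return [
        [a + b for a, b in zip(token, ctx)]
        for token, ctx in zip(instance_tokens, context)
    ]


# Toy example: two instance tokens and three event tokens, feature dim 2.
instance_tokens = [[1.0, 0.0], [0.0, 1.0]]
event_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_event_features(instance_tokens, event_tokens)
```

In this sketch the fused output has one row per instance token, so a later stage (for example the intra-human relation extraction module named in the abstract) would receive features of the same shape as the instance features it started from.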
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 5114-5124 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798331510831 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
| Event | 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025 - Tucson, United States; Duration: 28 Feb 2025 → 4 Mar 2025 |
Publication series
| Name | Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025 |
|---|---|
Conference
| Conference | 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025 |
|---|---|
| Country/Territory | United States |
| City | Tucson |
| Period | 28/02/25 → 4/03/25 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Fingerprint
Dive into the research topics of 'Event-Guided Video Transformer for End-to-End 3D Human Pose Estimation'. Together they form a unique fingerprint.