
Event-Guided Video Transformer for End-to-End 3D Human Pose Estimation

  • Lehigh University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

3D human pose estimation (3D HPE) is an important computer vision task with various practical applications. However, multi-person 3D pose estimation from a monocular video (3DMPPE) is particularly challenging. Recent transformer-based approaches focus on capturing spatial-temporal information from sequences of 2D poses, which unfortunately discards the visual features relevant to 3D pose estimation. In this paper, we propose an end-to-end framework called Event-Guided Video Transformer (EVT), which predicts 3D poses directly from video frames by effectively learning spatial-temporal contextual information from visual features. In addition, our design is the first to incorporate event features to help guide 3D pose estimation. EVT first decouples the persons in the video frames into separate instance-aware feature maps. These features, which contain specific cues about body structure, are then fed together with event features into an attention-based Event-Aware Embedding Module. Next, the fused features for each instance are fed into an intra-human relation extraction module and subsequently into a temporal transformer that extracts inter-frame relationships. Finally, the extracted features are fed into a decoder for 3D pose estimation. Experiments on three widely used 3D pose estimation benchmarks show that the proposed EVT achieves better performance than state-of-the-art models.
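The abstract does not specify how the Event-Aware Embedding Module is built beyond calling it attention-based. As a rough illustration of what attention-based fusion can look like, the sketch below uses plain scaled dot-product cross-attention. It assumes that per-instance feature tokens act as queries, that event feature tokens act as keys and values, and that the result is added back residually. The function names, the token shapes, and the absence of learned projections are all assumptions made for illustration and may differ from the authors' design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(instance_tokens, event_tokens):
    """Fuse instance tokens with event tokens (hypothetical helper).

    Assumption: queries come from the instance features, keys and values
    from the event features, with a residual connection. No learned
    projection matrices are used in this toy version.
    """
    dim = len(instance_tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for q in instance_tokens:
        # Similarity of this instance token to every event token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in event_tokens]
        weights = softmax(scores)
        # Attention-weighted average of the event tokens.
        attended = [sum(w * v[d] for w, v in zip(weights, event_tokens))
                    for d in range(dim)]
        # Residual connection keeps the original instance information.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# Toy inputs: 2 instance tokens and 3 event tokens, each of dimension 4.
instance_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
event_tokens = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]]
fused = cross_attention_fuse(instance_tokens, event_tokens)
```

In a full pipeline of the kind the abstract describes, fused tokens like these would then pass through the intra-human relation module, the temporal transformer, and the pose decoder, none of which are modeled here.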

Original language: English
Title of host publication: Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5114-5124
Number of pages: 11
ISBN (Electronic): 9798331510831
State: Published - 2025
Externally published: Yes
Event: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025 - Tucson, United States
Duration: 28 Feb 2025 → 4 Mar 2025

Publication series

Name: Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025

Conference

Conference: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025
Country/Territory: United States
City: Tucson
Period: 28/02/25 → 4/03/25

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being
