Temporal Group Constrained Transformer with Deformable Landmark Attention for Video Dimensional Emotion Recognition

  • Weixin Li
  • Xiangjing Meng
  • Linmei Hu
  • Xuan Dong*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Video dimensional emotion recognition aims to map human affect into the dimensional emotion space based on visual signals. Recent works have shown that it is beneficial to locate key facial regions related to human emotion perception and to establish long-term temporal dependencies. While preliminary attempts have been made, there remains considerable room for improvement. In this paper, to better exploit key facial regions, we propose the Temporal cue guided Deformable Landmark Spatial (TDLS) transformer, which attends to key facial regions in a data-dependent manner. We also propose temporal cue guided frame representation learning, which learns the spatial representation of each frame by jointly considering the features of other frames. To better model temporal dependencies, we propose the Multi-layer Group Constrained Temporal (MGCT) transformer, which summarizes frame features into multi-layer groups, performs group-to-group communication, and lets group-level features guide frame-level emotion recognition. We also introduce cross-clip representation learning to generate consistent results across different clips and videos. Extensive experiments on two benchmark datasets show that our method achieves superior results compared to state-of-the-art approaches.

Original language: English
Journal: IEEE Transactions on Affective Computing
State: Accepted/In press - 2025

Keywords

  • Deformable landmark attention
  • Dimensional emotion recognition
  • Group constrained temporal transformer

