UniAlign: A Universal Cross-Modality Knowledge Alignment Framework for Fine-Grained Action Recognition

  • Yihan Wang
  • Baoli Sun
  • Haojie Li*
  • Xinzhu Ma
  • Zhihui Wang
  • Zhiyong Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The key to fine-grained video action recognition is identifying subtle differences between action categories. Relying solely on visual features supervised by action labels makes it challenging to characterize robust and discriminative action dynamics from videos. With significant advancements in human pose estimation and the powerful capabilities of Vision-Language Models (VLMs), obtaining reliable and cost-free human pose data and textual semantics has become increasingly feasible, enabling their effective use in fine-grained action recognition. However, the inherent disparities in feature representations across different modalities necessitate a robust alignment strategy to achieve optimal fusion. To address this, we propose a universal cross-modality knowledge alignment framework, namely UniAlign, to transfer the knowledge from such pre-trained multi-modal models into action recognition models. Specifically, UniAlign introduces two additional branches to extract pose features and textual semantics with a pre-trained pose encoder and a VLM. To align the action-relevant cues among video features, pose features, and textual semantics, we propose a Cross-Modality Similarity Aggregation (CMSA) module that weighs the importance of different modal cues when aggregating cross-modal similarities. Additionally, we adopt a fine-tuning mechanism similar to Exponential Moving Average (EMA) to refine the textual semantics, ensuring that the semantic representations encoded by VLMs are preserved while being optimized toward task-specific preferences. Extensive experiments on widely used fine-grained action recognition benchmarks (e.g., FineGym, NTU RGB+D, Diving48) and the coarse-grained Kinetics-400 (K400) dataset demonstrate the effectiveness of the proposed UniAlign method.
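The two mechanisms named in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the function names, the choice of pairwise cosine similarity, and the softmax importance weights are assumptions about how a CMSA-style aggregation and an EMA-style refinement of text embeddings might look.

```python
import numpy as np

def cmsa(video, pose, text, weights):
    # Hypothetical sketch of Cross-Modality Similarity Aggregation:
    # L2-normalize each modality's features, compute pairwise cosine
    # similarities, and aggregate them with softmax importance weights.
    feats = [f / np.linalg.norm(f, axis=-1, keepdims=True)
             for f in (video, pose, text)]
    pairs = [(0, 1), (0, 2), (1, 2)]  # video-pose, video-text, pose-text
    sims = np.stack([(feats[a] * feats[b]).sum(-1) for a, b in pairs], axis=-1)
    w = np.exp(weights) / np.exp(weights).sum()  # importance of each pair
    return sims @ w  # aggregated cross-modal similarity per sample

def ema_refine(text_bank, new_text, momentum=0.999):
    # EMA-style refinement: retain most of the pre-trained VLM semantics
    # while nudging the text embeddings toward the task-specific update.
    return momentum * text_bank + (1.0 - momentum) * new_text
```

With a high momentum (e.g., 0.999), the refined text embeddings drift only slowly from the VLM's original semantics, which matches the abstract's goal of preserving pre-trained representations while adapting them to the recognition task.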

Original language: English
Pages (from-to): 891-901
Number of pages: 11
Journal: IEEE Transactions on Multimedia
Volume: 28
State: Published - 2026

Keywords

  • Fine-grained action recognition
  • cross-modality knowledge alignment
  • exponential moving average

