UniAlign: A Universal Cross-Modality Knowledge Alignment Framework for Fine-Grained Action Recognition

  • Yihan Wang
  • Baoli Sun
  • Haojie Li*
  • Xinzhu Ma
  • Zhihui Wang
  • Zhiyong Wang
  • *Corresponding author for this work
  • Dalian University of Technology
  • Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province
  • Shandong Science and Technology
  • The University of Sydney

Research output: Contribution to journal, article, peer-reviewed

Abstract

The key to fine-grained video action recognition is identifying subtle differences between action categories. Relying solely on visual features supervised by action labels makes it challenging to characterize robust and discriminative action dynamics from videos. With significant advancements in human pose estimation and the powerful capabilities of Vision-Language Models (VLMs), obtaining reliable and cost-free human pose data and textual semantics has become increasingly feasible, enabling their effective use in fine-grained action recognition. However, the inherent disparities in feature representations across different modalities necessitate a robust alignment strategy to achieve optimal fusion. To address this, we propose a universal cross-modality knowledge alignment framework, namely UniAlign, to transfer the knowledge from such pre-trained multi-modal models into action recognition models. Specifically, UniAlign introduces two additional branches to extract pose features and textual semantics with the pre-trained pose encoder and VLM. To align the action-relevant cues among video features, pose features, and textual semantics, we propose a Cross-Modality Similarity Aggregation module (CMSA) that utilizes the importance of different modal cues while aggregating cross-modal similarities. Additionally, we adopt a fine-tuning mechanism similar to Exponential Moving Average (EMA) to refine the textual semantics, ensuring that the semantic representations encoded by VLMs are preserved while being optimized towards the specific task preferences. Extensive experiments on widely used fine-grained action recognition benchmarks (e.g., FineGym, NTURGB-D, Diving48) and coarse-grained K400 dataset demonstrate the effectiveness of the proposed UniAlign method.
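The abstract mentions a fine-tuning mechanism "similar to Exponential Moving Average (EMA)" for refining textual semantics. As a rough illustration only, the following sketches a generic EMA blend of two feature vectors; it is not the authors' implementation. The function name `ema_update`, the momentum value, and the toy vectors are all assumptions made for this example.

```python
def ema_update(anchor, tuned, momentum=0.99):
    """Generic EMA blend (illustrative, not the paper's exact rule).

    anchor:   features to preserve, e.g. text features encoded by a frozen VLM
    tuned:    features produced by a task-specific, trainable copy
    momentum: weight kept on the anchor; values near 1 change it slowly
    """
    if len(anchor) != len(tuned):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    # Element-wise convex combination of the two vectors.
    return [momentum * a + (1.0 - momentum) * t for a, t in zip(anchor, tuned)]


# Toy vectors (hypothetical values, chosen only to show the blend).
vlm_text_feature = [1.0, 0.0, 0.5]
task_tuned_feature = [0.0, 1.0, 0.5]
refined = ema_update(vlm_text_feature, task_tuned_feature, momentum=0.9)
```

With a momentum close to 1, the refined vector stays near the original VLM encoding and drifts only gradually toward the task-tuned one, which is the general trade-off the abstract describes.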

Original language: English
Pages (from-to): 891-901
Number of pages: 11
Journal: IEEE Transactions on Multimedia
Volume: 28
DOI
Publication status: Published - 2026
