TY - JOUR
T1 - UniAlign
T2 - A Universal Cross-Modality Knowledge Alignment Framework for Fine-Grained Action Recognition
AU - Wang, Yihan
AU - Sun, Baoli
AU - Li, Haojie
AU - Ma, Xinzhu
AU - Wang, Zhihui
AU - Wang, Zhiyong
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2026
Y1 - 2026
AB - The key to fine-grained video action recognition is identifying subtle differences between action categories. Visual features supervised only by action labels struggle to capture robust and discriminative action dynamics from videos. With significant advances in human pose estimation and the capabilities of Vision-Language Models (VLMs), obtaining reliable, cost-free human pose data and textual semantics has become increasingly feasible, which makes them practical to use in fine-grained action recognition. However, the inherent disparities in feature representations across modalities call for a robust alignment strategy before the modalities can be fused effectively. To address this, we propose UniAlign, a universal cross-modality knowledge alignment framework that transfers knowledge from such pre-trained multi-modal models into action recognition models. Specifically, UniAlign introduces two additional branches that extract pose features and textual semantics with a pre-trained pose encoder and a VLM, respectively. To align the action-relevant cues among video features, pose features, and textual semantics, we propose a Cross-Modality Similarity Aggregation (CMSA) module that accounts for the importance of the different modal cues while aggregating cross-modal similarities. Additionally, we adopt a fine-tuning mechanism similar to Exponential Moving Average (EMA) to refine the textual semantics, so that the semantic representations encoded by the VLM are preserved while being adapted to the target task. Extensive experiments on widely used fine-grained action recognition benchmarks (e.g., FineGym, NTU RGB+D, Diving48) and the coarse-grained Kinetics-400 (K400) dataset demonstrate the effectiveness of UniAlign.
KW - Fine-grained action recognition
KW - cross-modality knowledge alignment
KW - exponential moving average
UR - https://www.scopus.com/pages/publications/105021885320
U2 - 10.1109/TMM.2025.3632670
DO - 10.1109/TMM.2025.3632670
M3 - Article
AN - SCOPUS:105021885320
SN - 1520-9210
VL - 28
SP - 891
EP - 901
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -