Multi-receptive field spatiotemporal network for action recognition

  • Mu Nie
  • Sen Yang
  • Zhenhua Wang
  • Baochang Zhang
  • Huimin Lu
  • Wankou Yang (corresponding author)

Research output: Contribution to journal › Article › peer-review

Abstract

Despite the great progress that deep neural networks have made in action recognition, visual tempo is often overlooked in the feature learning process of existing methods. Visual tempo refers to the dynamic and temporal scale variation of actions. Existing models typically understand spatiotemporal scenes using temporal and spatial convolutions whose receptive fields are limited in both dimensions, so they cannot cope with variations in visual tempo. To address these issues, we propose a multi-receptive field spatiotemporal (MRF-ST) network that effectively models spatial and temporal information across different receptive fields. In the proposed network, dilated convolution is used to obtain different receptive fields, and dynamic weighting over the dilation rates is designed based on an attention mechanism. Thus, the MRF-ST network can directly capture various tempos within the same network layer at no additional cost, and it improves recognition accuracy by learning the varied visual tempos of different actions. Extensive evaluations show that MRF-ST reaches the state of the art on three popular action recognition benchmarks: UCF-101, HMDB-51, and Diving-48. Further analysis indicates that MRF-ST significantly improves performance in scenes with large variance in visual tempo.
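The core idea of the abstract, parallel dilated-convolution branches whose outputs are fused by attention weights over the dilation rates, can be sketched in simplified form. This is a minimal 1D illustration in NumPy, not the paper's actual architecture (which operates on spatiotemporal video features); the kernel shapes, the pooling-plus-projection used to score each dilation rate, and all variable names are assumptions made for the sketch.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Temporal dilated convolution with 'same' zero padding.
    x: (T, C) feature sequence; w: (K, C) kernel; returns (T,)."""
    K, C = w.shape
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T = x.shape[0]
    out = np.zeros(T)
    for t in range(T):
        for k in range(K):
            # sample input at strides of `dilation`: larger rate = wider receptive field
            out[t] += np.dot(xp[t + k * dilation], w[k])
    return out

rng = np.random.default_rng(0)
T, C, K = 16, 8, 3
x = rng.standard_normal((T, C))

# One branch per dilation rate: same layer, different temporal receptive fields
dilations = (1, 2, 4)
kernels = [rng.standard_normal((K, C)) * 0.1 for _ in dilations]
branches = np.stack([dilated_conv1d(x, w, d) for w, d in zip(kernels, dilations)])

# Attention over dilation rates: softmax of scores computed from the globally
# pooled input (a hypothetical stand-in for the paper's weighting module)
proj = rng.standard_normal((C, len(dilations))) * 0.1
scores = x.mean(axis=0) @ proj
attn = np.exp(scores) / np.exp(scores).sum()

# Fused response: tempo-adaptive mixture of the multi-receptive-field branches
fused = (attn[:, None] * branches).sum(axis=0)
```

Because the attention weights are input-dependent, a fast action can emphasize the small-dilation branch while a slow one leans on the large-dilation branch, which is the mechanism the abstract describes for handling varied visual tempo.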

Original language: English
Pages (from-to): 2439-2453
Number of pages: 15
Journal: International Journal of Machine Learning and Cybernetics
Volume: 14
Issue number: 7
DOIs
State: Published - Jul 2023

Keywords

  • Action recognition
  • Multi-receptive field
  • Spatiotemporal
  • Visual tempo
