TY - JOUR
T1 - Group Activity Representation Learning With Long-Short States Predictive Transformer
AU - Kong, Longteng
AU - Zhou, Wanting
AU - Pei, Duoxuan
AU - He, Zhaofeng
AU - Huang, Di
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2023/12/1
Y1 - 2023/12/1
N2 - This paper aims to learn group activity representations in a self-supervised fashion rather than through conventional methods that rely on manually annotated labels. Essential to this task is better describing complex group states and their future transitions. To this end, we propose a long-short state predictive Transformer (LSSPT), which mines meaningful spatiotemporal features of group activities by predicting future group states from long- and short-term historical state dynamics. LSSPT consists of an encoder that models diverse spatiotemporal state representations over the observation, together with a decoder that exploits rich dynamic patterns by attending to both the short-term spatial context and the long-term evolution of historical states to predict future group states. Furthermore, we consider the distinguishability and consistency of the predicted states and introduce a joint learning mechanism to optimize the models, enabling LSSPT to describe more reliable state transitions. Finally, extensive experiments evaluate the learned representations on downstream tasks on the Volleyball, Collective Activity, and VolleyTactic datasets, demonstrating state-of-the-art performance over existing self-supervised learning approaches.
AB - This paper aims to learn group activity representations in a self-supervised fashion rather than through conventional methods that rely on manually annotated labels. Essential to this task is better describing complex group states and their future transitions. To this end, we propose a long-short state predictive Transformer (LSSPT), which mines meaningful spatiotemporal features of group activities by predicting future group states from long- and short-term historical state dynamics. LSSPT consists of an encoder that models diverse spatiotemporal state representations over the observation, together with a decoder that exploits rich dynamic patterns by attending to both the short-term spatial context and the long-term evolution of historical states to predict future group states. Furthermore, we consider the distinguishability and consistency of the predicted states and introduce a joint learning mechanism to optimize the models, enabling LSSPT to describe more reliable state transitions. Finally, extensive experiments evaluate the learned representations on downstream tasks on the Volleyball, Collective Activity, and VolleyTactic datasets, demonstrating state-of-the-art performance over existing self-supervised learning approaches.
KW - Group activity representation learning
KW - group activity recognition
KW - self-supervised learning
KW - transformer
UR - https://www.scopus.com/pages/publications/85161034460
U2 - 10.1109/TCSVT.2023.3278984
DO - 10.1109/TCSVT.2023.3278984
M3 - Article
AN - SCOPUS:85161034460
SN - 1051-8215
VL - 33
SP - 7267
EP - 7281
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 12
M1 - 3278984
ER -