TY - JOUR
T1 - Progressive prediction
T2 - Video anomaly detection via multi-grained prediction
AU - Zeng, Xianlin
AU - Jiang, Yalong
AU - Wang, Yufeng
AU - Fu, Qiang
AU - Ding, Wenrui
N1 - Publisher Copyright:
© 2024 The Author(s). IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
PY - 2024/8/21
Y1 - 2024/8/21
AB - Video Anomaly Detection (VAD) has been an active research field for several decades. However, most existing approaches extract only a single type of feature from videos and define a single paradigm to indicate the extent of abnormalities. Here, a coarse-to-fine three-level prediction is built by integrating different levels of spatio-temporal representations, which better highlights the difference between normal and abnormal behaviors. First, an object-level trajectory prediction is proposed to model historical human positions using a graph transformer network. Subsequently, skeleton-level prediction is achieved by incorporating the positional information from the trajectory prediction. More importantly, based on the predicted skeleton, a skeleton-guided pixel-level region prediction is performed. A novel Skeleton Conditioned Generative Adversarial Network (SCGAN) is designed to explore the correlation between skeleton-level and pixel-level motion prediction. With SCGAN, the prediction of human regions draws on both coarse-grained and fine-grained motion features. This three-level prediction, namely Progressive Prediction Video Anomaly Detection (P3VAD), enlarges the prediction error on irregular motion patterns. In addition, a pixel-level analysis method is proposed to achieve Background-bias Elimination (BE) and denoise the predicted region. Experimental results validate the effectiveness of P3VAD on four benchmark datasets (ShanghaiTech, CUHK Avenue, IITB-Corridor, and ADOC).
KW - computer vision
KW - unsupervised learning
KW - video signal processing
KW - video surveillance
UR - https://www.scopus.com/pages/publications/85194927729
U2 - 10.1049/ipr2.13117
DO - 10.1049/ipr2.13117
M3 - Article
AN - SCOPUS:85194927729
SN - 1751-9659
VL - 18
SP - 2568
EP - 2583
JO - IET Image Processing
JF - IET Image Processing
IS - 10
ER -