TY - JOUR
T1 - Multi-scale supervised attentive encoder-decoder network for crowd counting
AU - Zhang, Anran
AU - Jiang, Xiaolong
AU - Zhang, Baochang
AU - Cao, Xianbin
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/4
Y1 - 2020/4
N2 - Crowd counting is a popular topic with widespread applications. Currently, the biggest challenge to crowd counting is large-scale variation in objects. In this article, we focus on overcoming this challenge by proposing a novel Attentive Encoder-Decoder Network (AEDN), which is supervised on multiple feature scales to conduct crowd counting via density estimation. This work has three main contributions. First, we augment the traditional encoder-decoder architecture with our proposed residual attention blocks, which, beyond skip-connected encoded features, further extend the decoded features with attentive features. AEDN is better at establishing long-range dependencies between the encoder and decoder, therefore promoting more effective fusion of multi-scale features for handling scale-variations. Second, we design a new KL-divergence-based distribution loss to supervise the scale-aware structural differences between two density maps, which complements the pixel-isolated MSE loss and better optimizes AEDN to generate high-quality density maps. Third, we adopt a multi-scale supervision scheme, such that multiple KL divergences and MSE losses are deployed at all decoding stages, providing more thorough supervisions for different feature scales. Extensive experimental results on four public datasets, including ShanghaiTech Part A, ShanghaiTech Part B, UCF-CC-50, and UCF-QNRF, reveal the superiority and efficacy of the proposed method, which outperforms most state-of-the-art competitors.
AB - Crowd counting is a popular topic with widespread applications. Currently, the biggest challenge to crowd counting is large-scale variation in objects. In this article, we focus on overcoming this challenge by proposing a novel Attentive Encoder-Decoder Network (AEDN), which is supervised on multiple feature scales to conduct crowd counting via density estimation. This work has three main contributions. First, we augment the traditional encoder-decoder architecture with our proposed residual attention blocks, which, beyond skip-connected encoded features, further extend the decoded features with attentive features. AEDN is better at establishing long-range dependencies between the encoder and decoder, therefore promoting more effective fusion of multi-scale features for handling scale-variations. Second, we design a new KL-divergence-based distribution loss to supervise the scale-aware structural differences between two density maps, which complements the pixel-isolated MSE loss and better optimizes AEDN to generate high-quality density maps. Third, we adopt a multi-scale supervision scheme, such that multiple KL divergences and MSE losses are deployed at all decoding stages, providing more thorough supervisions for different feature scales. Extensive experimental results on four public datasets, including ShanghaiTech Part A, ShanghaiTech Part B, UCF-CC-50, and UCF-QNRF, reveal the superiority and efficacy of the proposed method, which outperforms most state-of-the-art competitors.
KW - Self-attention
KW - representation fusion
KW - supervised method
UR - https://www.scopus.com/pages/publications/85084757935
U2 - 10.1145/3356019
DO - 10.1145/3356019
M3 - 文章
AN - SCOPUS:85084757935
SN - 1551-6857
VL - 16
JO - ACM Transactions on Multimedia Computing, Communications and Applications
JF - ACM Transactions on Multimedia Computing, Communications and Applications
IS - 1s
M1 - 28
ER -