TY - GEN
T1 - On the importance of network architecture in training very deep neural networks
AU - Chi, Zhizhen
AU - Li, Hongyang
AU - Wang, Jingjing
AU - Lu, Huchuan
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/11/22
Y1 - 2016/11/22
N2 - Very deep neural networks with hundreds or more layers have achieved significant success in a variety of vision tasks, spanning image classification, detection, and image captioning. However, simply stacking more convolutional layers can suffer from the vanishing gradient problem and thus fail to further reduce the training loss. The residual network [1] pushes the model's depth to extreme levels by proposing an identity mapping plus a residual learning term, effectively addressing the gradient back-propagation bottleneck. In this paper, we investigate the residual module in depth by analyzing the ordering of its constituent blocks and modifying them one by one to achieve lower test error on the CIFAR-10 dataset. One key observation is that removing the original ReLU activation facilitates gradient propagation along the identity mapping path. Moreover, inspired by the ResNet block, we propose a random-jump scheme that skips some residual connections during training, i.e., lower-level features can jump to any subsequent layer, bypassing intermediate transformations directly to the higher level. Such an upgrade to the network structure not only saves training time but also improves performance.
AB - Very deep neural networks with hundreds or more layers have achieved significant success in a variety of vision tasks, spanning image classification, detection, and image captioning. However, simply stacking more convolutional layers can suffer from the vanishing gradient problem and thus fail to further reduce the training loss. The residual network [1] pushes the model's depth to extreme levels by proposing an identity mapping plus a residual learning term, effectively addressing the gradient back-propagation bottleneck. In this paper, we investigate the residual module in depth by analyzing the ordering of its constituent blocks and modifying them one by one to achieve lower test error on the CIFAR-10 dataset. One key observation is that removing the original ReLU activation facilitates gradient propagation along the identity mapping path. Moreover, inspired by the ResNet block, we propose a random-jump scheme that skips some residual connections during training, i.e., lower-level features can jump to any subsequent layer, bypassing intermediate transformations directly to the higher level. Such an upgrade to the network structure not only saves training time but also improves performance.
UR - https://www.scopus.com/pages/publications/85006915140
U2 - 10.1109/ICSPCC.2016.7753635
DO - 10.1109/ICSPCC.2016.7753635
M3 - Conference contribution
AN - SCOPUS:85006915140
T3 - ICSPCC 2016 - IEEE International Conference on Signal Processing, Communications and Computing, Conference Proceedings
BT - ICSPCC 2016 - IEEE International Conference on Signal Processing, Communications and Computing, Conference Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2016 IEEE International Conference on Signal Processing, Communications and Computing, ICSPCC 2016
Y2 - 5 August 2016 through 8 August 2016
ER -