| Field | Value |
|---|---|
| Graduate Student | 吳孟潔 (Wu, Meng-Chieh) |
| Thesis Title | 使用多方知識蒸餾在深度類神經卷積網路進行壓縮視訊動作辨識 (Multi-teacher Knowledge Distillation for Compressed Video Action Recognition on Deep Neural Networks) |
| Advisor | 邱瀞德 (Chiu, Ching-Te) |
| Committee Members | 張隆紋 (Chang, Long-Wen), 楊家輝 (Yang, Jar-Ferr), 范倫達 (Van, Lan-Da) |
| Degree | Master |
| Department | |
| Year of Publication | 2018 |
| Academic Year of Graduation | 106 |
| Language | English |
| Number of Pages | 61 |
| Keywords (Chinese) | 深度類神經網路壓縮、動作辨識、知識蒸餾、遷移學習 |
| Keywords (English) | Deep Convolutional Model Compression, Action Recognition, Knowledge Distillation, Transfer Learning |
Human action recognition has a very wide range of applications, such as smart surveillance and smart homes, and running these applications on embedded systems imposes real-time and low-power constraints. In recent years, deep convolutional neural networks have achieved remarkable results in image classification, and a growing body of research applies them to action recognition. An action, however, is not a single static image; it takes multiple temporally consecutive frames to express a complete action. To learn spatial and temporal features at the same time, most mainstream approaches use multiple models to learn appearance and motion features separately and fuse their outputs at the end, which greatly increases the number of parameters. In addition, most methods represent motion with optical flow, whose computation is extremely heavy, making the whole model cumbersome and slow. Other approaches stack multiple consecutive frames as input to a single 3D convolutional network to capture spatial and temporal features simultaneously, but consecutive frames contain a great deal of redundant information, and 3D convolution again greatly increases the number of parameters. None of these methods performs action recognition efficiently. The currently most efficient method, CoViAR, takes compressed video as input and uses the motion information it already contains to replace the computationally expensive optical flow, greatly improving running time. However, this model still has a large number of parameters and requires about 310 MB of storage.
We propose a multi-teacher knowledge distillation framework to compress this model and improve its overall speed. Knowledge distillation is a transfer-learning technique that transfers the knowledge learned by a large model into a smaller one, making it easier to run on embedded systems with real-time and low-power constraints. The multiple teachers arise because the model contains several convolutional networks, each learning a different kind of spatial or temporal information carried by the video compression format, and combines their outputs for the final prediction. By integrating knowledge from these different aspects, the student model receives more comprehensive information, and our multi-teacher knowledge distillation lets the small model learn better than single-teacher distillation. Evaluated on the UCF-101 dataset, our method achieves a compression ratio of about 2.4×, requiring about 125 MB of storage, and reduces running time by about 1.9×, with an accuracy drop of about 2.14%.
Human action recognition has been an active research topic in computer vision due to its wide range of applications, such as smart surveillance, smart homes, and health care monitoring. Implementing these applications on VLSI or embedded computing systems imposes low-power and real-time requirements.
Recently, convolutional networks (ConvNets) have made great progress in image classification, and they have also been applied to action recognition. Action recognition differs from still-image classification, however: video data contains temporal information that plays an important role in video understanding. Most current approaches use multiple CNNs to learn spatial and temporal features separately and then fuse their results for the final prediction, which greatly increases the number of parameters. Moreover, most of these methods take dense optical flow as the motion representation, so the computational cost is excessive, making the whole model cumbersome and slow. Other approaches learn spatio-temporal features by stacking multiple consecutive frames as input to a single 3D ConvNet, but consecutive frames are highly redundant, and 3D convolution causes an explosion in parameters and computation time. These methods are therefore unable to perform action recognition efficiently. The most efficient current method, CoViAR, trains a deep network directly on the compressed video, whose motion information replaces the computationally expensive optical flow field. However, this model has a large number of parameters and requires about 300 MB of storage.
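The following is a minimal sketch, not the thesis's exact implementation, of the compressed-video pipeline described above: one CNN per compressed-video modality (I-frames, motion vectors, residuals), with the per-modality class scores fused at the end. The backbone choices, fusion weights, and the assumption of 3-channel inputs for every modality are illustrative assumptions.

```python
import torch.nn as nn
import torchvision.models as models

class CompressedVideoLateFusion(nn.Module):
    """Illustrative late-fusion classifier over compressed-video modalities."""

    def __init__(self, num_classes=101, fusion_weights=(1.0, 1.0, 1.0)):
        super().__init__()
        # One backbone per modality. A large network handles the I-frames,
        # while lighter networks handle motion vectors and residuals; the
        # exact architectures here are placeholders, not the original design.
        self.iframe_net = models.resnet152(num_classes=num_classes)
        self.mv_net = models.resnet18(num_classes=num_classes)
        self.residual_net = models.resnet18(num_classes=num_classes)
        self.fusion_weights = fusion_weights

    def forward(self, iframe, motion_vector, residual):
        # Each stream predicts class scores from its own modality; the final
        # prediction is a weighted sum of the scores (late fusion). Channel
        # adaptation of the first conv layers is omitted for brevity.
        return (
            self.fusion_weights[0] * self.iframe_net(iframe)
            + self.fusion_weights[1] * self.mv_net(motion_vector)
            + self.fusion_weights[2] * self.residual_net(residual)
        )
```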
We propose a multi-teacher knowledge distillation framework for compressed video action recognition to compress this model. With this framework, the model is compressed by transferring the knowledge of multiple teachers into a single small student model. We integrate the knowledge from different teachers with various input types and teach the student with this comprehensive knowledge. With multi-teacher knowledge distillation, the student learns better than with single-teacher knowledge distillation. Experiments on the UCF-101 dataset show that we reach a 2.4× compression rate, requiring about 125 MB of storage, and a 1.2× computation reduction, with an accuracy loss of about 2.14%.
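Below is a minimal sketch of a multi-teacher distillation objective of the kind the abstract describes, assuming the standard soft-target formulation of knowledge distillation (temperature-softened teacher outputs plus a cross-entropy term on the hard labels). The function name, the temperature and alpha defaults, and the uniform teacher weighting are illustrative assumptions, not the thesis's exact training recipe.

```python
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, teacher_logits_list,
                                    labels, temperature=4.0, alpha=0.7,
                                    teacher_weights=None):
    """Student matches a weighted average of the teachers' softened
    class distributions while also fitting the ground-truth labels."""
    if teacher_weights is None:
        teacher_weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)

    # Combine the teachers' knowledge into a single soft target distribution.
    soft_targets = sum(
        w * F.softmax(t / temperature, dim=1)
        for w, t in zip(teacher_weights, teacher_logits_list)
    )

    # KL divergence between the softened student outputs and the combined
    # targets; the T^2 factor keeps gradients comparable across temperatures.
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    distill = F.kl_div(log_student, soft_targets,
                       reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * distill + (1.0 - alpha) * ce
```

In this sketch the student would be trained by feeding the same compressed-video sample to every teacher, collecting their logits, and minimizing the returned loss; only the small student is kept at deployment time.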