Graduate Student: Chou, Hsin-Hung (周信閎)
Thesis Title: Deep Neural Network Compression with Knowledge Distillation Using Cross-Layer Matrix, KL Divergence and Offline Ensemble (基於交叉層矩陣和KL散度與離線集合的知識蒸餾之深度類神經網路壓縮)
Advisor: Chiu, Ching-Te (邱瀞德)
Committee Members: Chang, Long-Wen (張隆紋); Soo, Von-Wum (蘇豐文)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Communications Engineering
Year of Publication: 2019
Graduation Academic Year: 108 (ROC calendar; 2019-2020)
Language: English
Pages: 63
Keywords (Chinese): deep neural network compression (深度類神經網路壓縮), knowledge distillation (知識蒸餾), transfer learning (遷移學習)
Keywords (English): Deep Convolutional Model Compression, Knowledge Distillation, Transfer Learning
Deep neural networks have demonstrated strong capabilities on a wide variety of tasks. However, because these networks contain large numbers of parameters and require heavy computation, large deep neural networks are difficult to deploy on mobile devices, so model compression is unavoidable. Among the many compression methods, most start from a pre-trained model and search for the model with the fewest parameters and least computation whose Top-1 accuracy drops by no more than 1%. One distinct approach is knowledge distillation: a model called the teacher transfers knowledge, through a designed loss function, to a model called the student, with the expectation that the taught student achieves higher Top-1 accuracy than a student trained without a teacher; the compression rate is the ratio of the teacher model's size to the student model's size.

We therefore propose an effective model compression method that can be decomposed into three sub-methods. First, building on the Gramian matrix proposed by FSP [1], we derive an effective compression method that extracts knowledge from the teacher model, and we propose a cross-layer matrix to capture more features. Second, on the final predictions of the teacher and student models, we adopt, in an offline environment, the KL divergence (Kullback-Leibler divergence) proposed by the online method DML [2], which encourages the student model to find a wider minimum. Finally, we propose an offline ensemble that teaches the student with the randomly averaged outputs of multiple pre-trained teacher models. In addition, we use a 1x1 convolution layer to resolve the Gramian-matrix dimension limitation of our method, and we propose two-stage knowledge distillation to avoid losing knowledge. On the CIFAR-100 dataset we ran two kinds of experiments, one with VGG and one with ResNet. With VGG-11 as the teacher and VGG-6 as the student, Top-1 accuracy improves by 3.57%, with a 2.08x parameter compression rate and a 3.50x computation compression rate. With ResNet-32 as the teacher and ResNet-8 as the student, Top-1 accuracy improves by 4.38%, with a 6.11x parameter compression rate and a 5.27x computation compression rate. On the larger ImageNet64x64 dataset, with MobileNet-16 as the teacher and MobileNet-9 as the student, Top-1 accuracy improves by 3.98%, with a 1.59x parameter compression rate and a 2.05x computation compression rate.
Deep neural networks (DNNs) have solved many tasks, including image classification, object detection, and semantic segmentation. However, these models carry huge numbers of parameters and heavy computation, which makes them difficult to deploy on mobile devices. Compressing DNN models is therefore necessary. Among the many compression methods, most start from a pre-trained model and search for the model with the fewest parameters and least computation whose Top-1 accuracy drops by less than 1%. A distinct approach, knowledge distillation, uses a model named the teacher and a designed loss function to transfer knowledge to a model named the student. Knowledge distillation can reach higher Top-1 accuracy than a student trained from scratch, and the compression rate is the ratio of the teacher model's size to the student model's size.
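The abstract does not spell out the exact loss, but the classic soft-target formulation of Hinton et al. [8] is the usual baseline for this teacher-student transfer. Below is a minimal NumPy sketch of that baseline, not the thesis's own loss; the temperature `T` and mixing weight `alpha` are illustrative choices.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target distillation loss in the style of Hinton et al. [8]:
    cross-entropy against the teacher's softened distribution, mixed with
    cross-entropy against the ground-truth labels."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft_loss = -(p_teacher * log_p_student).sum(axis=-1).mean() * (T ** 2)

    log_p_hard = np.log(softmax(student_logits) + 1e-12)
    hard_loss = -log_p_hard[np.arange(len(labels)), labels].mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss
```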
We therefore propose an efficient compression method that can be split into three parts. First, based on FSP [1], which adopts a Gramian matrix to extract knowledge from the teacher model, we propose a cross-layer matrix to extract more features.
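FSP [1] summarizes the "flow" between two layers as a Gramian of their feature maps. The sketch below shows that Gramian computation and an L2 matching loss; how this thesis pairs layers across the network (its cross-layer matrix) is not specified in the abstract, so the pairing and the assumption that matched teacher and student Gramians share a shape are illustrative only.

```python
import numpy as np

def fsp_matrix(feat_a, feat_b):
    """FSP-style Gramian between two feature maps [1].

    feat_a: (h, w, m) activations from one layer
    feat_b: (h, w, n) activations from another layer
    Returns an (m, n) matrix of spatially averaged inner products.
    """
    h, w, m = feat_a.shape
    _, _, n = feat_b.shape
    a = feat_a.reshape(h * w, m)
    b = feat_b.reshape(h * w, n)
    return a.T @ b / (h * w)

def fsp_loss(teacher_pairs, student_pairs):
    """L2 distance between corresponding teacher and student Gramians.
    Assumes each matched pair yields Gramians of the same shape."""
    loss = 0.0
    for (ta, tb), (sa, sb) in zip(teacher_pairs, student_pairs):
        loss += np.mean((fsp_matrix(ta, tb) - fsp_matrix(sa, sb)) ** 2)
    return loss
```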
Second, based on the online method DML (Deep Mutual Learning) [2], which uses the KL divergence to transfer knowledge, we adopt the KL divergence in an offline environment so that the student model can find a wider, more robust minimum.
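As a rough sketch of this offline use of the DML-style KL term [2]: the teacher is a frozen, pre-trained model, so its prediction is a fixed target and only the student would receive gradients. The direction KL(teacher || student) is an assumption, since the abstract does not state it.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def offline_kl_loss(student_logits, teacher_logits, eps=1e-12):
    """KL(teacher || student) with a frozen, pre-trained teacher.

    In DML [2] two networks learn mutually online; here the teacher's
    distribution is a fixed target computed once, offline.
    """
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    return (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=-1).mean()
```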
Finally, we propose an offline ensemble in which multiple pre-trained teachers jointly teach a single student model.
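The Chinese abstract describes this as a "random average" over pre-trained teachers. One plausible reading, sketched below, averages the softmax outputs of a randomly chosen subset of teachers to form the student's soft target; the subsampling rule is an assumption, not the thesis's exact procedure.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_teacher_target(teacher_logits_list, rng=None, k=None):
    """Average the predictions of several pre-trained teachers.

    If rng and k are given, first subsample k teachers at random,
    one possible interpretation of the abstract's 'random average'.
    """
    if rng is not None and k is not None:
        idx = rng.choice(len(teacher_logits_list), size=k, replace=False)
        teacher_logits_list = [teacher_logits_list[i] for i in idx]
    probs = [softmax(logits) for logits in teacher_logits_list]
    return np.mean(probs, axis=0)  # soft target for the student
```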
To overcome the dimension mismatch between teacher and student models, we adopt a 1x1 convolution, and we propose two-stage knowledge distillation to avoid losing knowledge.
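A 1x1 convolution is simply a per-position linear map across channels, which is why it can reconcile mismatched channel counts before Gramians are compared. The sketch below, including the example shapes and the direction of projection, is illustrative only.

```python
import numpy as np

def conv1x1(feat, weight):
    """1x1 convolution: a per-position linear map across channels.

    feat:   (h, w, c_in) feature map
    weight: (c_in, c_out) kernel
    Returns an (h, w, c_out) map, e.g. projecting a student feature map
    to the teacher's channel count so their Gramians match in shape.
    """
    h, w, c_in = feat.shape
    out = feat.reshape(h * w, c_in) @ weight
    return out.reshape(h, w, -1)

# Hypothetical usage: lift a 64-channel student map to 128 channels
# to match a teacher layer before computing the FSP-style Gramian.
rng = np.random.default_rng(0)
student_feat = rng.standard_normal((8, 8, 64))
proj = rng.standard_normal((64, 128)) * 0.1
matched = conv1x1(student_feat, proj)  # shape (8, 8, 128)
```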
On the CIFAR-100 dataset, we evaluate VGG and ResNet models. With VGG-11 as the teacher and VGG-6 as the student, Top-1 accuracy increases by 3.57%, with a 2.08x parameter compression rate and a 3.50x computation reduction. With ResNet-32 as the teacher and ResNet-8 as the student, Top-1 accuracy increases by 4.38%, with a 6.11x parameter compression rate and a 5.27x computation reduction. We also experiment on ImageNet64x64: with MobileNet-16 as the teacher and MobileNet-9 as the student, Top-1 accuracy increases by 3.98%, with a 1.59x parameter compression rate and a 2.05x computation reduction.
[1] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.4133–4141.
[2] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4320–4328.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 248–255.
[4] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman,“The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
[6] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-r2n2: A unified approach for single and multi-view 3d object reconstruction,” in European conference on computer vision. Springer, 2016, pp. 628–644.
[7] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision. Springer, 2012, pp. 746–760.
[8] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[9] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
[10] T. Chen, I. Goodfellow, and J. Shlens, “Net2net: Accelerating learning via knowledge transfer,” arXiv preprint arXiv:1511.05641, 2015.
[11] S. H. Lee, D. H. Kim, and B. C. Song, “Self-supervised knowledge distillation using singular value decomposition,” in European Conference on Computer Vision. Springer, 2018, pp. 339–354.
[12] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge distillation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3967–3976.
[13] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton, “Large scale distributed neural network training through online distillation,” 2018.
[14] X. Lan, X. Zhu, and S. Gong, “Knowledge distillation by on-the-fly native ensemble,” arXiv preprint arXiv:1806.04606, 2018.
[15] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, “Born again neural networks,” arXiv preprint arXiv:1805.04770, 2018.
[16] X. Lan, X. Zhu, and S. Gong, “Self-referenced deep learning,” in Asian Conference on Computer Vision. Springer, 2018, pp. 284–300.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[21] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[22] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[24] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
[25] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang, “Structured knowledge distillation for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2604–2613.
[26] M.-C. Wu, C.-T. Chiu, and K.-H. Wu, “Multi-teacher knowledge distillation for compressed video action recognition on deep neural networks,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2202–2206.
[27] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[28] S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network construction with back-propagation,” in Advances in neural information processing systems, 1989, pp. 177–185.
[29] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, 2015, pp. 1135–1143.
[30] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[31] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[32] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in Advances In Neural Information Processing Systems, 2016, pp. 1379–1387.
[33] F. Tung and G. Mori, “Clip-q: Deep network compression learning by inparallel pruning-quantization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7873–7882.
[34] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
[35] T.-Y. Hsiao, Y.-C. Chang, H.-H. Chou, and C.-T. Chiu, “Filter-based deep compression with global average pooling for convolutional networks,” Journal of Systems Architecture, vol. 95, pp. 9–18, 2019.
[36] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, “Network trimming: A data driven neuron pruning approach towards efficient deep architectures,” arXiv preprint arXiv:1607.03250, 2016.
[37] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient transfer learning,” arXiv preprint arXiv:1611.06440, vol. 3, 2016.
[38] M. Denil, B. Shakibi, L. Dinh, N. De Freitas et al., “Predicting parameters in deep learning,” in Advances in neural information processing systems, 2013, pp. 2148–2156.
[39] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in neural information processing systems, 2014, pp. 1269–1277.
[40] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” arXiv preprint arXiv:1405.3866, 2014.
[41] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 1943–1955, 2015.
[42] S. Basu and L. R. Varshney, “Universal source coding of deep neural networks,” in 2017 Data Compression Conference (DCC). IEEE, 2017, pp. 310–319.
[43] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from rgb-d images for object detection and segmentation,” in European Conference on Computer Vision. Springer, 2014, pp. 345–360.
[44] T. Dettmers, “8-bit approximations for parallelism in deep learning,” arXiv preprint arXiv:1511.04561, 2015.
[45] K. Hwang and W. Sung, “Fixed-point feedforward deep neural network design using weights +1, 0, and −1,” in 2014 IEEE Workshop on Signal Processing Systems (SiPS). IEEE, 2014, pp. 1–6.
[46] Z. Ji, I. Ovsiannikov, Y. Wang, L. Shi, and Q. Zhang, “Reducing weight precision of convolutional neural networks towards large-scale on-chip image recognition,” in Independent Component Analyses, Compressive Sampling, Large Data Analyses (LDA), Neural Networks, Biosystems, and Nano engineering XIII, vol. 9496. International Society for Optics and Photonics, 2015, p. 94960A.
[47] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma, “Dual learning for machine translation,” in Advances in Neural Information Processing Systems, 2016, pp. 820–828.
[48] T. Batra and D. Parikh, “Cooperative learning with visual attributes,” arXiv preprint arXiv:1705.05512, 2017.
[49] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” 2012.
[50] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[51] P. Chrabaszcz, I. Loshchilov, and F. Hutter, “A downsampled variant of imagenet as an alternative to the cifar datasets,” arXiv preprint arXiv:1707.08819, 2017.
[52] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
[53] J. Kiefer, J. Wolfowitz et al., “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952.
[54] Y. Nesterov, “A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2),” in Doklady AN USSR, vol. 269, 1983, pp. 543–547.
[55] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.