
Author: 翁啟文 (Weng, Chi-Wen)
Title: Reconfigurable Convolution Engine with Cross-Shaped Sparse Kernels for Highly-Parallel CNN Acceleration
Advisor: 黃朝宗 (Huang, Chao-Tsung)
Committee Members: 呂仁碩 (Liu, Ren-Shuo); 王家慶 (Wang, Jia-Ching)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2021
Graduation Academic Year: 109 (ROC calendar, 2020-2021)
Language: English
Pages: 46
Keywords: Highly-parallel acceleration, CNN, Sparse kernel, Reconfigurability


Abstract:
    Convolutional neural networks (CNNs) have been extensively applied in computational imaging. Deploying them on resource-limited edge devices for real-time applications is challenging because they demand a huge amount of computation. To address this issue, Huang et al. proposed eCNN, a block-based highly-parallel CNN accelerator that infers HD-resolution images block by block with a highly-parallel computation scheme. However, to support higher specifications (e.g., 4K Ultra-HD at 30 fps) within the same computational resources, model complexity must be reduced by using shallower networks to raise pixel throughput, which severely degrades image quality. Fine-grained complexity-saving techniques such as weight pruning can instead eliminate unnecessary computations while preserving model depth. Nonetheless, pruning tends to produce irregularly sparse models, and such irregular sparsity is difficult for highly-parallel accelerators to exploit because of indexing overhead and load imbalance. These issues become even worse as the degree of parallelism increases.
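    To make the indexing overhead and load imbalance concrete, here is a minimal Python sketch (our illustration, not code from the thesis): randomly pruning a batch of 3x3 kernels leaves each kernel with a different number of nonzero weights, so parallel lanes that each handle one kernel finish at different times, and every surviving weight must carry an explicit position index.

        # Minimal sketch: unstructured pruning yields uneven nonzero counts
        # (load imbalance) and forces explicit index storage (indexing overhead).
        import numpy as np

        rng = np.random.default_rng(0)
        num_kernels, k = 64, 3                        # 64 kernels of size 3x3
        weights = rng.standard_normal((num_kernels, k, k))
        mask = rng.random((num_kernels, k, k)) > 0.5  # unstructured 50% pruning
        sparse = weights * mask

        nonzeros = mask.reshape(num_kernels, -1).sum(axis=1)
        print("nonzeros per kernel:", nonzeros.min(), "to", nonzeros.max())
        # Lanes assigned one kernel each must wait for the densest kernel, and
        # a compressed format stores a position index for every nonzero weight:
        indices = [np.flatnonzero(m) for m in mask.reshape(num_kernels, -1)]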
    In this thesis, we propose a cross-shaped sparse kernel design that regularly reduces model complexity without sacrificing model depth. Moreover, we present a highly-parallel reconfigurable convolution engine that can perform convolutions with different sparsity configurations on demand. Experiments show that, under different computational resource constraints, our method surpasses the conventional depth-reduction method on the state-of-the-art FFDNet, VDSR, and SRResNet models. Finally, we evaluate our method on the ERNet models as a case study. Under 4K Ultra-HD at 30 fps, our method achieves higher image quality than the depth-reduction method by 0.17 dB, 0.19 dB, and 0.07 dB in PSNR for denoising, two-times super-resolution, and four-times super-resolution, respectively. Meanwhile, synthesis results show that our highly-parallel reconfigurable convolution engine costs 9.85M gates in TSMC 40nm technology, with only 8.4% area overhead and 14.9% additional power consumption for reconfigurability.
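    For intuition, the sketch below shows one plausible form of a cross-shaped 3x3 kernel (an assumption on our part: the five taps on the center row and column are kept and the four corners are zeroed; CROSS_MASK and conv2d_cross are illustrative names, not the thesis implementation). Such a pattern cuts a 3x3 convolution from 9 to 5 multiply-accumulates per output pixel while leaving model depth untouched, and its regularity lets a dense engine support it by simply gating the corner products.

        # Minimal sketch of an assumed cross-shaped sparse 3x3 kernel:
        # center row and column kept, corners zeroed (9 MACs reduced to 5).
        import numpy as np

        CROSS_MASK = np.array([[0., 1., 0.],
                               [1., 1., 1.],
                               [0., 1., 0.]])

        def conv2d_cross(image, kernel):
            """Valid 2-D convolution with the four corner taps forced to zero."""
            kernel = kernel * CROSS_MASK          # enforce the cross pattern
            h, w = image.shape
            out = np.empty((h - 2, w - 2))
            for i in range(h - 2):
                for j in range(w - 2):
                    out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
            return out

        # Example: a dense 3x3 kernel applied in its cross-shaped form.
        img = np.arange(25, dtype=np.float64).reshape(5, 5)
        print(conv2d_cross(img, np.ones((3, 3))))  # uses only 5 of 9 weights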

Table of Contents:
    1 Introduction
      1.1 Motivation
      1.2 Related Work
        1.2.1 Highly-Parallel Computation Scheme in eCNN
        1.2.2 Architectures for Sparse CNNs
        1.2.3 Complexity-Saving Strategies for CNNs
      1.3 Thesis Organization
    2 Cross-Shaped Sparse Kernel Design for CNNs
      2.1 Sparse Kernel Design for Reconfigurability
        2.1.1 Exploiting Kernel Sparsity
        2.1.2 Center Sharing for Reconfigurability
      2.2 Evaluation on Modern Denoising and Super-Resolution Models
        2.2.1 Training Settings and Models
        2.2.2 Experimental Results
      2.3 Analysis of Hardware Overhead for Reconfigurability
        2.3.1 Highly-Parallel Convolution Engine
        2.3.2 Adjustment of Accumulating Pipeline
        2.3.3 Synthesis Results
      2.4 Quick Summary
    3 Implementation of Highly-Parallel Reconfigurable Convolution Engine for ERNets
      3.1 Target System and Specification
      3.2 Model Structures of ERNets
      3.3 Implementation Results
        3.3.1 Model Performance
        3.3.2 Synthesis Results
        3.3.3 Comparison with Other Works
    4 Conclusion and Future Work
      4.1 Conclusion
      4.2 Future Work

References:
    [1] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
    [2] K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN-based image denoising,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608–4622, 2018.
    [3] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
    [4] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 105–114.
    [5] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1132–1140.
    [6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
    [7] C.-T. Huang, Y.-C. Ding, H.-C. Wang, C.-W. Weng, K.-P. Lin, L.-W. Wang, and L.-D. Chen, “eCNN: A block-based and highly-parallel CNN accelerator for edge inference,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019.
    [8] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in International Conference on Learning Representations (ICLR), 2016.
[9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017.
    [10] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 243–254.
    [11] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 27–40.
    [12] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, “SparTen: A sparse tensor accelerator for convolutional neural networks,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019.
    [13] S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep convolutional neural networks,” CoRR, vol. abs/1512.08571, 2015.
    [14] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016.
    [15] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” in International Conference on Learning Representations (ICLR), 2017.
    [16] G. Schindler, W. Roth, F. Pernkopf, and H. Fröning, “Parameterized structured pruning for deep neural networks,” CoRR, vol. abs/1906.05180, 2019.
    [17] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally, “Exploring the regularity of sparse structure in convolutional neural networks,” CoRR, vol. abs/1705.08922, 2017.
[18] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size,” CoRR, vol. abs/1602.07360, 2016.
[19] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,” CoRR, vol. abs/1312.5851, 2014.
    [20] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[21] C.-T. Huang, “ERNet family: Hardware-oriented CNN models for computational imaging using block-based inference,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1643–1647.
[22] S. Wu, C. Y. Lin, M. C. Chiang, J. J. Liaw, J. Y. Cheng, S. H. Yang, M. Liang, T. Miyashita, C. H. Tsai, B. C. Hsu, H. Y. Chen, T. Yamamoto, S. Y. Chang, V. S. Chang, C. H. Chang, J. H. Chen, H. F. Chen, K. C. Ting, Y. K. Wu, K. H. Pan, R. F. Tsui, C. H. Yao, P. R. Chang, H. M. Lien, T. L. Lee, H. M. Lee, W. Chang, T. Chang, R. Chen, M. Yeh, C. C. Chen, Y. H. Chiu, Y. H. Chen, H. C. Huang, Y. C. Lu, C. W. Chang, M. H. Tsai, C. C. Liu, K. S. Chen, C. C. Kuo, H. T. Lin, S. M. Jang, and Y. Ku, “A 16nm FinFET CMOS technology for mobile SoC and computing applications,” in 2013 IEEE International Electron Devices Meeting (IEDM), 2013, pp. 9.1.1–9.1.4.
[23] S.-Y. Wu, J. J. Liaw, C. Y. Lin, M. C. Chiang, C. K. Yang, J. Y. Cheng, M. H. Tsai, M. Y. Liu, P. H. Wu, C. H. Chang, L. C. Hu, C. I. Lin, H. F. Chen, S. Y. Chang, S. H. Wang, P. Y. Tong, Y. L. Hsieh, K. H. Pan, C. H. Hsieh, C. H. Chen, C. H. Yao, C. C. Chen, T. L. Lee, C. W. Chang, H. J. Lin, S. C. Chen, J. H. Shieh, M. H. Tsai, S. M. Jang, K. S. Chen, Y. Ku, Y. C. See, and W. J. Lo, “A highly manufacturable 28nm CMOS low power platform technology with fully functional 64Mb SRAM using dual/triple gate oxide process,” in 2009 Symposium on VLSI Technology, 2009, pp. 210–211.
