| Student | 鄭揚翰 Zheng, Yang-Han |
|---|---|
| Thesis title | 基於有效位元組合機制及可重構高性能乘法器之高面積效率深度神經網路加速器 (An Area-Efficient DNN Accelerator with Effective Bit Combination Mechanism and a Reconfigurable High-Performance Multiplier) |
| Advisor | 鄭桂忠 Tang, Kea-Tiong |
| Committee members | 黃朝宗 Huang, Chao-Tsung; 呂仁碩 Liu, Ren-Shuo; 盧峙丞 Lu, Chih-Cheng |
| Degree | 碩士 Master |
| Department | 電機資訊學院 (College of Electrical Engineering and Computer Science) - 電機工程學系 Department of Electrical Engineering |
| Year of publication | 2022 |
| Graduation academic year | 111 (ROC calendar) |
| Language | Chinese |
| Pages | 51 |
| Keywords (Chinese) | 深度神經網路加速器、有效位元組合機制、乘法器、高面積效率 |
| Keywords (English) | DNN accelerator, effective bit combination mechanism, multiplier, area-efficient |
Deep neural networks (DNNs) are widely used in a variety of tasks, such as image classification and speech recognition. When a DNN is deployed on an edge device, its inputs and weights are usually quantized, and the resulting data distribution shows clear regularities: most values contain many redundant bits, which lowers the utilization of the computation resources. This thesis proposes an area-efficient DNN accelerator with an effective bit combination mechanism and a reconfigurable high-performance multiplier, providing DNN acceleration support for mobile devices. Building on the modified Baugh-Wooley multiplier, the thesis presents a multiplier that can perform two 4-bit multiplications in one cycle while consuming only 1.57× the area and 2.31× the power of a conventional multiplier. Exploiting the distribution characteristics of DNN data, the thesis further proposes a gating scheme for weights of 0/-1/1 that reduces power consumption by 34.96%. An optimized dataflow achieves better reuse of inputs and weights and reduces memory accesses at a smaller area and lower power cost. In addition, an efficient convolution scheme with two strategies effectively raises processing-element utilization across various layer configurations. With the proposed techniques, the designed DNN accelerator achieves an area efficiency of 243.13 GOPS/mm².
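The thesis itself contains no code, but the base algorithm it extends is standard. Below is a minimal Python behavioral model of modified Baugh-Wooley signed multiplication, the textbook scheme on which the proposed reconfigurable dual-mode multiplier is built; it illustrates the partial-product structure only and is not the thesis's hardware design.

```python
# Behavioral model of an n-bit modified Baugh-Wooley signed multiplier.
# Generic textbook scheme for illustration, not the thesis's design.

def baugh_wooley(a: int, b: int, n: int) -> int:
    """Multiply two n-bit two's-complement integers using the modified
    Baugh-Wooley partial-product scheme; returns the signed 2n-bit product."""
    abits = [(a >> i) & 1 for i in range(n)]  # operand bits, LSB first
    bbits = [(b >> i) & 1 for i in range(n)]
    acc = 0
    for i in range(n):
        for j in range(n):
            pp = abits[i] & bbits[j]
            # Partial products involving exactly one sign bit are inverted.
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            acc += pp << (i + j)
    # Correction constants: a '1' at bit position n and at bit 2n-1.
    acc += (1 << n) + (1 << (2 * n - 1))
    acc &= (1 << (2 * n)) - 1            # keep 2n bits
    if acc >= 1 << (2 * n - 1):          # reinterpret as signed
        acc -= 1 << (2 * n)
    return acc

# Exhaustive check against ordinary multiplication for 4-bit operands.
n = 4
for a in range(-(1 << (n - 1)), 1 << (n - 1)):
    for b in range(-(1 << (n - 1)), 1 << (n - 1)):
        assert baugh_wooley(a, b, n) == a * b
```

Because every partial-product cell is an AND gate, with only the sign row and column inverted plus two constant bits, the array is highly regular, which is plausibly what makes it amenable to being reconfigured into two independent 4-bit sub-multipliers as the abstract describes.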
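The "effective bit" idea in the abstract rests on the observation that many quantized 8-bit values carry nothing but sign extension in their upper bits. A hedged sketch of that notion follows, with a hypothetical greedy pairing policy; the names `effective_bits` and `pair_short_weights` and the pairing rule are illustrative assumptions, not the thesis's actual combination mechanism.

```python
def effective_bits(x: int) -> int:
    """Minimum two's-complement width of x: the bits that remain after
    dropping redundant sign-extension bits (e.g. 3 -> 3 bits '011',
    -8 -> 4 bits '1000')."""
    return (x if x >= 0 else ~x).bit_length() + 1

def pair_short_weights(weights):
    """Hypothetical greedy scheduler: values that fit in 4 effective bits
    are paired so each pair can occupy the dual 4-bit mode of the
    reconfigurable multiplier; wider values are issued alone."""
    short = [w for w in weights if effective_bits(w) <= 4]
    wide = [w for w in weights if effective_bits(w) > 4]
    pairs = [tuple(short[i:i + 2]) for i in range(0, len(short), 2)]
    return pairs, wide

# Example: an 8-bit quantized weight stream where most values are small.
pairs, wide = pair_short_weights([3, -2, 7, 100, -8, 1, -77])
print(pairs)  # [(3, -2), (7, -8), (1,)] -- each pair shares one multiplier pass
print(wide)   # [100, -77] -- these need the full 8-bit mode
```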
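Likewise, the gating scheme exploits the fact that for a weight of 0, -1, or 1 the product needs no multiplier at all. A minimal sketch of the selection logic, assuming per-MAC gating; in hardware this would be operand or clock gating rather than a software branch, and `gated_mac` is an illustrative name.

```python
def gated_mac(x: int, w: int, acc: int) -> int:
    """One MAC step with trivial-weight gating: for w in {0, -1, 1}
    the multiplier is bypassed and a cheap select/negate is used."""
    if w == 0:
        return acc          # multiplier fully gated off
    if w == 1:
        return acc + x      # pass-through, no multiply
    if w == -1:
        return acc - x      # negate only
    return acc + x * w      # full multiply for all other weights

assert gated_mac(5, -1, 10) == 5   # 10 + 5 * (-1)
```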