
Graduate Student: 陳廖顓
Chen, Liao-Chuan
Thesis Title: 基於 SRAM 存取優化方法以及可重構處理單元陣列的 1.93TOPS/W 卷積神經網絡加速器
A 1.93TOPS/W Convolutional Neural Network Accelerator with a Reconfigurable Processing Element Array Based on SRAM Access Optimization
Advisor: 鄭桂忠
Tang, Kea-Tiong
Committee Members: 黃朝宗
Huang, Chao-Tsung
呂仁碩
Liu, Ren-Shuo
盧峙丞
Lu, Chih-Cheng
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of Publication: 2022
Graduation Academic Year: 111
Language: Chinese
Pages: 41
Keywords: accelerator, deep learning
In recent years, with the rapid development of deep neural networks, artificial intelligence has been widely deployed in many everyday applications. As application complexity grows, so does the number of parameters that deep neural networks require. When performing inference on edge devices with limited battery capacity, the large parameter count and computational load incur significant data-movement energy, which limits how long an edge device can operate. Reducing data movement is therefore an important problem.
    In this work, exploiting the fact that the network parameters and the input feature-map sizes differ from layer to layer, an on-chip memory access optimization method is proposed to select a better dataflow for each layer's computation; the data-reuse ratio of the processing element array is then adjusted to reduce the number of on-chip memory accesses. Based on this algorithm, this work proposes a reconfigurable processing element array that adopts an output-stationary dataflow and includes a zero-gating mechanism that skips computations whose operands are zero, further lowering the energy required for inference.
    To keep the accelerator performing well across its multiple operating modes, this work proposes a fused data memory banking scheme so that the processing element array maintains high utilization under different network characteristics. With the algorithm and hardware design described above, the accelerator achieves an energy efficiency of 1.93 TOPS/W.

Abstract (Chinese) ... i
Abstract (English) ... ii
Acknowledgements ... iv
Table of Contents ... v
List of Figures ... vii
List of Tables ... x
Chapter 1 Introduction ... 1
1.1 Research Background ... 1
1.2 Motivation and Objectives ... 4
1.3 Chapter Overview ... 6
Chapter 2 Literature Review ... 7
2.1 Computational Characteristics of Deep Learning ... 7
2.1.1 Convolution Operations ... 7
2.1.2 Data-Movement Energy ... 8
2.1.3 Dataflows ... 9
2.2 Deep Learning Accelerators ... 10
2.3 Research Motivation ... 13
Chapter 3 SRAM Access Optimization Algorithm ... 14
3.1 Decomposition Analysis of Convolution Operations ... 14
3.2 SRAM Access Optimization Algorithm ... 19
Chapter 4 Reconfigurable Processing Element Array Accelerator ... 23
4.1 System Architecture and Dataflow ... 23
4.2 Reconfigurable Processing Element Array ... 25
4.2.1 Dataflow within the Processing Elements ... 26
4.2.2 Multi-Mode Operation Design ... 27
4.3 Fused Data Memory Banks ... 29
4.4 Data Router ... 30
Chapter 5 Experimental Results ... 31
5.1 SRAM Power Analysis ... 31
5.1.1 Impact of Dataflow ... 31
5.1.2 Impact of the SRAM Access Optimization Algorithm ... 32
5.2 FPGA System Verification ... 33
5.3 Post-Layout Simulation Results ... 35
5.4 Comparison with Other Accelerators ... 38
Chapter 6 Conclusion and Future Work ... 39
References ... 40

    [1] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
    [2] A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
    [3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
    [4] C. Szegedy et al., "Going deeper with convolutions," in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9.
    [5] K. He et al., "Deep residual learning for image recognition," in CVPR, 2016.
    [6] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in CVPR, 2018.
    [7] M. Horowitz, "Computing's energy problem (and what we can do about it)," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 10-14.
    [8] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proc. 19th Int. Conf. Architectural Support Program. Lang. Operating Syst. (ASPLOS), Mar. 2014, pp. 269-284.
    [9] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, Dec. 2017.
    [10] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 367-379.
    [11] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2017, pp. 45-54.
    [12] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354-1367, Jul. 2018.
    [13] J. Sim, S. Lee, and L. S. Kim, "An energy-efficient deep convolutional neural network inference processor with enhanced output stationary dataflow in 65-nm CMOS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 1, pp. 87-100, Jan. 2020.
    [14] B. Huang et al., "IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 68, no. 11, pp. 4672-4685, 2021.
    [15] S. Yin et al., "A high energy efficient reconfigurable hybrid neural network processor for deep learning applications," IEEE J. Solid-State Circuits, vol. 53, no. 4, pp. 968-982, 2018.
