| Field | Value |
|---|---|
| Graduate student | Ling, Yin-Chun (凌胤淳) |
| Thesis title | Designing A Compact Convolutional Neural Network Processor on Embedded FPGAs (適用於嵌入式FPGA的輕量化卷積類神經網絡處理器之設計方法) |
| Advisor | Tsay, Ren-Song (蔡仁松) |
| Committee members | Mak, Wai-Kei (麥偉基); Liu, Ren-Shuo (呂仁碩); Ho, Tsung-Yi (何宗易) |
| Degree | Master |
| Department | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of publication | 2020 |
| Academic year | 108 |
| Language | English |
| Pages | 57 |
| Keywords (Chinese) | neural network processor, field-programmable gate array, SW/HW co-design, design optimization, electronic system-level design methodology |
| Keywords (English) | CNN Processor, SW/HW Co-design Methodology, Design Optimization, Electronic System-Level Design |
Because they deliver highly parallel computation and rapid deployment, FPGA-based convolutional neural network processors are seeing increasingly wide use. Designing on embedded FPGAs, however, involves many considerations, including the limited reconfigurable logic resources on the FPGA, the transfer latency introduced by external memory, and the coordination between the data-transfer and computation units; these factors in turn limit FPGA adoption. To address these problems, we propose a systematic design methodology that makes fast deployment feasible: the logic and storage units are configured through parameterization so that a design satisfying the target platform can be found quickly, and resource and timing models are provided for rapid verification.

To validate the proposed methodology, we built an image recognition application, YOLOv2, on the PYNQ-Z1 platform, achieving a throughput of 48.23 GOPs and an execution time of 0.611 seconds. Running the same inference, this is 42.38 and 12.8 times faster than a CPU and a GPU, respectively, and 2.36 times faster than comparable FPGA designs. In addition, our prediction model deviates from the measured results by only 5-22%, a nearly 60% reduction in error compared with previous work.
FPGA-based Convolutional Neural Network (CNN) processors have been widely adopted for their highly parallelized computation and fast deployment. However, designing on an embedded FPGA requires attention to multiple aspects, such as the limited configurable resources on the FPGA, external memory latency, and the scheduling between memory and computation units. These considerations hinder broader FPGA adoption. To address these issues, we elaborate a systematic design approach that enables fast deployment, comprising parameterized computation and memory units, which can be configured for the target platform, and an evaluation approach for searching for the optimal settings.

To evaluate the proposed approach, we implemented an object detection task, YOLOv2, on the PYNQ-Z1. We achieved 48.23 GOPs throughput, which is 42 and 13 times faster than performing the same inference on a CPU and a GPU, respectively, and 2.4 times faster than other published FPGA implementations. Additionally, our evaluation model deviates by only 5-22% from the implementation results, which is 60% less than previous work.
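The evaluation flow outlined above (parameterize the compute and memory units, model their resource usage and latency, then search for the fastest feasible configuration) can be sketched in a few lines. This is a minimal illustrative sketch, not the thesis's actual models: the tiling parameters `tm`/`tn`, the cost formulas, and the Zynq-7020-style resource budgets are all assumptions made for the example.

```python
# Hypothetical sketch of a parameterized design-space search: enumerate
# compute/memory unit settings, estimate resources and latency with simple
# analytical models, and keep the fastest design that fits the device.
from itertools import product

# Assumed budgets in the style of a PYNQ-Z1 (Zynq-7020) device
DSP_BUDGET = 220        # DSP48 slices
BRAM_BUDGET = 140       # 36 Kb block RAMs

def estimate(tm, tn, layer):
    """Estimate (DSPs, BRAMs, cycles) for one conv layer under tiling (tm, tn)."""
    M, N, R, C, K = layer                  # out ch, in ch, out rows, out cols, kernel
    dsps = tm * tn                         # one MAC per parallel in/out-channel pair
    brams = 2 * tm + 2 * tn                # double-buffered input/output tiles (rough)
    cycles = -(-M // tm) * -(-N // tn) * R * C * K * K   # ceil-divided channel loops
    return dsps, brams, cycles

def search(layers, tile_range):
    """Return the (tm, tn, total_cycles) of the fastest feasible configuration."""
    best = None
    for tm, tn in product(tile_range, tile_range):
        worst_dsp = worst_bram = total_cycles = 0
        for layer in layers:
            d, b, c = estimate(tm, tn, layer)
            worst_dsp, worst_bram = max(worst_dsp, d), max(worst_bram, b)
            total_cycles += c
        if worst_dsp <= DSP_BUDGET and worst_bram <= BRAM_BUDGET:
            if best is None or total_cycles < best[2]:
                best = (tm, tn, total_cycles)
    return best

# Toy YOLOv2-like layer shapes: (M, N, R, C, K)
layers = [(32, 16, 208, 208, 3), (64, 32, 104, 104, 3)]
print(search(layers, range(2, 16)))
```

A real flow would replace `estimate` with calibrated per-unit resource and latency models (this is where the 5-22% prediction error quoted above would be measured) and add external-memory bandwidth to the feasibility check.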