
Author: Ling, Yin-Chun (凌胤淳)
Thesis Title: Designing A Compact Convolutional Neural Network Processor on Embedded FPGAs
Advisor: Tsay, Ren-Song (蔡仁松)
Committee Members: Mak, Wai-Kei (麥偉基); Liu, Ren-Shuo (呂仁碩); Ho, Tsung-Yi (何宗易)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of Publication: 2020
Academic Year of Graduation: 108 (ROC calendar, 2019-2020)
Language: English
Number of Pages: 57
Keywords: CNN Processor, FPGA, SW/HW Co-design Methodology, Design Optimization, Electronic System-Level Design
FPGA-based convolutional neural network (CNN) processors are seeing increasingly wide use because they deliver highly parallel computation and fast deployment. Designing on embedded FPGAs, however, involves many considerations, including the limited reconfigurable logic resources on the FPGA, the transfer latency introduced by external memory, and the coordination between data-transfer units and computation units; these factors in turn restrict the use of FPGAs. To address these problems, we propose a systematic design methodology that makes fast deployment feasible: a parameterized way to configure the logic and storage units, so that a design satisfying the target platform can be found quickly, together with resource and timing models for rapid evaluation.
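As a rough illustration of the kind of resource and timing modeling described above, the sketch below estimates the cycle count and DSP usage of one tiled convolution layer. The layer shape, the unrolling factors `Tm`/`Tn`, and the one-DSP-per-MAC cost are illustrative assumptions, not the thesis's actual model.

```python
import math

def conv_latency_cycles(M, N, R, C, K, Tm, Tn):
    """Estimated cycles for one KxK conv layer with M output channels,
    N input channels, and an RxC output map, when the channel loops are
    unrolled by Tm x Tn (one MAC issued per unit per cycle)."""
    return math.ceil(M / Tm) * math.ceil(N / Tn) * R * C * K * K

def dsp_usage(Tm, Tn, dsp_per_mac=1):
    """Estimated DSP blocks consumed by a Tm x Tn MAC array."""
    return Tm * Tn * dsp_per_mac

# Hypothetical 3x3 layer: 256 output channels, 128 input channels,
# 26x26 output map, unrolled 32x4.
cycles = conv_latency_cycles(256, 128, 26, 26, 3, 32, 4)
print(cycles, dsp_usage(32, 4))  # 1557504 cycles, 128 DSPs
```

Dividing the cycle estimate by the fabric clock frequency (e.g. 100 MHz) then gives a per-layer latency estimate that can be checked against the resource budget before synthesis.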

To validate the proposed design methodology, we implemented an object-detection application, YOLOv2, on the PYNQ-Z1 platform, achieving a throughput of 48.23 GOPs and an execution time of 0.611 seconds. For the same inference, this is 42.38× and 12.8× faster than a CPU and a GPU, respectively, and 2.36× faster than comparable FPGA designs. Moreover, our prediction model deviates from the measured results by only 5-22%, nearly 60% less error than prior work.
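As a quick arithmetic consistency check on the figures above (an aside, not part of the thesis): throughput multiplied by runtime gives the total operation count, which lines up with YOLOv2's commonly cited workload of roughly 29.5 GOP per 416×416 frame.

```python
throughput_gops = 48.23   # reported throughput, GOP/s
runtime_s = 0.611         # reported per-frame execution time, s
total_gop = throughput_gops * runtime_s
print(round(total_gop, 1))  # 29.5 GOP, consistent with YOLOv2's workload
```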


FPGA-based Convolutional Neural Network (CNN) processors have been widely adopted for their highly parallelized computation and fast deployment. However, designing on embedded FPGAs requires considering multiple aspects, such as the limited configurable resources on the FPGA, external memory latency, and the scheduling between memory and computation units; these considerations hinder the wider use of FPGAs. To address these issues, we present a systematic design approach that enables fast deployment, including parameterized computation and memory units that can be configured for the target platform, and an evaluation approach for searching for the optimal setting sets. To evaluate the proposed approach, we implemented an object-detection task, YOLOv2, on a PYNQ-Z1. We achieved 48.23 GOPs throughput, which is 42 and 13 times faster than performing the same inference on a CPU and a GPU, respectively, and 2.4 times faster than other published FPGA implementations. Additionally, our evaluation model deviates from the measured results by only 5-22%, which is 60% less than previous work.
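The "searching for the optimal setting sets" step can be sketched as a brute-force design-space exploration: enumerate unrolling factors, discard any setting that exceeds the DSP budget, and keep the lowest-latency survivor. The layer shape, the 220-DSP budget (the Zynq-7020 on the PYNQ-Z1), and the one-DSP-per-MAC latency model below are simplifying assumptions, not the thesis's exact formulation.

```python
import itertools
import math

M, N, R, C, K = 256, 128, 26, 26, 3   # hypothetical conv layer shape
DSP_BUDGET = 220                      # Zynq-7020 (PYNQ-Z1) DSP slice count

def cycles(Tm, Tn):
    # Latency model: Tm output and Tn input channels processed per cycle.
    return math.ceil(M / Tm) * math.ceil(N / Tn) * R * C * K * K

# Enumerate unrolling factors, keep only those fitting the DSP budget,
# then select the configuration with the lowest modeled latency.
feasible = [(Tm, Tn) for Tm, Tn in itertools.product(range(1, 65), repeat=2)
            if Tm * Tn <= DSP_BUDGET]
best = min(feasible, key=lambda t: cycles(*t))
best_cycles = cycles(*best)
```

In a full flow this resource check would cover BRAM and LUTs as well, and the latency model would include the external-memory transfer terms, so that the search rejects settings that are compute-fast but memory-bound.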

Contents

Abstract
I. Introduction
II. Preliminaries
 A. Convolutional Neural Networks for Object Detection
 B. FPGA-based CNN Processor Design
  1) Configurable logical components of FPGAs
  2) Design abstractions for CNN processors
  3) State-of-the-art FPGA-based CNN processor design
 C. External Memory Access Latency
 D. Previous Work and Proposed Issues
III. Design Methodology
 A. Modularized Computation Units
  1) Modeling computation resource
  2) Modeling execution latency
 B. Burst-Length-Oriented Memory Unit Design
  1) Data distributor and collector
  2) Modeling memory operation latency
  3) Resource utilization of the memory units
IV. Search for the Optimum Configurations
 A. Verification of the Target FPGA Resource Constraint
 B. Verification of the Latency
  1) Data reuse strategy
  2) Ping-pong manner
 C. Searching for the Unrolling and Tiling Setting Sets
V. System Implementation
 A. Top-Level Acceleration System on All Programmable SoC
 B. The Workflow of Executing a CNN Layer
VI. Experimental Results
 A. Experimental Setup and Our Design Setting Sets
 B. Comparison of Memory Unit Design Approaches
 C. Comparison with the Theoretical Results
 D. Comparison with Other Embedded Platforms
 E. Comparison with Other FPGA Designs
VII. Conclusion
VIII. Appendix 1: Design of Other Function Units
IX. Appendix 2: External Memory Data Arrangement
 A. Dynamic Memory Allocation
 B. Optimization from External Data Rearrangement
X. References

