
Graduate Student: Chang, Yu-Jen (張又仁)
Thesis Title: Reconfigurable Dataflow and Tile Shape Optimization in SystemC-Based NoC of 1-D PE Array Simulators Integrated with TVM (基於SystemC的晶片上網路一維處理單元陣列模擬器之可重構數據流、分塊優化及TVM整合)
Advisor: Chiu, Ching-Te (邱瀞德)
Committee Members: Lee, Jenq-Kuen (李政崑); Van, Lan-Da (范倫達)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of Publication: 2024
Academic Year of Graduation: 113
Language: English
Number of Pages: 47
Keywords: convolutional neural network, hardware accelerator simulator, network on chip, reconfigurable dataflow
    Convolutional Neural Networks (CNNs) are highly effective for image classification, denoising, and object detection, but their computational efficiency can be significantly improved with accelerators. Traditional accelerators usually adopt a fixed dataflow, which may not suit every CNN layer. The 1-D PE array architecture supports flexible dataflows, but its scalability remains a major issue. In addition, current Design Space Exploration (DSE) tools focus mainly on 2-D mesh PE array or systolic array architectures and neglect the exploration of Network-on-Chip (NoC) architectures for 1-D PE arrays. These tools also lack support for per-layer reconfigurable dataflows and tile sizes, and usually require rewriting DNN models into specific configuration-file formats, which is both inconvenient and inefficient.
    To address these limitations, we propose a novel SystemC-based NoC 1-D PE array simulator integrated with TVM. Our simulator supports flexible dataflow and tile size configurations, enabling efficient performance estimation for neural network models. The NoC architecture allows configurations of various sizes and shapes and provides a nearly cycle-accurate performance estimation model. By adding a custom OpStrategy, custom compute, and custom C runtime, TVM is integrated with the NoC simulator to automate the performance estimation process and to optimize the tiling and dataflow strategy of each layer. This approach supports various neural network development frameworks, improving the usability and efficiency of the proposed method.
    Experimental results demonstrate the effectiveness of our approach, showing a 1.84x speedup over existing tools such as Scalesim [1]. Our simulator evaluates the performance of different dataflows and tile sizes on VGG16 and achieves significant improvements in latency and resource utilization. The proposed method offers a scalable and flexible solution for DNN accelerator design and addresses the shortcomings of existing approaches.


    Convolutional Neural Networks (CNNs) are highly effective for image classification [2], denoising [?, ?], and object detection [3], but their computational efficiency can be significantly enhanced using accelerators. Traditional accelerators often employ fixed dataflows, which may not be optimal for all CNN layers. The 1-D PE array architecture supports flexible dataflows, but scalability remains a significant issue. Moreover, current Design Space Exploration (DSE) tools focus on 2D-mesh PE arrays or systolic array architectures, neglecting the exploration of Network-on-Chip (NoC) architectures for 1-D PE arrays. These tools also lack support for reconfigurable dataflows and tile sizes per layer, often requiring the rewriting of DNN models into specific configurations, which is both inconvenient and inefficient.
    To address these limitations, we propose a novel NoC-based 1-D PE array simulator based on SystemC integrated with TVM. Our simulator supports flexible dataflow and tile size configurations, enabling efficient performance estimation for DNN models. The NoC architecture allows for various sizes and shapes of configurations, providing a nearly cycle-accurate model for performance estimation. By adding custom Opstrategy, custom compute, and custom C runtime, TVM is integrated with the NoC simulator to automate the performance estimation process, optimizing tiling and dataflow strategies for each layer. This integration supports various DNN frameworks, enhancing the usability and efficiency of the proposed method.
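    To make the TVM-integration idea concrete, the sketch below shows how per-layer tile sizes and a dataflow choice could be exposed to AutoTVM as tunable knobs. It is a minimal illustration under stated assumptions, not the thesis implementation: the template name "noc_sim/conv2d_nchw", the knob names, and the candidate factor lists are hypothetical, and in the actual flow the chosen configuration would be forwarded to the SystemC NoC simulator (e.g., through the custom C runtime) rather than only shaping a CPU schedule.

        # Minimal AutoTVM-style sketch; template name, knob names, and factor
        # lists are illustrative assumptions, not the thesis's actual template.
        from tvm import te, topi, autotvm

        @autotvm.template("noc_sim/conv2d_nchw")  # hypothetical template name
        def conv2d_tile_dataflow(N, CI, H, W, CO, KH, KW):
            data = te.placeholder((N, CI, H, W), name="data")
            kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
            conv = topi.nn.conv2d_nchw(data, kernel, 1, 0, 1)  # stride=1, pad=0, dilation=1
            s = te.create_schedule(conv.op)

            cfg = autotvm.get_config()
            # Per-layer tile sizes become tunable knobs.
            cfg.define_knob("tile_oc", [4, 8, 16, 32])
            cfg.define_knob("tile_ow", [2, 4, 7, 14])
            # The dataflow choice is also a knob; in a simulator-backed flow this
            # value would be written into the accelerator configuration registers.
            cfg.define_knob("dataflow", ["weight_stationary", "output_stationary"])

            # Apply the tile sizes to the output loops so that each candidate in
            # the search space corresponds to one tiling of the layer.
            n, oc, oh, ow = s[conv].op.axis
            oc_outer, oc_inner = s[conv].split(oc, factor=cfg["tile_oc"].val)
            ow_outer, ow_inner = s[conv].split(ow, factor=cfg["tile_ow"].val)
            s[conv].reorder(n, oc_outer, ow_outer, oh, oc_inner, ow_inner)
            return s, [data, kernel, conv]

    AutoTVM can then sweep this search space layer by layer, with each candidate measured on the NoC simulator rather than on real hardware, consistent with the automated performance estimation described above.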
    Experimental results demonstrate the effectiveness of our approach, showing a 1.84x speedup compared to existing tools like Scalesim [1]. Our simulator evaluates the performance of different dataflows and tile sizes on VGG16, achieving significant improvements in latency and resource utilization. The proposed method offers a scalable and flexible solution for DNN accelerator design, addressing the shortcomings of existing approaches.

    Abstract (Chinese)
    Abstract
    1 Introduction
      1.1 Motivation
      1.2 Goal
      1.3 Contribution
      1.4 Thesis Organization
    2 Related Work
      2.1 1-D PE array
      2.2 Design Space Exploration
        2.2.1 GAMMA [4]
        2.2.2 AutoDNNchip [5]
        2.2.3 STONNE [6]
    3 Proposed Method
      3.1 Overall System Flow
      3.2 SystemC model
        3.2.1 NoC model
      3.3 Memory Map
      3.4 Main Controller
        3.4.1 Configuration Register
        3.4.2 Central controller
        3.4.3 PE array state registers
        3.4.4 PE array controller
        3.4.5 SRAM loading controller
      3.5 Tiling and dataflow
        3.5.1 Tiling
        3.5.2 Dataflow
      3.6 TLM 2.0 Protocol
        3.6.1 TLM 2.0 generic payload
        3.6.2 TLM 2.0 Blocking Transport
      3.7 TVM Integration
        3.7.1 Custom Opstrategy
        3.7.2 Custom Compute
        3.7.3 Custom C runtime
        3.7.4 AutoTVM
    4 Experiment
      4.1 Scalability
        4.1.1 Experiment Set Up
        4.1.2 Experiment Result
      4.2 Analysis of Different NoC Topologies with Equal PE Count
        4.2.1 Experiment Set Up
        4.2.2 Experiment Result
      4.3 Tile shape and subtile shape comparison
        4.3.1 Experiment Set Up
        4.3.2 Experiment Result
      4.4 Utilization and bandwidth analysis
        4.4.1 Experiment Set Up
        4.4.2 Experiment Result
        4.4.3 Conclusion
      4.5 Reconfigurable dataflow and subtile size
        4.5.1 Experiment Set Up
        4.5.2 Experiment Result
      4.6 Comparison
    5 Conclusion
    References

    [1] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “SCALE-Sim: Systolic CNN accelerator simulator,” arXiv preprint arXiv:1811.02883, 2018.
    [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
    [3] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” arXiv preprint arXiv:2207.02696, 2022.
    [4] S.-C. Kao and T. Krishna, “GAMMA: Automating the HW mapping of DNN models on accelerators via genetic algorithm,” ICCAD, 2020.
    [5] P. Xu, X. Zhang, C. Hao, Y. Zhao, Y. Zhang, Y. Wang, C. Li, Z. Guan, D. Chen, and Y. Lin, “AutoDNNchip: An automated DNN chip predictor and builder for both FPGAs and ASICs,” Int’l Symp. on Field-Programmable Gate Arrays (FPGA), 2020.
    [6] F. Muñoz-Martínez, J. L. Abellán, M. E. Acacio, and T. Krishna, “STONNE: Enabling cycle-level microarchitectural simulation for DNN inference accelerators,” in 2021 IEEE International Symposium on Workload Characterization (IISWC), 2021.
    [7] Y.-F. Chen, Y.-J. Chang, C.-T. Chiu, et al., “Low DRAM memory access and flexible dataflow convolutional neural network accelerator based on RISC-V custom instruction,” IEEE International Symposium on Circuits and Systems (ISCAS), 2024.
    [8] Z. Jin, M. Z. Iqbal, D. Bobkov, et al., “A flexible deep CNN framework for image restoration,” IEEE Transactions on Multimedia, 2020.
    [9] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and A. Parashar, “MAESTRO: A data-centric approach to understand reuse, performance, and hardware cost of DNN mappings,” IEEE Micro, vol. 40, no. 3, pp. 20–29, 2020.
    [10] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A systematic approach to dnn accelerator evaluation,” IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019.
    [11] J. F. et al., “An energy-efficient GEMM-based convolution accelerator with on-the-fly im2col,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2023.
    [12] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting vision processing closer to the sensor,” ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 92–104, 2015.
    [13] F. Sijstermans, “The nvidia deep learning accelerator,” In Hot Chips, 2018.
    [14] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
    [15] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
    [16] J. Gao, Q. Shao, F. Deng, et al., “An NoC-based CNN accelerator for edge computing,” IEEE 15th International Conference on ASIC (ASICON), 2023.
    [17] R. Xu, S. Ma, Y. Wang, et al., “HESA: Heterogeneous systolic array architecture for compact CNNs hardware accelerators,” Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021.
    [18] N. P. Jouppi, C. Young, et al., “In-datacenter performance analysis of a tensor processing unit,” ISCA ’17: Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12, 2017.
    [19] M. H. et al., “ReDas: A lightweight architecture for supporting fine-grained reshaping and multiple dataflows on systolic array,” IEEE Transactions on Computers, 2024.
    [20] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim,” in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 58–68, IEEE, 2020.
    [21] H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects,” ASPLOS, vol. 56, pp. 461–475, 2018.
    [22] W. Z. et al., “Activation in network for NoC-based deep neural network accelerator,” VLSI-TSA, 2024.
    [23] C.-Y. Du, C.-F. Tsai, W.-C. Chen, L.-Y. Lin, N.-S. Chang, C.-P. Lin, C.-S. Chen, and C.-H. Yang, “A 28nm 11.2TOPS/W hardware-utilization-aware neural-network accelerator with dynamic dataflow,” IEEE International Solid-State Circuits Conference (ISSCC), pp. 1–3, 2023.
    [24] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training,” IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.
    [25] L. Huang and J. Walrand, “A Benes packet network,” arXiv preprint arXiv:1208.0561, 2012.
    [26] V. E. Benes, “Optimal rearrangeable multistage connecting networks,” Bell System Technical Journal, 1964.
    [27] M. Arjomand et al., “Performance evaluation of butterfly on-chip network for MPSoCs,” IEEE International SoC Design Conference, 2008.
    [28] Y. S. Shao, J. Clemons, R. Venkatesan, et al., “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” IEEE/ACM International Symposium on Microarchitecture, 2019.
    [29] L. M. et al., “ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators,” IEEE Transactions on Computers, 2021.
    [30] E. R. et al., “Multiobjective end-to-end design space exploration of parameterized DNN accelerators,” IEEE Internet of Things Journal, 2023.
    [31] T. C. et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018.
    [32] Apache TVM, “Apache TVM documentation,” 2024. Accessed: 2024-08-10.
