
Graduate Student: Piercius, Freud L. Lewis (石思宇)
Thesis Title: DNN Profiling with TVM AI Compiler and its Application to Performance Modeling of GEMM-based Accelerators
Advisor: Liou, Jing-Jia (劉靖家)
Committee Members: Tung, Ming-Chih (董明智); Huang, Chih-Tsun (黃稚存)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2025
Academic Year: 113 (ROC calendar; 2024-2025)
Language: English
Pages: 47
Keywords: Deep Learning, Performance Profiling, TVM Compiler, GEMM, Hardware Accelerator
Abstract:

    Analyzing the performance of deep neural networks (DNNs) is important for understanding how different layers behave and for optimizing how they run on hardware. This thesis introduces a profiling method that uses the TVM compiler to gather detailed metrics for each layer of a DNN running on GEMM-based hardware accelerators. During compilation, profiling hooks are inserted into the TensorIR (TIR) code to collect information on computation counts, memory usage, and buffer shapes. Selected metrics are then passed to the ACE tool, which estimates hardware performance without requiring real hardware. To validate this approach, experiments were conducted on convolutional models such as ResNet and EfficientNet, as well as on a transformer model (GPT-2). The results show how different layers perform, highlighting runtime behavior and hardware cost. This profiling method offers an effective way to understand DNN workloads in detail and supports future research on scheduling strategies and model optimization for GEMM-based accelerators.
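
    As a concrete illustration of the instrumentation idea described in the abstract, the following is a minimal sketch in Python against TVM's public API. It is not the thesis's actual pass: the pass name, the metric names, and the toy GEMM workload are all illustrative, and it assumes a TVM version that still provides the classic te.create_schedule/tvm.lower flow. The pass visits each lowered PrimFunc and records buffer allocation shapes and static loop extents, a rough stand-in for the computation-count and memory metrics the abstract mentions.

    import tvm
    from tvm import te, tir

    # Illustrative analysis pass: runs on every lowered PrimFunc and gathers
    # simple per-function metrics (allocated buffer shapes, static loop extents).
    @tvm.tir.transform.prim_func_pass(opt_level=0)
    def collect_metrics(func, mod, ctx):
        metrics = {"alloc_shapes": [], "loop_extent_product": 1}

        def visit(node):
            if isinstance(node, tir.Allocate):
                # Shape of each scratch buffer allocated by the lowered code.
                metrics["alloc_shapes"].append(
                    [int(e) for e in node.extents if isinstance(e, tir.IntImm)])
            elif isinstance(node, tir.For) and isinstance(node.extent, tir.IntImm):
                # Product of static loop extents: a crude proxy for the number of
                # innermost-statement executions (e.g., MACs for a GEMM loop nest).
                metrics["loop_extent_product"] *= int(node.extent)

        tir.stmt_functor.post_order_visit(func.body, visit)
        print(metrics)  # in a real flow these would be logged per layer
        return func  # analysis only; the TIR is returned unchanged

    # Toy GEMM workload so the pass has something to inspect.
    M = N = K = 64
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    # Attach the pass at the end of lowering via the standard tir.add_lower_pass hook.
    with tvm.transform.PassContext(config={"tir.add_lower_pass": [(3, collect_metrics)]}):
        tvm.lower(te.create_schedule(C.op), [A, B, C])

    For this toy workload the loop-extent product equals M x N x K = 262,144, the GEMM's MAC count. Note that the thesis's instrumentation goes further than this sketch: per Figures 3.3-3.7 it also maps TIR blocks back to the originating model layers and injects profiling hooks into the TIR itself, whereas this pass only observes.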

Table of Contents:

    Abstract
    Abstract (Chinese)
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1 Introduction
        1.1 Problem Statement and Motivation
        1.2 Thesis Objectives
        1.3 Scope, Contributions, and Limitations
        1.4 Thesis Structure
    2 Background
        2.1 Deep Neural Networks (DNNs)
            2.1.1 Architecture and Components
            2.1.2 Computational Challenges in DNN Inference
        2.2 Hardware Accelerators for ML Inference
        2.3 TVM and Its Compilation Stack
        2.4 Profiling Techniques for ML Models
            2.4.1 Static vs. Dynamic Profiling
            2.4.2 ACE Tool
        2.5 Limitations of Existing Profiling Tools
    3 Methodology
        3.1 Approach Overview
        3.2 TVM Compilation and TIR Instrumentation
            3.2.1 TVM Compilation
            3.2.2 TIR Instrumentation
            3.2.3 Metrics Profiled
        3.3 Assumed System Architecture
            3.3.1 PE Array Emulation Using CUDA Tensor Cores
            3.3.2 GEMM Scheduling with TVM for Tensor Cores
    4 Experimental Results
        4.1 Experiment Results
            4.1.1 ResNet-18
            4.1.2 EfficientNet-B0
            4.1.3 GPT-2
        4.2 Limitations and Challenges
    5 Conclusion
    Bibliography
    A More Profiling Results
        A.1 ResNet-50 Results

List of Figures:

    2.1 ACE flow. Inputs: hardware configuration files, workload dimensions, and tiling factors. Output: estimated performance metrics log.
    3.1 Approach overview
    3.2 Typical TVM compilation flow
    3.3 Lowering to TIR without layer mapping
    3.4 Lowering to TIR with stage mapping
    3.5 Modified TVM compilation flow with profiling instrumentation
    3.6 Example of block structure in TIR
    3.7 Injected profiling hooks in TIR
    3.8 Overview of the assumed systolic-array-based system architecture
    3.9 WMMA Tensor Core dataflow
    4.1 Number of MACs per layer under different padding configurations
    4.2 Parameter size and allocated memory per layer under different padding configurations
    4.3 DRAM reads per layer under different padding configurations
    4.4 DRAM writes per layer under different padding configurations
    4.5 SRAM memory used per layer
    4.6 Latency estimated under different padding configurations
    4.7 Energy estimated under different padding configurations
    4.8 Energy Delay Product estimated under different padding configurations
    4.9 Number of MACs per layer under different padding configurations
    4.10 Parameter size and allocated memory per layer under different padding configurations
    4.11 DRAM reads per layer under different padding configurations
    4.12 DRAM writes per layer under different padding configurations
    4.13 Latency estimated under different padding configurations
    4.14 Energy estimated under different padding configurations
    4.15 Energy Delay Product estimated under different padding configurations
    4.16 Number of MACs per layer under different sequence lengths
    4.17 Parameter size and allocated memory per layer under different sequence lengths
    4.18 DRAM reads per layer under different sequence lengths
    4.19 DRAM writes per layer under different sequence lengths
    4.20 Latency estimated under different sequence lengths
    4.21 Energy estimated under different sequence lengths
    4.22 Energy Delay Product estimated under different sequence lengths
    A.1 Number of MACs per layer under different padding configurations
    A.2 Parameter size and allocated memory per layer under different padding configurations
    A.3 DRAM reads per layer under different padding configurations
    A.4 DRAM writes per layer under different padding configurations
    A.5 Latency estimated under different padding configurations
    A.6 Energy estimated under different padding configurations
    A.7 Energy Delay Product estimated under different padding configurations

List of Tables:

    2.1 Computational complexity (in MAC operations) for key deep learning operations
    3.1 Profiling Metrics Extracted from TensorIR

