Researcher: 石思宇 (Piercius, Freud L. Lewis)
Thesis title: 基於TVM編譯器的DNN模型效能分析及矩陣運算加速器的應用 (DNN Profiling with TVM AI Compiler and its Application to Performance Modeling of GEMM-based Accelerators)
Advisor: 劉靖家 (Liou, Jing-Jia)
Committee members: 董明智 (Tung, Ming-Chih); 黃稚存 (Huang, Chih-Tsun)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication: 2025
Academic year of graduation: 113 (ROC calendar, i.e. 2024–2025)
Language: English
Pages: 47
Keywords (Chinese): 深度學習 (deep learning), 效能分析 (performance profiling), 編譯器 (compiler), 矩陣乘法 (matrix multiplication), 加速器 (accelerator)
Keywords (English): Deep Learning, Performance Profiling, TVM Compiler, GEMM, Hardware Accelerator
Abstract (Chinese, translated): Performance analysis of deep neural networks (DNNs) is essential for understanding the behavior of individual network layers and for optimizing how they execute on hardware. This thesis proposes a profiling method based on the TVM compiler that obtains detailed per-layer performance metrics for DNN models running on matrix-multiplication (GEMM) based hardware accelerators. During compilation, we insert profiling hooks into the TensorIR (TIR) code to capture information such as operation counts, memory usage, and buffer shapes. Part of the profiling results are then fed into the ACE tool to estimate hardware performance without requiring physical hardware. We validate the approach experimentally on convolutional models (e.g., ResNet, EfficientNet) and a transformer-architecture model (GPT-2); the results clearly reveal the execution characteristics and hardware costs of different network layers. This profiling method dissects DNN workloads in detail and supports future research on scheduling strategies and model optimization for GEMM-based accelerators.
Abstract (English): Analyzing the performance of deep neural networks (DNNs) is important for understanding how different layers behave and for optimizing how they run on hardware. This thesis introduces a profiling method that uses the TVM compiler to gather detailed metrics for each layer of a DNN running on GEMM-based hardware accelerators. During compilation, profiling hooks are inserted into TensorIR (TIR) code to collect information on computations, memory usage, and buffer shapes. Some metrics are then passed into the ACE tool, which estimates hardware performance without needing real hardware. To test this approach, experiments were conducted on convolutional models such as ResNet and EfficientNet, as well as on a transformer model (GPT-2). The results show how different layers perform, highlighting runtime behavior and hardware cost. This profiling method offers an effective way to understand DNN workloads in detail and supports future research on scheduling strategies and model optimization for GEMM-based accelerators.
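To make the mechanism concrete, the sketch below shows one way static per-function metrics can be collected from lowered TensorIR using TVM's public Python API. It is a minimal illustration under assumptions, not the thesis's actual instrumentation: the `collect_metrics` helper and the toy GEMM workload are invented for demonstration, only static buffer-access sites are counted (not dynamic totals), and the interface to the ACE estimator is not shown.

```python
# Minimal sketch: walk a lowered TIR PrimFunc and tally buffer accesses
# and buffer shapes, in the spirit of the thesis's profiling hooks.
# Assumes a recent TVM with the te/tir Python bindings.
import tvm
from tvm import te, tir


def collect_metrics(func: tir.PrimFunc) -> dict:
    """Collect static load/store counts and buffer shapes from a PrimFunc."""
    metrics = {"loads": 0, "stores": 0, "buffer_shapes": {}}

    def visit(node):
        # post_order_visit walks both statements and expressions.
        if isinstance(node, tir.BufferLoad):
            metrics["loads"] += 1     # static access sites, not per-iteration
        elif isinstance(node, tir.BufferStore):
            metrics["stores"] += 1

    tir.stmt_functor.post_order_visit(func.body, visit)

    # Buffer shapes come from the function's parameter-to-buffer map.
    for _param, buf in func.buffer_map.items():
        metrics["buffer_shapes"][buf.name] = [int(s) for s in buf.shape]
    return metrics


# Toy GEMM workload (hypothetical sizes) built with tensor expressions.
M = N = K = 128
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

func = te.create_prim_func([A, B, C])  # lower to a TIR PrimFunc
print(collect_metrics(func))
```

In a full pipeline, such static counts would be scaled by loop trip counts and aggregated per fused layer before being exported in whatever format the downstream cost estimator (here, ACE) expects.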