
Author: 吳宗憲 (Wu, Tsung-Hsien)
Thesis title: 在TVM中支持異質平台上的彈性平行計算 (Supporting Flexible Parallel Computing on Heterogeneous Platform in TVM)
Advisor: 金仲達 (King, Chung-Ta)
Committee members: 黃稚存 (Huang, Chih-Tsun), 董明智 (Tung, Ming-Chih)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2022
Graduating academic year: 110 (2021-2022)
Language: English
Number of pages: 39
Keywords (Chinese): 編譯器 (compiler), 深度學習 (deep learning), 異質平台 (heterogeneous platform), 平行計算 (parallel computing)
Keywords (English): Compiler, Deep learning, Heterogeneous platform, Parallel computing
TVM is a compiler framework that supports many machine learning frameworks and different types of processors/accelerators. It can optimize deep neural networks and generate high-performance code to run on various kinds of hardware devices. Although TVM supports computation on heterogeneous platforms, it can currently only dispatch the code to heterogeneous processors for sequential execution. Moreover, TVM's schedule is static and cannot adapt dynamically to the load on each processor. In this thesis, we modify TVM to support dynamic parallel heterogeneous computing. The generated code can dispatch computations to multiple heterogeneous processors for simultaneous execution and can flexibly allocate computations to different processors at run time. We compile GoogLeNet with the modified TVM, run it on a PC and an embedded system with different scheduling and allocation strategies, and report and compare the resulting performance to demonstrate the modified TVM's capabilities for parallel execution and flexible scheduling.


TVM is a compiler framework that supports many machine learning frameworks and hardware backends. It can optimize deep neural networks and generate efficient code to execute on different kinds of backend devices. Although TVM supports computation on heterogeneous platforms, so far it can only schedule the code to execute on the heterogeneous devices serially. Furthermore, the schedule is static and thus cannot adapt to the dynamic loads of the devices. In this work, we modify TVM to support dynamic parallel heterogeneous computing, in which computations can be scheduled to execute simultaneously on multiple heterogeneous devices and the allocation of computations to backend devices can be done flexibly at run time. We demonstrate the parallel execution and flexible scheduling capabilities of the modified TVM by compiling GoogLeNet to run on a PC and an embedded system and comparing its performance under different scheduling strategies.
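As context for the compilation flow described above, the following is a minimal sketch of how a model is compiled and run with stock TVM for a single device, assuming a recent TVM Python API (relay and graph_executor) and a hypothetical ONNX export of GoogLeNet named googlenet.onnx with an input tensor "data" of shape (1, 3, 224, 224); the dynamic parallel scheduling contributed by this thesis is not part of this stock flow.

    # Minimal sketch: compile a model with stock TVM and run it on a single device.
    # Assumes a recent TVM release and a hypothetical ONNX file "googlenet.onnx".
    import onnx
    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    onnx_model = onnx.load("googlenet.onnx")     # hypothetical model path
    shape_dict = {"data": (1, 3, 224, 224)}      # assumed input name and shape
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    # Stock TVM builds the whole graph for one target (here the CPU) and runs it serially;
    # the modified TVM described above instead spreads operators across devices at run time.
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)

    dev = tvm.cpu(0)
    module = graph_executor.GraphModule(lib["default"](dev))
    module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
    module.run()
    output = module.get_output(0).numpy()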

Table of Contents
    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    Chapter 1  Introduction
    Chapter 2  Related Works and Background
    Chapter 3  System Design
    Chapter 4  Method
        4.1  Find Parallel Operators
        4.2  Scheduling Methods
            4.2.1  Schedule by operator
            4.2.2  Schedule by path
    Chapter 5  Experiment
        5.1  Experiment Setup
        5.2  Evaluation
            5.2.1  Result in PC
            5.2.2  Result in embedded system
    Chapter 6  Conclusions
    References
    Appendix A
        A.1  Details of the schedule results
        A.2  Execution time of each block in our methods in embedded system
        A.3  Information of each operator in GoogLeNet
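Section 4.1 ("Find Parallel Operators") locates operators that have no data dependence on one another and can therefore run simultaneously on different devices. As an illustration only, assuming the computation graph is given as a simple predecessor map (this is not the thesis's algorithm or TVM code), one way to group operators into such parallel sets is:

    # Illustration only: group operators of a computation graph into dependency levels.
    # Operators in the same level have no data dependence on each other and could,
    # in principle, run in parallel on different devices.
    from collections import defaultdict

    def parallel_levels(deps):
        # deps: dict mapping operator name -> list of predecessor operator names
        level = {}
        def depth(op):
            if op not in level:
                preds = deps.get(op, [])
                level[op] = 0 if not preds else 1 + max(depth(p) for p in preds)
            return level[op]
        for op in deps:
            depth(op)
        groups = defaultdict(list)
        for op, lv in level.items():
            groups[lv].append(op)
        return [groups[lv] for lv in sorted(groups)]

    # Toy Inception-style block: four branches reading the same input form one level.
    deps = {
        "input": [],
        "conv1x1": ["input"], "conv3x3": ["input"], "conv5x5": ["input"], "pool": ["input"],
        "concat": ["conv1x1", "conv3x3", "conv5x5", "pool"],
    }
    print(parallel_levels(deps))
    # [['input'], ['conv1x1', 'conv3x3', 'conv5x5', 'pool'], ['concat']]

For a GoogLeNet Inception block, the branches that read the same input fall into the same level, which is the kind of concurrency the scheduling strategies above are meant to exploit.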

