| Field | Value |
| --- | --- |
| Author | Chang, Yuan-Ming (張元銘) |
| Title | Runtime System Supports and Frameworks for AI and OpenCL Computing |
| Advisor | Lee, Jenq-Kuen (李政崑) |
| Committee | Chiu, Ching-Te (邱瀞德); Yang, Wuu (楊武); Hwang, Yuan-Shin (黃元欣); You, Yi-Ping (遊逸平); Chen, Peng-Sheng (陳鵬升); Chou, Pai-Hsiang (周百祥) |
| Degree | Doctor of Philosophy |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Publication Year | 2021 |
| Academic Year | 109 |
| Language | English |
| Pages | 111 |
| Keywords | HSA, OpenCL, Runtime, PoCL, GPU, ONNX, NNEF, NNAPI |
In advanced programming languages, the runtime system is growing in importance. It can serve as a control engine or fallback engine that provides an execution environment for AI models and AI languages. OpenCL is now equipped with runtime systems that invoke kernel programs on GPUs and other computing engines. Runtime systems can therefore play a role in program scheduling, execution, and partitioning.
In our research work, we make several contributions to the state of the art of systems research on runtimes.
The major contributions of this dissertation are as follows.
First, in previous work, our lab members helped enable a Portable Computing Language (PoCL)-based OpenCL 1.2 runtime framework on the Heterogeneous System Architecture (HSA). In this work, we further extend the PoCL-based runtime on HSA to support OpenCL 2.0 features and pass 13 OpenCL 2.0 sample programs from the AMD SDK 3.0.
Second, our work allows NNEF models to execute inference tasks on host and Android platforms and to flexibly invoke neural networks through the Android Neural Networks API (NNAPI) to speed up inference operations. We develop an algorithm named BFSelector, based on a classic breadth-first search with cost constraints, to determine how to partition the input model. Our preliminary experimental results show that our support of NNEF on NNAPI obtains a speedup of 1.32 to 22.52 times over the baseline for API level 27 and of 4.56 to 211 times over the baseline for API level 28, where the baseline is the NNEF-to-Android platform conversion without invoking NNAPI. The experiments include AI models such as LeNet, AlexNet, MobileNet_V1, MobileNet_V2, VGG-16, and VGG-19.
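To illustrate the kind of partitioning BFSelector performs, the sketch below walks an operator graph in breadth-first (topological) order and groups consecutive operators into segments that are either delegated to an accelerator or kept on the CPU fallback. This is a minimal, hypothetical Python sketch under assumed inputs (a DAG of named operators, a set of accelerator-supported operators, and a per-operator cost table); the function name `bf_select`, the cost budget, and the segment representation are illustrative and are not the dissertation's actual implementation.

```python
from collections import deque

def bf_select(graph, supported, max_segment_cost, cost):
    """Partition an operator DAG into delegation segments via BFS.

    graph: dict node -> list of successor nodes (a DAG of NN operators)
    supported: set of operators the accelerator (e.g., NNAPI) can run
    max_segment_cost: cost budget per delegated segment
    cost: dict node -> estimated execution cost
    Returns a list of (segment_nodes, delegate) pairs, in execution order.
    """
    indegree = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1
    queue = deque(n for n, d in indegree.items() if d == 0)

    segments, current, cur_cost, cur_supported = [], [], 0, None
    while queue:
        node = queue.popleft()
        node_supported = node in supported
        # Start a new segment when support status flips, or when adding
        # this operator would exceed the delegated segment's cost budget.
        if current and (node_supported != cur_supported
                        or (node_supported
                            and cur_cost + cost[node] > max_segment_cost)):
            segments.append((current, cur_supported))
            current, cur_cost = [], 0
        current.append(node)
        cur_cost += cost[node]
        cur_supported = node_supported
        for s in graph[node]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    if current:
        segments.append((current, cur_supported))
    return segments
```

On a toy linear model where one middle operator is unsupported, the sketch yields three segments: a delegated prefix, a CPU-fallback middle, and a delegated suffix, which mirrors the accelerator/fallback split the abstract describes.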
Finally, we examine the runtime capability for scheduling dependent programs on GPU architectures. We develop a framework that analyzes neural network models (such as NNEF and ONNX) and finds patterns of dependent kernels, which can then be scheduled with OpenCL or CUDA on GPU architectures. Preliminary experimental results show that this technique improves overall performance by 8% and reduces the cache miss rate by 14% on average by combining neural network operators with appropriate memory policies.
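The dependency analysis described above can be sketched as follows: given kernels in topological order with their input and output tensors, group consecutive kernels into linear chains whenever a kernel's outputs feed exactly the next kernel and nothing else. Chained kernels are candidates for back-to-back enqueueing on one stream/queue (or operator fusion) so their intermediate tensor stays cache-resident. This is an assumed, simplified Python sketch; the function name `find_dependent_chains` and the tensor-name representation are illustrative, not the dissertation's framework.

```python
def find_dependent_chains(kernels):
    """Group kernels into linear dependency chains.

    kernels: list of (name, inputs, outputs) in topological order,
    where inputs/outputs are sets of tensor names.
    Returns a list of chains (lists of kernel names).
    """
    # Map each tensor to the kernels that consume it.
    consumers = {}
    for name, inputs, _ in kernels:
        for t in inputs:
            consumers.setdefault(t, []).append(name)

    chains, chain = [], []
    for i, (name, _, outputs) in enumerate(kernels):
        chain.append(name)
        nxt = kernels[i + 1][0] if i + 1 < len(kernels) else None
        # Extend the chain only if every output of this kernel is consumed
        # by exactly the next kernel; a fan-out or fan-in breaks the chain.
        linked = nxt is not None and all(
            consumers.get(t, []) == [nxt] for t in outputs)
        if not linked:
            chains.append(chain)
            chain = []
    return chains
```

For a conv -> relu -> pool pipeline this produces a single chain, while a kernel whose output fans out to two consumers starts a new chain, matching the intuition that only truly dependent kernels benefit from being scheduled together.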