| Author: | 黃晨 (Huang, Chen) |
|---|---|
| Thesis title: | 探討共享與可動態重構GPU的工作排程及資源分配方法 (A job scheduling and resource allocation algorithm for shared and reconfigurable GPUs) |
| Advisor: | 周志遠 (Chou, Jerry) |
| Committee members: | 李哲榮 (Lee, Che-Rung), 賴冠州 (Lai, Kuan-Chou) |
| Degree: | Master |
| Department: | Department of Computer Science, College of Electrical Engineering and Computer Science |
| Year of publication: | 2024 |
| Academic year of graduation: | 112 (ROC calendar) |
| Language: | English |
| Number of pages: | 30 |
| Keywords: | Multi-Instance-GPU, GPU cluster scheduling, job scheduling |
In recent years, GPU-accelerated workloads, especially deep learning, have become increasingly popular in both academia and industry. However, GPUs are expensive to use because of their high manufacturing cost and power consumption. Many GPU sharing mechanisms and GPU cluster scheduling algorithms have therefore been proposed to improve GPU utilization and efficiency. Among them, NVIDIA's Multi-Instance GPU (MIG) [5] is a GPU sharing feature supported in both hardware and software: it lets users partition a GPU into multiple isolated, reconfigurable instances. In this thesis, we aim to improve the utilization of a MIG GPU cluster. Based on the characteristics of MIG GPUs, we propose a job scheduling and resource allocation algorithm that maximizes the cluster's current utilization while allocating resources in a way that also benefits future GPU reconfiguration. We further conduct simulation experiments to evaluate the effectiveness of the algorithm.
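To make the resource model concrete, the Python sketch below illustrates a best-fit placement heuristic over MIG instance profiles. It is a hypothetical illustration of the kind of allocation the abstract describes, not the thesis's actual algorithm: the profile names and slice counts follow NVIDIA's documented A100 40GB MIG profiles, but the flat seven-slice pool abstraction (real MIG also constrains *where* each profile may be placed on a GPU), the scoring rule, and all identifiers (`GPU`, `place_job`) are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass

# NVIDIA A100 40GB MIG profiles and the number of compute slices each uses.
# The slice counts match NVIDIA's documentation; the rest of this sketch is
# a simplification (real MIG also restricts where a profile may sit).
PROFILE_SLICES = {
    "1g.5gb": 1,
    "2g.10gb": 2,
    "3g.20gb": 3,
    "4g.20gb": 4,
    "7g.40gb": 7,
}
TOTAL_SLICES = 7  # compute slices per A100


@dataclass
class GPU:
    gpu_id: int
    used_slices: int = 0

    def free_slices(self) -> int:
        return TOTAL_SLICES - self.used_slices


def place_job(gpus: list[GPU], profile: str) -> GPU | None:
    """Best-fit placement: among GPUs with enough room, pick the one that
    would be left with the fewest free slices. Packing jobs tightly keeps
    other GPUs empty, preserving large contiguous capacity for future
    reconfiguration."""
    need = PROFILE_SLICES[profile]
    candidates = [g for g in gpus if g.free_slices() >= need]
    if not candidates:
        return None  # no GPU can host this profile now; the job waits
    best = min(candidates, key=lambda g: g.free_slices() - need)
    best.used_slices += need
    return best


if __name__ == "__main__":
    cluster = [GPU(i) for i in range(2)]
    for prof in ["3g.20gb", "2g.10gb", "4g.20gb", "1g.5gb"]:
        gpu = place_job(cluster, prof)
        print(prof, "->", f"GPU {gpu.gpu_id}" if gpu else "queued")
```

Packing jobs onto as few GPUs as possible is one plausible reading of "benefits future reconfiguration", since it leaves whole GPUs free to be repartitioned into large profiles later; the thesis may score reconfigurability differently.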
[1] Gu, J., Chowdhury, M., Shin, K. G., Zhu, Y., Jeon, M., Qian, J., Liu, H., and Guo, C. Tiresias: A GPU cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19) (Boston, MA, Feb. 2019), USENIX Association, pp. 485–500.
[2] Li, B., Patel, T., Samsi, S., Gadepally, V., and Tiwari, D. MISO: Exploiting multi-instance GPU capability on multi-tenant GPU clusters. In Proceedings of the 13th Symposium on Cloud Computing (New York, NY, USA, 2022), SoCC '22, Association for Computing Machinery, pp. 173–189.
[3] Mahajan, K., Balasubramanian, A., Singhvi, A., Venkataraman, S., Akella, A., Phanishayee, A., and Chawla, S. Themis: Fair and efficient GPU cluster scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20) (Santa Clara, CA, Feb. 2020), USENIX Association, pp. 289–304.
[4] Narayanan, D., Santhanam, K., Kazhamiaka, F., Phanishayee, A., and Zaharia, M. Heterogeneity-aware cluster scheduling policies for deep learning workloads. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (Nov. 2020), USENIX Association, pp. 481–498.
[5] NVIDIA. NVIDIA MIG. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html.
[6] NVIDIA. NVIDIA MPS. https://docs.nvidia.com/deploy/mps/index.html.
[7] Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference (New York, NY, USA, 2018), EuroSys '18, Association for Computing Machinery.
[8] Tan, C., Li, Z., Zhang, J., Cao, Y., Qi, S., Liu, Z., Zhu, Y., and Guo, C. Serving DNN models with multi-instance GPUs: A case of the reconfigurable machine scheduling problem. arXiv preprint arXiv:2109.11067 (2021).
[9] Xiao, W., Bhardwaj, R., Ramjee, R., Sivathanu, M., Kwatra, N., Han, Z., Patel, P., Peng, X., Zhao, H., Zhang, Q., Yang, F., and Zhou, L. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (Carlsbad, CA, Oct. 2018), USENIX Association, pp. 595–610.