Student: Huang, Chen (黃晨)
Thesis Title: A job scheduling and resource allocation algorithm for shared and reconfigurable GPUs (探討共享與可動態重構GPU的工作排程及資源分配方法)
Advisor: Chou, Jerry (周志遠)
Committee Members: Lee, Che-Rung (李哲榮); Lai, Kuan-Chou (賴冠州)
Degree: Master
Department: Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduation Academic Year: 112 (2023-2024)
Language: English
Number of Pages: 30
Keywords: Multi-Instance-GPU, GPU cluster scheduling, job scheduling
Abstract:

    In recent years, GPU-accelerated workloads, especially deep learning, have become increasingly popular in both academia and industry. However, GPUs are expensive to use because of their high manufacturing cost and power consumption. As a result, many GPU sharing mechanisms and GPU cluster scheduling algorithms have been proposed to improve GPU utilization and efficiency. Among them, NVIDIA introduced Multi-Instance GPU (MIG) [5], a GPU sharing feature supported in both hardware and software that lets users partition a GPU into separate, reconfigurable instances. In this thesis, we aim to improve the utilization of a MIG GPU cluster. Based on the characteristics of MIG GPUs, we propose a job scheduling and resource allocation algorithm that maximizes the cluster's current utilization while allocating resources in a way that also benefits future GPU reconfiguration. We also conduct simulation experiments to evaluate the effectiveness of our algorithm.
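    To make the allocation problem concrete, the following is a minimal sketch, in Python, of packing jobs onto MIG-style GPU slices with a simple best-fit heuristic. It assumes A100-like constraints (7 compute slices per GPU; instances of 1, 2, 3, 4, or 7 slices) and ignores MIG's placement rules and reconfiguration costs; the GPU, Job, and schedule names are hypothetical, and the heuristic is only a stand-in, not the algorithm proposed in the thesis.

    # Illustrative sketch only: greedy best-fit packing of jobs onto MIG-style
    # GPU slices. Assumes an A100-like GPU with 7 slices and instance sizes of
    # 1, 2, 3, 4, or 7 slices; real MIG placement is further constrained
    # (see the NVIDIA MIG user guide [5]).
    from dataclasses import dataclass, field
    from typing import List

    SLICES_PER_GPU = 7
    VALID_INSTANCE_SIZES = {1, 2, 3, 4, 7}

    @dataclass
    class GPU:
        gpu_id: int
        free_slices: int = SLICES_PER_GPU
        instances: List[int] = field(default_factory=list)  # sizes of created instances

    @dataclass
    class Job:
        job_id: int
        slices: int  # requested MIG instance size

    def schedule(jobs: List[Job], gpus: List[GPU]) -> List[str]:
        """Place larger jobs first; put each job on the feasible GPU with the
        least remaining capacity (best fit), otherwise leave it queued."""
        log = []
        for job in sorted(jobs, key=lambda j: j.slices, reverse=True):
            if job.slices not in VALID_INSTANCE_SIZES:
                raise ValueError(f"job {job.job_id}: invalid instance size {job.slices}")
            candidates = [g for g in gpus if g.free_slices >= job.slices]
            if not candidates:
                log.append(f"job {job.job_id} ({job.slices} slices): queued, no capacity")
                continue
            best = min(candidates, key=lambda g: g.free_slices)
            best.free_slices -= job.slices
            best.instances.append(job.slices)
            log.append(f"job {job.job_id} ({job.slices} slices) -> GPU {best.gpu_id}")
        return log

    if __name__ == "__main__":
        cluster = [GPU(0), GPU(1)]
        queue = [Job(1, 3), Job(2, 2), Job(3, 4), Job(4, 1), Job(5, 7)]
        for line in schedule(queue, cluster):
            print(line)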

Table of Contents:
    1 Introduction
    2 Background
      2.1 General scheduling problems for GPU clusters
        2.1.1 Job scheduling: ordering
        2.1.2 Resource allocation: amount
        2.1.3 Resource allocation: placement
      2.2 Multi-Instance-GPU (MIG)
    3 Motivation
      3.1 Shared PCIe lane
      3.2 Tree structure
      3.3 Minimal reconfiguration time (MRT)
    4 Solution
      4.1 Problem statement and assumptions
      4.2 Job scheduling: ordering
      4.3 Resource allocation: placement
    5 Evaluation
      5.1 Setup
      5.2 Baseline
        5.2.1 Resource allocation: placement
        5.2.2 Job scheduling: ordering
      5.3 Metrics
      5.4 Effect of job placement algorithms
      5.5 Effect of job ordering methods
    6 Conclusion
    References
    Appendix A. Max utilization proof

References:
    [1] Gu, J., Chowdhury, M., Shin, K. G., Zhu, Y., Jeon, M., Qian, J., Liu, H., and Guo, C. Tiresias: A GPU cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19) (Boston, MA, Feb. 2019), USENIX Association, pp. 485–500.
    [2] Li, B., Patel, T., Samsi, S., Gadepally, V., and Tiwari, D. MISO: Exploiting multi-instance GPU capability on multi-tenant GPU clusters. In Proceedings of the 13th Symposium on Cloud Computing (New York, NY, USA, 2022), SoCC '22, Association for Computing Machinery, pp. 173–189.
    [3] Mahajan, K., Balasubramanian, A., Singhvi, A., Venkataraman, S., Akella, A., Phanishayee, A., and Chawla, S. Themis: Fair and efficient GPU cluster scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20) (Santa Clara, CA, Feb. 2020), USENIX Association, pp. 289–304.
    [4] Narayanan, D., Santhanam, K., Kazhamiaka, F., Phanishayee, A., and Zaharia, M. Heterogeneity-aware cluster scheduling policies for deep learning workloads. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (Nov. 2020), USENIX Association, pp. 481–498.
    [5] NVIDIA. NVIDIA MIG. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html.
    [6] NVIDIA. NVIDIA MPS. https://docs.nvidia.com/deploy/mps/index.html.
    [7] Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference (New York, NY, USA, 2018), EuroSys '18, Association for Computing Machinery.
    [8] Tan, C., Li, Z., Zhang, J., Cao, Y., Qi, S., Liu, Z., Zhu, Y., and Guo, C. Serving DNN models with multi-instance GPUs: A case of the reconfigurable machine scheduling problem. arXiv preprint arXiv:2109.11067 (2021).
    [9] Xiao, W., Bhardwaj, R., Ramjee, R., Sivathanu, M., Kwatra, N., Han, Z., Patel, P., Peng, X., Zhao, H., Zhang, Q., Yang, F., and Zhou, L. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (Carlsbad, CA, Oct. 2018), USENIX Association, pp. 595–610.
