| Field | Value |
|---|---|
| Graduate Student | Wang, Shao-Yuan (王劭元) |
| Thesis Title | A Resource Planning and Job Scheduling Co-Design Strategy for Maximizing the Performance of Heterogeneous Clusters Under a Cost Constraint (在成本約束下最大化異質化叢集性能的資源規劃與作業排程協同設計策略) |
| Advisor | Chou, Jerry (周志遠) |
| Committee Members | Lee, Che-Rung (李哲榮); Lai, Kuan-Chou (賴冠州) |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science - Department of Computer Science |
| Year of Publication | 2024 |
| Academic Year of Graduation | 112 (2023-2024) |
| Language | English |
| Number of Pages | 23 |
| Keywords (Chinese) | heterogeneous clusters; resource provisioning; cluster scheduling policy; resource allocation; deep learning training |
With the widespread adoption of artificial intelligence, training deep neural networks (DNNs) has become an important and common workload. Such training jobs are typically executed on a GPU cluster to accelerate the training process. Moreover, most schedulers treat GPUs as the primary resource and allocate host resources such as CPU cores and RAM in proportion to the number of GPUs requested. However, different DNN models have different sensitivities to host resources, which creates an opportunity to exploit this sensitivity both in the resource allocation strategy and in determining the GPU configuration. Furthermore, existing solutions are confined to homogeneous clusters, whereas most commercial clusters are heterogeneous. In this work, we present a strategy that co-designs resource planning with job scheduling to maximize the performance of a heterogeneous cluster under a cost constraint. One of our components, the GPU Planner, determines the GPU plan and produces scheduling hints that help the Scheduler make better resource allocation decisions.
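To make the planning and allocation ideas above concrete, the sketch below is a minimal, self-contained illustration rather than the thesis's actual algorithm: a brute-force stand-in for the GPU Planner that enumerates GPU configurations and keeps the highest-throughput plan within a cost budget, together with a sensitivity-weighted CPU-share hint that replaces a purely GPU-proportional host allocation. All GPU types, prices, throughputs, job names, and sensitivity values are hypothetical placeholders; a real planner would work from profiled per-model throughputs and a proper optimization formulation.

```python
from itertools import product

# Illustrative numbers only; the GPU types, prices, throughputs, and job
# sensitivities below are assumptions, not values from the thesis.
GPU_TYPES = [
    ("V100", 3.0, 1.0),   # (name, unit cost, relative throughput per GPU)
    ("A100", 6.0, 2.2),
]
BUDGET = 24.0             # total provisioning cost constraint
MAX_PER_TYPE = 8          # small search bound for the brute-force sketch


def plan_gpus(gpu_types, budget, max_per_type):
    """Enumerate GPU counts per type and keep the feasible plan with the
    highest aggregate throughput; a simple stand-in for a GPU Planner."""
    best_plan, best_perf = None, -1.0
    for counts in product(range(max_per_type + 1), repeat=len(gpu_types)):
        cost = sum(n * price for n, (_, price, _) in zip(counts, gpu_types))
        if cost > budget:
            continue
        perf = sum(n * tput for n, (_, _, tput) in zip(counts, gpu_types))
        if perf > best_perf:
            best_plan, best_perf = counts, perf
    return best_plan, best_perf


def cpu_hints(job_gpus, job_cpu_sensitivity, total_cpus):
    """Scheduling-hint sketch: instead of splitting host CPUs purely in
    proportion to requested GPUs, weight each job's share by how sensitive
    its training throughput is to extra CPU cores."""
    weights = {j: job_gpus[j] * job_cpu_sensitivity[j] for j in job_gpus}
    total = sum(weights.values())
    return {j: total_cpus * w / total for j, w in weights.items()}


if __name__ == "__main__":
    plan, perf = plan_gpus(GPU_TYPES, BUDGET, MAX_PER_TYPE)
    for (name, _, _), n in zip(GPU_TYPES, plan):
        print(f"{name}: {n} GPUs")
    print(f"estimated aggregate throughput: {perf:.1f}")

    # Two hypothetical jobs: one whose input pipeline benefits strongly
    # from extra cores, and one that barely does.
    hints = cpu_hints(
        job_gpus={"job_a": 4, "job_b": 4},
        job_cpu_sensitivity={"job_a": 2.0, "job_b": 0.5},
        total_cpus=64,
    )
    print(hints)
```

The resulting hint dictionary mirrors the point made in the abstract: job_a, which benefits more from extra cores, receives a larger CPU share than job_b even though both request the same number of GPUs.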