
Graduate Student: Zhang, Lu-Wen (張露文)
Thesis Title: HYPREL: A Topology Aware Parallel Computing Job Scheduler based on Hypergraph Partitioning
Advisor: Chou, Jerry (周志遠)
Committee Members: Lai, Kuan-Chou (賴冠州); Lee, Che-Rung (李哲榮)
Degree: Master
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: English
Number of Pages: 42
Keywords: parallel computing, scheduling, hypergraph partition, optimization


    In the scheduling of parallel computing jobs on a parallel system, a job's execution time is affected by the placement of its tasks, so the compute nodes assigned to a job should be as close to each other as possible. However, as old jobs exit at different times, fragmentation of system resources at any point in time is inevitable, and the trade-off between jobs' locality requirements and system fragmentation is an essential issue in parallel computing job scheduling. Most prior methods impose locality restrictions on the allocated resources to satisfy a job's communication requirements, but overly strict placement restrictions prolong job waiting time: the system's available resources are fragmented, and under gang scheduling a job must wait until resources satisfying the strict locality constraint are released. This thesis therefore proposes a topology-aware parallel computing job scheduling method based on hypergraph partitioning. The method models the topology information of the parallel system as a weighted hypergraph; for each scheduling decision, it selects the most suitable combination of pending jobs for the currently available nodes by computing a minimum cut of a k-way hypergraph partitioning. In our experiments, the method achieves shorter average job completion time and less congestion: compared with state-of-the-art methods, it shortens average job completion time by 24%-44% and significantly reduces system congestion and job waiting time.
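    The abstract above describes the method only at a high level. As a rough, hypothetical sketch (not the thesis's actual implementation), the Python fragment below illustrates the two ingredients the abstract names: modeling cluster topology as a weighted hypergraph, and scoring a candidate allocation by the standard (lambda - 1) connectivity cut of a partition. The topology, edge weights, helper names, and the brute-force search standing in for a real k-way partitioner are all assumptions for illustration.

        from itertools import combinations

        # Assumed toy topology: each hyperedge groups the compute nodes under
        # one switch/rack, weighted by the cost of traffic crossing it.
        hyperedges = {
            "rack0": ({"n0", "n1", "n2", "n3"}, 1.0),
            "rack1": ({"n4", "n5", "n6", "n7"}, 1.0),
            "spine": ({"n0", "n1", "n2", "n3", "n4", "n5", "n6", "n7"}, 4.0),
        }

        def connectivity_cut(parts):
            # Standard (lambda - 1) metric: a hyperedge spanning `lam` blocks
            # of the partition contributes weight * (lam - 1) to the cut.
            cost = 0.0
            for pins, weight in hyperedges.values():
                lam = sum(1 for block in parts if pins & block)
                if lam > 1:
                    cost += weight * (lam - 1)
            return cost

        def best_placement(free_nodes, job_size):
            # Brute-force stand-in for a k-way hypergraph partitioner: pick the
            # job_size free nodes whose separation from the remaining free pool
            # has the smallest connectivity cut, i.e. the most local allocation.
            best_alloc, best_cost = None, float("inf")
            for alloc in combinations(sorted(free_nodes), job_size):
                parts = [set(alloc), free_nodes - set(alloc)]
                cost = connectivity_cut(parts)
                if cost < best_cost:
                    best_alloc, best_cost = alloc, cost
            return best_alloc, best_cost

        free = {"n0", "n1", "n4", "n5", "n6", "n7"}  # fragmented free pool
        print(best_placement(free, 4))  # -> (('n4', 'n5', 'n6', 'n7'), 4.0)

    In this toy run the scorer prefers the four nodes under the same rack (cut 4.0) over any split across racks (cut 5.0), which mirrors the locality preference the scheduler optimizes; the thesis replaces the brute-force search with a k-way hypergraph partitioning algorithm.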

    1 Introduction
    2 Problem Description
    2.1 Problem Assumption
    2.2 Challenge
    2.3 Goal
    3 HYPREL Design
    3.1 General Idea
    3.2 Introduction to Hypergraph
    3.3 Problem Mapping
    3.4 Solve a Simple Case
    3.5 HYPREL Algorithm
    3.6 Complexity Analysis
    3.7 Algorithm of Hypergraph Partition
    4 Experiment Setup
    4.1 Simulator
    4.2 Workload
    4.3 Baseline
    4.4 Performance Model
    5 Experiment Results
    5.1 Cluster Efficiency
    5.2 Dissecting Improvement
    5.3 Sensitivity Analysis
    6 Discussion and Future Work
    7 Related Work
    8 Conclusion
    References

