
Author: Wang, Yu-Cheng (王宇正)
Title: A Deep Reinforcement Learning Method for Solving Task Mapping Problems with Dynamic Traffic on Parallel Systems
(Chinese title: 利用深度學習方法解決具動態流量的分散式執行工作在平行系統的資源配置問題)
Advisor: Chou, Jerry (周志遠)
Committee Members: Lee, Che-Rung (李哲榮); Hung, Shih-Hao (洪士灝)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of Publication: 2020
Academic Year: 108
Language: English
Pages: 45
Chinese Keywords: 任務對應, 平行應用程式, 深度強化學習, 演算法
English Keywords: task mapping, parallel applications, deep learning, algorithm
Abstract (Chinese):
  • In parallel computing systems, efficiently mapping a parallel application's communication pattern onto the network topology it runs on is an important problem in optimizing parallel systems. In particular, as the scale of execution grows or communication costs rise, a good placement can substantially improve application performance.
    This problem has been studied extensively, and many different solutions have been proposed. Most prior approaches formulate the application's communication pattern and the network topology as two static graphs and propose mathematical methods or algorithms to find a mapping between the two graphs. In practice, however, static graphs are hard to obtain because most parallel applications have dynamic communication patterns, and a static graph cannot precisely describe an application's actual communication behavior. We therefore propose a deep reinforcement learning (DRL) solution that uses dynamic communication information and a better evaluation model to find a good mapping between a parallel application and the network topology it runs on.
    We compare our proposed solution in depth with existing methods and libraries. Our solution achieves better or comparable application performance. In particular, on real parallel applications, it achieves average performance improvements of 11% on torus topologies and 16% on dragonfly topologies, while the average improvement of every other method is below 6%.
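The static-graph formulation mentioned above can be made concrete with a small example. The sketch below is illustrative only (the names `traffic`, `dist`, `hop_bytes`, and the toy 1-D chain topology are our own, not from the thesis): it scores a task-to-node mapping by the common hop-bytes metric, traffic volume multiplied by network distance, summed over all task pairs.

```python
# Illustrative sketch of the classical static-graph formulation, not the
# thesis's code: score a mapping by hop-bytes (traffic volume * hop count).

def hop_bytes(traffic, dist, mapping):
    """traffic[i][j]: bytes sent from task i to task j.
    dist[a][b]: hop count between nodes a and b.
    mapping[i]: node assigned to task i."""
    n = len(traffic)
    return sum(traffic[i][j] * dist[mapping[i]][mapping[j]]
               for i in range(n) for j in range(n))

# Toy setup: 4 tasks with nearest-neighbor traffic, on a 1-D chain of
# 4 nodes where the distance between nodes a and b is |a - b|.
traffic = [[8 if abs(i - j) == 1 else 0 for j in range(4)] for i in range(4)]
dist = [[abs(a - b) for b in range(4)] for a in range(4)]

print(hop_bytes(traffic, dist, [0, 1, 2, 3]))  # prints 48: neighbors stay adjacent
print(hop_bytes(traffic, dist, [0, 2, 1, 3]))  # prints 80: interleaving costs extra hops
```

A static-graph method searches for the permutation minimizing this cost; the thesis's point is that when `traffic` changes over time, any single static matrix misrepresents the application.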


    Abstract (English):
    Efficiently mapping an application's communication pattern onto the network topology is a critical problem for optimizing the performance of communication-bound applications on parallel computing systems.
    The problem has been studied extensively, but prior work mostly formulates it as finding an isomorphic mapping between two static graphs whose edges are annotated with traffic volume and network bandwidth. In practice, however, network performance is difficult to estimate accurately, and communication patterns often change over time and are not easily obtained. This work therefore proposes a deep reinforcement learning (DRL) approach that learns an efficient task mapping algorithm by exploring task mappings with the performance predictions and runtime communication behaviors provided by a simulator. We extensively evaluated our approach using both synthetic and real applications with varied communication patterns on Torus and Dragonfly networks. Compared with several existing approaches from the literature and software libraries, our approach found task mappings that consistently achieved comparable or better application performance. In particular, for a real application, the average improvements of our approach on the Torus and Dragonfly networks are 11% and 16%, respectively, while the average improvements of all other approaches are below 6%.
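The abstract describes the DRL approach only at a high level, so the following is a toy-scale sketch of the general idea rather than the thesis's actual model: a policy places tasks on nodes one at a time, the reward is the negative cost reported by a performance model (here a hop-bytes stand-in for the simulator on a 1-D chain), and the policy is trained with REINFORCE against a moving baseline. The tabular policy, the toy cost, and all hyperparameters are assumptions for illustration.

```python
# Hedged sketch of a DRL task-mapping loop (illustrative, not the thesis's code).
import math
import random

random.seed(0)
TASKS = NODES = 4
# Nearest-neighbor traffic: task i exchanges 8 bytes with task i +/- 1.
TRAFFIC = [[8 if abs(i - j) == 1 else 0 for j in range(TASKS)] for i in range(TASKS)]

def cost(mapping):
    """Toy stand-in for the simulator's prediction: hop-bytes on a 1-D chain."""
    return sum(TRAFFIC[i][j] * abs(mapping[i] - mapping[j])
               for i in range(TASKS) for j in range(TASKS))

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Tabular policy: theta[t][n] = preference for placing task t on node n.
theta = [[0.0] * NODES for _ in range(TASKS)]

def sample_mapping():
    """Place tasks one by one, masking nodes that are already occupied."""
    mapping, trace, used = [], [], set()
    for t in range(TASKS):
        probs = softmax([theta[t][n] if n not in used else -1e9
                         for n in range(NODES)])
        n = random.choices(range(NODES), probs)[0]
        mapping.append(n)
        used.add(n)
        trace.append((t, n, probs))
    return mapping, trace

baseline, best_cost, lr = None, float("inf"), 0.5
for _ in range(300):
    mapping, trace = sample_mapping()
    r = -cost(mapping)                 # reward = negative predicted cost
    best_cost = min(best_cost, -r)
    baseline = r if baseline is None else 0.9 * baseline + 0.1 * r
    adv = r - baseline                 # advantage over the moving baseline
    for t, chosen, probs in trace:     # REINFORCE: theta += lr * adv * d log pi
        for n in range(NODES):
            indicator = 1.0 if n == chosen else 0.0
            theta[t][n] += lr * adv * (indicator - probs[n])

print("best hop-bytes cost found:", best_cost)
```

A real implementation would replace `cost` with runtime predictions from a network simulator and the lookup table with a neural policy, but the reward signal and policy-gradient update keep this same shape.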

    Chapter 1: Introduction ----------------  1
    Chapter 2: Background ------------------  5
    Chapter 3: DRL Task Mapping Approach ---- 10
    Chapter 4: Setups ----------------------- 19
    Chapter 5: Experimental Results --------- 25
    Chapter 6: Related Work ----------------- 37
    Chapter 7: Conclusions ------------------ 39
    References ------------------------------ 41

