
Author: Lin, En-Te (林恩德)
Title: Hiding the Data Transfer Time of a Shared GPU with Unified Memory
Advisor: Chou, Jerry (周志遠)
Committee members: 李哲榮, 賴冠州
Degree: Master
Department: College of Electrical Engineering and Computer Science — Department of Computer Science
Year of publication: 2022
Graduation academic year: 110 (ROC calendar)
Language: English
Pages: 32
Keywords: GPU sharing, Unified Memory, Algorithm


    GPUs have been ubiquitously adopted in recent years because of the enormous
    throughput driven by massive parallelism. However, many workloads cannot fully
    utilize a GPU all the time, so enabling GPU sharing among computing tasks has
    been proposed to address the utilization problem. Even then, the size of physical
    memory still limits GPU utilization when each application has low GPU utilization
    but a large memory footprint. Thus, Unified Memory is applied to break through
    the limit of physical memory, and we use prefetching to reduce the overhead that
    Unified Memory introduces. We found that the execution order is crucial to the
    total execution time in this scenario. Motivated by this observation, we propose
    a low-overhead scheduling algorithm to minimize the total execution time. In our
    solution, we define a metric, weight, to characterize each application. Our solution
    exploits the characteristics of the applications sharing the same GPU to utilize
    their idle time. In our evaluation, we wrote a simulator for the proposed solution
    and compared it against the optimal solution and other scheduling algorithms,
    including round-robin and shortest-job-first. Our solution reduces the total
    execution time by up to 32% compared to round-robin.
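The ordering effect the abstract describes can be sketched as a two-machine flow shop: data transfers queue on the PCIe bus, kernels queue on the GPU, and one application's transfer can overlap another's compute. The model below is an illustrative assumption, not the thesis's actual weight metric or simulator; under it, the classic Johnson's rule picks an order that hides as much transfer time as possible.

```python
# Toy model (an assumption for illustration): each application is a
# (transfer, compute) pair in seconds. Transfers are serialized on the bus,
# kernels are serialized on the GPU, and a transfer may overlap another
# application's compute — i.e., a two-machine flow shop.

def makespan(jobs):
    """Finish time when jobs run in the given order."""
    bus_free = gpu_free = 0.0
    for transfer, compute in jobs:
        bus_free += transfer                          # wait for the bus
        gpu_free = max(gpu_free, bus_free) + compute  # wait for data and GPU
    return gpu_free

def johnson_order(jobs):
    """Johnson's rule: transfer-light jobs first (ascending transfer),
    then compute-light jobs (descending compute)."""
    head = sorted((j for j in jobs if j[0] <= j[1]), key=lambda j: j[0])
    tail = sorted((j for j in jobs if j[0] > j[1]), key=lambda j: -j[1])
    return head + tail

jobs = [(4, 1), (1, 4), (3, 3)]
print(makespan(jobs))                  # arrival order → 12.0
print(makespan(johnson_order(jobs)))   # transfer-aware order → 9.0
```

Running the transfer-heavy job last lets its transfer overlap the earlier jobs' compute, which is the same intuition behind scheduling by a per-application characteristic rather than round-robin.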

    Table of Contents
    1 Introduction
    2 Background
      2.1 Gemini
        2.1.1 Frontend API Interception
        2.1.2 Backend scheduler
      2.2 Unified Memory
    3 Motivation
    4 Solution
      4.1 System architecture
      4.2 Optimal solution
      4.3 Metrics of the scheduling decision
        4.3.1 Computation ratio (C)
        4.3.2 Data transfer (D)
      4.4 Intuition
        4.4.1 Hide the data transfer latency
        4.4.2 Utilize the idle time effectively
      4.5 Algorithm explanation
    5 Setups
    6 Experiment results
      6.1 Scheduling algorithm comparison under different workloads
      6.2 Scheduling overhead
      6.3 Scheduling effects under different numbers of clients
    7 Conclusion
    References

