Graduate Student: 林恩德 Lin, En-Te
Thesis Title: 隱藏共享 GPU 在統一記憶體架構下的資料傳輸時間 / Hiding the Data Transfer Time of a Shared GPU with Unified Memory
Advisor: 周志遠 Chou, Jerry
Committee Members: 李哲榮, 賴冠州
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of Publication: 2022
Graduation Academic Year: 110
Language: English
Pages: 32
Keywords: GPU sharing, Unified Memory, Algorithm
Abstract:
GPUs have been ubiquitously adopted in recent years because of the enormous throughput driven by their massive parallelism. However, many workloads cannot fully utilize a GPU all the time, so GPU sharing among computing tasks has been proposed to address the utilization problem. Even with sharing, the size of physical memory still limits GPU utilization when each application has low GPU utilization but uses a large amount of memory. Unified Memory is therefore applied to break through the physical memory capacity limit, and we use prefetching to reduce the overhead that Unified Memory introduces. We find that, in this scenario, the execution order is crucial to the total execution time. Motivated by this observation, we propose a low-overhead scheduling algorithm to minimize the total execution time. In our solution, we define a matrix, termed weight, to characterize the applications. Our solution exploits the characteristics of the applications sharing the same GPU to utilize each application's idle time. In our evaluation, we implement a simulator for the proposed solution and compare it against the optimal solution and other scheduling algorithms, including round-robin and shortest-job-first. Our solution reduces the total execution time by up to 32% compared to round-robin.
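The abstract describes hiding transfer time with Unified Memory plus prefetching. The following is a minimal CUDA sketch of that mechanism, not the thesis's code: the kernel, data size, and use of the default stream are illustrative assumptions. It only shows the two API calls involved, cudaMallocManaged for on-demand page migration and cudaMemPrefetchAsync for staging data ahead of use.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel standing in for an arbitrary workload: scale a vector.
__global__ void scale(float *x, size_t n, float a) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 24;  // 16M floats (64 MB); a made-up size
    float *x = nullptr;

    // Unified Memory: one pointer valid on both CPU and GPU; pages migrate
    // on demand, which allows oversubscribing the GPU's physical memory.
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;

    int dev = 0;
    cudaGetDevice(&dev);

    // Prefetch to the GPU before the launch: the bulk transfer can overlap
    // other work and avoids per-page fault overhead of on-demand migration.
    cudaMemPrefetchAsync(x, n * sizeof(float), dev, 0);

    int blocks = (int)((n + 255) / 256);
    scale<<<blocks, 256>>>(x, n, 2.0f);

    // Prefetch back to the host before the CPU reads the result.
    cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    printf("x[0] = %.1f\n", x[0]);  // expect 2.0
    cudaFree(x);
    return 0;
}
```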
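The claim that execution order drives total execution time can be made concrete with a toy model. The sketch below is not the thesis's simulator and does not implement its weight matrix: it assumes each job reduces to one transfer phase and one compute phase, that transfers are serialized while the next job's transfer may overlap the current job's compute, and it compares submission order, shortest-job-first, and an exhaustive optimum (round-robin time-slicing is omitted for brevity). The job times are invented.

```cuda
// Host-only toy scheduler model; compiles with nvcc (or any C++ compiler).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical job model: a data-transfer phase then a compute phase (ms).
struct Job { double transfer, compute; };

// Makespan when transfers are serialized but job i+1's transfer may overlap
// job i's compute -- a two-stage pipeline where the order matters.
double makespan(const std::vector<Job> &order) {
    double xferDone = 0.0, compDone = 0.0;
    for (const Job &j : order) {
        xferDone += j.transfer;                               // bus is shared
        compDone = std::max(compDone, xferDone) + j.compute;  // waits for data
    }
    return compDone;
}

int main() {
    const std::vector<Job> jobs = {{10, 2}, {1, 8}, {4, 4}};  // invented times

    printf("submission order  : %.0f ms\n", makespan(jobs));

    // Shortest-job-first: sort by total service time, shortest first.
    std::vector<Job> sjf = jobs;
    std::sort(sjf.begin(), sjf.end(), [](const Job &a, const Job &b) {
        return a.transfer + a.compute < b.transfer + b.compute;
    });
    printf("shortest-job-first: %.0f ms\n", makespan(sjf));

    // Exhaustive search over permutations: feasible for a handful of jobs.
    std::vector<int> idx(jobs.size());
    std::iota(idx.begin(), idx.end(), 0);
    double best = makespan(jobs);
    do {
        std::vector<Job> order;
        for (int i : idx) order.push_back(jobs[i]);
        best = std::min(best, makespan(order));
    } while (std::next_permutation(idx.begin(), idx.end()));
    printf("optimal order     : %.0f ms\n", best);
    return 0;
}
```

Even in this three-job toy the orders differ (24 ms for submission order, 18 ms for shortest-job-first, 17 ms for the optimum), which mirrors the observation that motivates the proposed low-overhead scheduler.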