
Student: Cheng, Bo-Wun (鄭柏文)
Thesis Title: COLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU
(Chinese title: 繪圖處理器中重複快取需求之高效共同處理機制)
Advisor: Lee, Chun-Yi (李濬屹)
Committee Members: Chen, Yu-Guang (陳聿廣); Yeh, Tsung-Tai (葉宗泰)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of Publication: 2023
Academic Year of Graduation: 111 (2022-23)
Language: English
Pages: 25
Keywords: Graphics Processing Unit, Cache
  • This thesis investigates how to alleviate Network-on-Chip (NoC) congestion in modern Graphics Processing Units (GPUs) by capturing the replicated cache requests generated within a cluster of Streaming Multiprocessors (SMs). To this end, we propose integrating a Cache line Ownership Lookup tABle (COLAB) into each SM cluster, which records which SM's level-one (L1) data cache within the cluster holds a given cache line. By consulting the ownership information stored in this table, the SMs in a cluster can redirect replicated cache requests to the SM that owns the corresponding cache line, preventing those requests from consuming precious NoC bandwidth. Our experimental results show that, at a modest hardware cost, this mechanism effectively relieves the NoC congestion caused by replicated cache requests and thereby improves overall GPU performance: it reduces NoC read traffic by 38% and increases instructions per cycle (IPC) by 43%. The results of this thesis have been compiled into a research paper, which has been accepted for presentation at the 28th Asia and South Pacific Design Automation Conference (ASP-DAC 2023) [1].


    In this work, we aim to capture replicated cache requests between Streaming Multiprocessors (SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of modern GPUs. To achieve this objective, we incorporate a per-cluster Cache line Ownership Lookup tABle (COLAB) that keeps track of which SM within a cluster holds a copy of a specific cache line. With the assistance of COLAB, SMs can collaboratively and efficiently process replicated cache requests within SM clusters by redirecting them according to the ownership information stored in COLAB. By servicing, within SM clusters, replicated cache requests that would otherwise consume precious NoC bandwidth, the heavy pressure on the NoC interconnect can be eased. Our experimental results demonstrate that adopting COLAB indeed alleviates the excessive NoC pressure caused by replicated cache requests and improves the overall system throughput of the baseline GPU while incurring minimal overhead. On average, COLAB reduces NoC traffic by 38% and improves instructions per cycle (IPC) by 43%. The results of this thesis have been compiled into a research paper, which has been accepted for publication and presentation at the 28th Asia and South Pacific Design Automation Conference (ASP-DAC 2023) [1].
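The request-redirection idea described above can be illustrated with a small toy model. All class and function names below are illustrative assumptions for exposition, not the thesis's actual hardware organization (which is detailed in Section 2.3 of the thesis); real COLAB is a hardware table and must also cope with false-positive and false-negative lookups, which this sketch omits.

```python
class Colab:
    """Toy per-cluster ownership table: cache-line tag -> owning SM id."""

    def __init__(self):
        self.owner = {}  # tag -> id of the SM whose L1 holds the line

    def record_fill(self, tag, sm_id):
        # Called when an SM's L1 fills a line: register that SM as owner.
        self.owner[tag] = sm_id

    def lookup(self, tag):
        # Returns the owner SM id, or None if no SM in the cluster has it.
        return self.owner.get(tag)


def service_request(colab, requester_sm, tag):
    """Decide where a cache request is serviced.

    Returns one of:
      'local'    - the requester already owns the line (L1 hit)
      'redirect' - a replicated request, served by a peer SM's L1
                   inside the cluster instead of crossing the NoC
      'noc'      - no owner in the cluster; the request goes over
                   the NoC to L2/memory, and the requester becomes owner
    """
    owner = colab.lookup(tag)
    if owner is None:
        colab.record_fill(tag, requester_sm)  # line fills requester's L1
        return "noc"
    if owner == requester_sm:
        return "local"
    return "redirect"  # intra-cluster sharing saves NoC bandwidth


if __name__ == "__main__":
    colab = Colab()
    print(service_request(colab, requester_sm=0, tag=0x1A))  # noc
    print(service_request(colab, requester_sm=1, tag=0x1A))  # redirect
    print(service_request(colab, requester_sm=0, tag=0x1A))  # local
```

In this model, only the first request for a line crosses the NoC; every later replicated request from a peer SM in the same cluster is redirected to the owner's L1, which is the source of the NoC-traffic reduction the abstract reports.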

    Abstract (Chinese)
    Acknowledgements (Chinese)
    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1 Introduction
    2 Methodology
      2.1 Overview
      2.2 Workflow of COLAB
        2.2.1 Workflow
        2.2.2 False-Positive and False-Negative Errors
        2.2.3 Arbitration Policy between COLAB and Local L1 Requests
      2.3 Organization of COLAB
        2.3.1 Detailed Architecture of COLAB
        2.3.2 Estimated Hardware Overhead
    3 Experimental Results
      3.1 Experimental Setup
      3.2 NoC Traffic and Execution Throughput
      3.3 Energy Evaluation
      3.4 Ablation Studies
        3.4.1 Analysis on the Arbitration Policy
        3.4.2 Analysis on the SM Cluster Size
    4 Related Works
    5 Conclusion
    Bibliography

    [1] B.-W. Cheng, E.-M. Haung, C.-H. Chao, W.-F. Sun, T.-T. Yeh, and C.-Y. Lee, “Colab: Collaborative and efficient processing of replicated cache requests in gpu,” in Proceedings of the 28th Asia and South Pacific Design Automation Conference, ASP-DAC ’23, forthcoming.
    [2] NVIDIA, “Nvidia ampere ga102 gpu architecture,” Sept. 2020.
    [3] J. Wang, L. Jiang, J. Ke, X. Liang, and N. Jing, “A sharing-aware l1.5d cache for data reuse in gpgpus,” in Proceedings of the 24th Asia and South Pacific Design Automation Conference, ASPDAC ’19, (New York, NY, USA), p. 388–393, Association for Computing Machinery, 2019.
    [4] M. A. Ibrahim, O. Kayiran, Y. Eckert, G. H. Loh, and A. Jog, “Analyzing and leveraging decoupled l1 caches in gpus,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 467–478, Feb 2021.
    [5] M. A. Ibrahim, O. Kayiran, Y. Eckert, G. H. Loh, and A. Jog, “Analyzing and leveraging shared l1 caches in gpus,” in Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, PACT ’20, (New York, NY, USA), p. 161–173, Association for Computing Machinery, 2020.
    [6] K. Choo, W. Panlener, and B. Jang, “Understanding and optimizing gpu cache memory performance for compute workloads,” in 2014 IEEE 13th International Symposium on Parallel and Distributed Computing, pp. 189–196, 2014.
    [7] S. Dublish, V. Nagarajan, and N. Topham, “Cooperative caching for gpus,” ACM Trans. Archit. Code Optim., vol. 13, dec 2016.
    [8] M. A. Ibrahim, H. Liu, O. Kayiran, and A. Jog, “Analyzing and leveraging remote-core bandwidth for enhanced performance in gpus,” in 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 258–271, 2019.
    [9] B.-W. Cheng, E.-M. Haung, C.-H. Chao, W.-F. Sun, T.-T. Yeh, and C.-Y. Lee, “Remote access tag array for efficient gpu intra-cluster data sharing,” in Proceedings of the 24th Workshop on Synthesis And System Integration of Mixed Information technologies, SASIMI ’22, p. 221–222, 2022.
    [10] D. Tarjan and K. Skadron, “The sharing tracker: Using ideas from cache coherence hardware to reduce off-chip memory traffic with non-coherent caches,” in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’10, (USA), p. 1–10, IEEE Computer Society, 2010.
    [11] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-sim: An extensible simulation framework for validated gpu modeling,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 473–486, 2020.
    [12] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing nuca organizations and wiring alternatives for large caches with cacti 6.0,” in 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pp. 3–14, 2007.
    [13] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “Gpuwattch: Enabling energy optimizations in gpgpus,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), p. 487–498, Association for Computing Machinery, 2013.
    [14] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 44–54, 2009.
    [15] A. Karki, C. P. Keshava, S. M. Shivakumar, J. Skow, G. M. Hegde, and H. Jeon, “Detailed characterization of deep neural networks on gpus and fpgas,” in Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, GPGPU ’19, (New York, NY, USA), p. 12–21, Association for Computing Machinery, 2019.
    [16] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, “Auto-tuning a high-level language targeted to gpu codes,” in 2012 Innovative Parallel Computing (InPar), pp. 1–10, 2012.
    [17] X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-M. Hwu, “Adaptive cache management for energy-efficient gpu computing,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, (USA), p. 343–355, IEEE Computer Society, 2014.
    [18] G. Koo, Y. Oh, W. W. Ro, and M. Annavaram, “Access pattern-aware cache management for improving data utilization in gpu,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, (New York, NY, USA), p. 307–319, Association for Computing Machinery, 2017.
    [19] Y. Oh, G. Koo, M. Annavaram, and W. W. Ro, “Linebacker: Preserving victim cache lines in idle register files of gpus,” in Proceedings of the 46th International Symposium on Computer Architecture, ISCA ’19, (New York, NY, USA), p. 183–196, Association for Computing Machinery, 2019.
    [20] C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou, “Locality- driven dynamic gpu cache bypassing,” in Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, (New York, NY, USA), p. 67–77, Association for Computing Machinery, 2015.
    [21] T. G. Rogers, M. O’Connor, and T. M. Aamodt, “Cache-conscious wavefront scheduling,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 72–83, 2012.
    [22] A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, “Owl: Cooperative thread array aware scheduling techniques for improving gpgpu performance,” in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, (New York, NY, USA), p. 395–406, Association for Computing Machinery, 2013.
    [23] T. G. Rogers, M. O’Connor, and T. M. Aamodt, “Divergence-aware warp scheduling,” in 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 99–110, 2013.
    [24] O. Kayıran, A. Jog, M. T. Kandemir, and C. R. Das, “Neither more nor less: Optimizing thread-level parallelism for gpgpus,” in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pp. 157–166, 2013.
