
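Author: Tu, Chiao-Yun (杜巧韻)
Thesis title: Run-Time Frequency Scaling of On-Chip Networks for GPGPU (通用圖形處理器內晶片網路之動態頻率調整機制)
Advisor: King, Chung-Ta (金仲達)
Committee members: Hwang, Ting-Ting (黃婷婷); Liou, Jing-Jia (劉靖家)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2013
Academic year of graduation: 101 (ROC calendar)
Language: English
Number of pages: 43
Keywords (translated from Chinese): general-purpose GPU, on-chip network, dynamic frequency scaling
Keywords (English): Dynamic Frequency Scaling, Many-to-Few-to-Many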
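    Today's general-purpose graphics processing units (GPGPUs) can deliver ten or even a hundred times the computing power of general-purpose processors (CPUs) on highly parallel applications. Because the traffic behavior of GPGPUs on the network-on-chip (NoC) differs from that of CPUs, conventional NoC architectures designed for CPUs are not suitable for GPGPUs. This master's thesis proposes a dynamic frequency scaling mechanism that adjusts the network frequency according to the load handled by the on-chip network, so as to meet the bandwidth demands of different applications.
    In this thesis, we first study the network traffic behavior of GPGPUs and classify it into three types. Under the different types, the request network and the reply network have different bandwidth requirements because their network loads differ. Next, we dynamically monitor a subset of the shader cores to predict the network load, and then adjust the network frequency according to the type characteristics identified in the earlier study. Experimental results show that this dynamic frequency scaling mechanism improves performance by up to 27% (7.4% on average).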


    Modern general-purpose GPUs (GPGPUs) provide orders of magnitude more computing power than general-purpose processors (CPUs) for highly parallel applications. Because the traffic patterns of GPGPUs differ considerably from those of CPUs, conventional interconnection network designs for CPUs are not well suited to GPGPUs. This thesis proposes a run-time dynamic frequency scaling (DFS) mechanism that meets the bandwidth demands of different applications by tuning the network frequency in response to the network load. We first investigate the characteristics of GPGPU traffic and classify the traffic patterns into three types. Under each type, the request network and the reply network require different bandwidths to handle their loads. Second, we exploit this property to regulate the network frequency dynamically by monitoring a subset of the shader cores and predicting the network load. Evaluation shows that this dynamic frequency tuning design achieves up to 27% improvement over the baseline setting (7.4% on average).
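    The control loop described in the abstract (sample the monitored shader cores, classify the traffic type, then set the request and reply network frequencies) could look roughly like the sketch below. It is only an illustration under stated assumptions: the three type names, the 2x ratio threshold, the frequency levels, and every identifier are hypothetical, since the abstract does not specify any of them, and the thesis's actual policy may differ.

```python
from dataclasses import dataclass

# Hypothetical frequency levels in MHz (low, mid, high); not from the thesis.
FREQ_LEVELS_MHZ = (400, 600, 800)


@dataclass
class WindowSample:
    """Counters gathered from the monitored shader cores in one time window."""
    request_flits: int  # flits the cores injected into the request network
    reply_flits: int    # flits the cores received from the reply network


def classify(sample: WindowSample, ratio_threshold: float = 2.0) -> str:
    """Label a window with one of three assumed traffic types."""
    req, rep = sample.request_flits, sample.reply_flits
    if req > ratio_threshold * rep:
        return "request-heavy"
    if rep > ratio_threshold * req:
        return "reply-heavy"
    return "balanced"


def choose_frequencies(sample: WindowSample) -> tuple[int, int]:
    """Return (request_mhz, reply_mhz) for the next time window."""
    low, mid, high = FREQ_LEVELS_MHZ
    kind = classify(sample)
    if kind == "request-heavy":
        return high, low   # give the request network more bandwidth
    if kind == "reply-heavy":
        return low, high   # give the reply network more bandwidth
    return mid, mid        # similar load on both networks


if __name__ == "__main__":
    # One window in which replies dominate the observed traffic.
    print(choose_frequencies(WindowSample(request_flits=100, reply_flits=900)))
```

    In this sketch the decision is made once per time window from the previous window's counters, which is one simple way to read "predict the network load"; a real controller would also have to account for the scaling overhead listed in Section 3.3 of the table of contents.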

    1 Introduction
    2 Background
      2.1 Baseline GPGPU Architecture
      2.2 Characteristics of GPGPU Applications
        2.2.1 Many-to-Few-to-Many Traffic Pattern
        2.2.2 Characteristic of GPGPU Traffic Pattern
    3 Methodology
      3.1 DFS Policy Overview
        3.1.1 Data Collection
        3.1.2 Parameter Computation
        3.1.3 Dynamic Frequency Control Algorithm
      3.2 Frequency Tuning Rationale
        3.2.1 Ideal Frequency Tuning Technique
        3.2.2 Practical Frequency Tuning Technique
      3.3 Scaling Overhead and Hardware Implementation
        3.3.1 Scaling Overhead
        3.3.2 Hardware Implementation
    4 Experimental Evaluation
      4.1 Simulation Setup
      4.2 Evaluation Result
        4.2.1 Network Limit Exploration
        4.2.2 Core Monitor Exploration
        4.2.3 Time Window Size
        4.2.4 Performance Result
        4.2.5 Power Consumption
        4.2.6 Performance Gain by Ideal DFS Mechanism
    5 Related Work
    6 Conclusion and Future Work
      6.1 Conclusion
      6.2 Future Work


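    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)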
