Graduate Student: 廖柏皓 (Liao, Bo Hao)
Thesis Title: On the Effectiveness of Causality-aware Trace-driven NoC Simulation for GPGPUs (利用具因果關係之執行軌跡進行一般用途圖形處理器之晶片網路模擬的有效性)
Advisor: 金仲達 (King, Chung Ta)
Committee Members: 劉廣治, 劉靖家, 呂仁碩
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of Publication: 2015
Graduation Academic Year: 103
Language: English
Number of Pages: 34
Chinese Keywords: Network-on-Chip (晶片系統網路), trace simulation (軌跡模擬), heterogeneous systems (異質系統), general-purpose GPU (通用性圖形處理器)
English Keywords: NOC, Trace-driven, HSA, GPGPU
Abstract:
Heterogeneous computer architectures are becoming the mainstream of computer systems. In this landscape, General-Purpose Computing on Graphics Processing Units (GPGPU) is indispensable for supporting ultra-high-density computing. As the complexity of GPGPUs increases, designing efficient GPGPUs requires good supporting tools. Among them, execution-driven simulators are often used to evaluate architectural designs of GPGPUs. Execution-driven simulators can provide accurate and comprehensive performance data, but they often require very long simulation times, which slows down design space exploration. On the other hand, trace-driven simulators simulate only the specific components of interest, e.g. the Network-on-Chip (NoC) or the cache hierarchy, and rely on execution traces to mimic the operations of the other components, e.g. the processor cores. As a result, trace-driven simulators are fast and well suited to design space exploration. However, traces record the execution of the trace-generating machine, not the target machine, so trace-driven simulators often produce performance data with large error margins. A recent trend in trace-driven simulation is to use the causality relationships among trace events to adjust the event timing, instead of using the absolute event times recorded on the trace-generating machine. In this thesis, we apply the concept of causality-aware trace-driven simulation to the evaluation of the NoC of GPGPUs. We take a widely used execution-driven GPGPU simulator, Multi2Sim, and study how to extract causality information from its execution trace. One difficulty in determining the causality relationships of NoC events for GPGPUs is the latency-hiding mechanism, which allows multiple memory access requests to be outstanding at the same time. We discuss how to leverage the memory fence instructions of GPGPUs to identify the causality relations. The extracted causality traces are then fed into a well-known trace-driven NoC simulator, Garnet, which is modified to be causality-aware. Our evaluation results show that the causality-aware Garnet can match the performance trend obtained from the execution-driven simulator Multi2Sim, while the original Garnet cannot.
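The abstract describes two mechanisms at the core of the thesis: deriving causality relations between memory requests from GPGPU memory fence instructions, and replaying the trace so that each request is injected when the requests it depends on have completed, rather than at the absolute cycle recorded on the trace-generating machine. The sketch below illustrates both ideas on a toy single-compute-unit trace; the event format, field names, and the fixed network latency standing in for the detailed NoC model are assumptions made for illustration, not the actual Multi2Sim trace format or the modified Garnet interface.

```cpp
// causality_trace_sketch.cpp -- illustrative only; the trace layout and the
// constant network latency are hypothetical stand-ins for Multi2Sim traces
// and the modified, causality-aware Garnet model described in the thesis.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

enum class EventType { MemRequest, MemFence };

struct TraceEvent {
    uint64_t id;                 // position in the original trace
    EventType type;              // memory request or memory fence
    uint64_t recorded_cycle;     // cycle on the trace-generating machine (unused in replay)
    std::vector<uint64_t> deps;  // events that must complete before this one (filled below)
};

// (1) Fence-based causality extraction for one compute unit: requests issued
// after a fence depend on every request issued before that fence, while
// requests between two consecutive fences may stay outstanding concurrently
// (the latency hiding that makes absolute trace times misleading).
void annotate_with_fences(std::vector<TraceEvent>& trace) {
    std::vector<uint64_t> before_fence;  // requests ordered by the last fence
    std::vector<uint64_t> since_fence;   // requests issued since the last fence
    for (auto& ev : trace) {
        if (ev.type == EventType::MemFence) {
            before_fence = since_fence;  // the fence closes the current epoch
            since_fence.clear();
        } else {
            ev.deps = before_fence;      // wait only for the previous epoch
            since_fence.push_back(ev.id);
        }
    }
}

// (2) Causality-aware replay: a request is injected as soon as all of its
// dependencies have completed in the simulated network, and its completion
// time follows from the simulated latency, not from recorded_cycle. A constant
// latency stands in for the detailed NoC model that would normally be queried.
std::vector<uint64_t> replay(const std::vector<TraceEvent>& trace,
                             uint64_t network_latency) {
    std::vector<uint64_t> finish_cycle(trace.size(), 0);
    for (const auto& ev : trace) {
        uint64_t inject_cycle = 0;
        for (uint64_t dep : ev.deps)
            inject_cycle = std::max(inject_cycle, finish_cycle[dep]);
        finish_cycle[ev.id] = (ev.type == EventType::MemRequest)
                                  ? inject_cycle + network_latency  // traverse the NoC
                                  : inject_cycle;                   // fences add no traffic
    }
    return finish_cycle;
}

int main() {
    // Hypothetical trace of one compute unit: two loads, a fence, two more loads.
    std::vector<TraceEvent> trace = {
        {0, EventType::MemRequest, 100, {}},
        {1, EventType::MemRequest, 101, {}},
        {2, EventType::MemFence,   102, {}},
        {3, EventType::MemRequest, 103, {}},
        {4, EventType::MemRequest, 104, {}},
    };
    annotate_with_fences(trace);
    const auto finish = replay(trace, /*network_latency=*/50);
    for (const auto& ev : trace)
        std::cout << "event " << ev.id << " completes at cycle " << finish[ev.id] << "\n";
    return 0;
}
```

Running the sketch shows events 0 and 1 completing in parallel (latency hiding), while events 3 and 4 are injected only after both have finished, which is the ordering the fence implies regardless of the cycle counts recorded by the trace-generating machine. In the full system, the detailed Garnet model would determine these latencies instead of a constant.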