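ESL Evaluation of Memory Interface Architecture for Many-Core System | National Tsing Hua University Theses and Dissertations Repository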

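Graduate student:	賴俊龍 Lai, Jyun-Long
Thesis title:	ESL Evaluation of Memory Interface Architecture for Many-Core System
Advisor:	黃稚存 Huang, Chih-Tsun
Oral defense committee:	李毅郎 Li, Yih-Lang; 劉靖家 Liou, Jing-Jia; 金仲達 King, Chung-Ta
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2016
Graduation academic year:	104 (ROC calendar, i.e., 2015-2016)
Language:	English
Number of pages:	55
Keywords (Chinese):	Many-core, memory architecture, electronic system level (ESL)
Keywords (English):	Many-Core, Memory Architecture, ESL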

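Compared with other types of random-access memory, dynamic random-access memory (DRAM) is structurally simple, dense, and relatively inexpensive, so computer architectures usually use it as main memory. However, for many years memory access latency has not decreased as quickly as processor clock rates have increased. In other words, DRAM performance has improved far more slowly than on-chip core clock speed, which led W. Wulf and S. McKee to describe this gap as the "memory wall."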

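Consequently, in recent decades designers have stopped blindly pursuing single-core chips with ever-higher clock rates. Instead, they increase the number of cores on a single chip, or adopt networked many-core systems, to raise parallelism, lower power consumption, and increase throughput. Unfortunately, the demand for memory throughput has grown rather than shrunk, so many researchers have worked to improve memory access efficiency, for example by improving the scheduling efficiency of the memory controller or by widening the bus to increase access speed.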

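In recent years, stacked memory architectures have narrowed the speed gap between processors and memory. In many-core systems built on a network-on-chip, however, the distance from a processor to a memory controller grows as the on-chip network becomes larger. For this reason, we add an extra many-to-many switching network and group the processors, providing additional channels for many-core accesses to the memory controllers. This reduces the on-chip network congestion caused by heavy memory traffic and improves the efficiency with which cores access the memory controllers. Experiments with SPLASH-2 show that core-to-memory access efficiency improves by a factor of 1.02 to 1.13.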

DRAM is structurally simple, achieves high density, and is less expensive than other types of RAM, which makes it well suited to serve as main memory in computer architectures. However, for many years DRAM access latency has not decreased at the same rate as microprocessor cycle time. In other words, processor speed has improved faster than DRAM speed, a phenomenon W. Wulf and S. McKee called the "memory wall." Therefore, over the past few decades, designers have stopped simply pushing single-processor performance and have instead increased the number of on-chip cores or adopted NoC-based many-core architectures to gain throughput and lower power consumption. Unfortunately, the demand for memory bandwidth and throughput keeps increasing. Many engineers have therefore worked to improve the efficiency between the memory controller and DRAM, for example by proposing better memory scheduling policies, increasing bandwidth, and improving access speed.

Recently, the emergence of 3D-stacked DRAM (Wide I/O) has somewhat narrowed the speed gap between the processor and the memory system. However, in many-core architectures that use a mesh or torus network as the bridge between processors and memory controllers, some DRAM requests must travel a long distance to reach a memory controller. Motivated by this, we present an architecture that improves the efficiency of accessing stacked memories and reduces routing time on a many-core platform. We use an extra crossbar switch interconnect to transport DRAM requests and group a small number of processors onto each designated DRAM channel. We refer to the traditional method as the "Original approach" and to our proposed architecture as the "CS-based approach." Experimental results on SPLASH-2 applications show speedups of 1.02 to 1.13 times with the crossbar switch interconnect.
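The routing-distance argument in the abstract can be illustrated with a small hop-count model. This is a hypothetical sketch, not the thesis's simulator: it assumes XY routing on an n x n mesh, memory controllers (MCs) at the four mesh corners, addresses interleaved evenly across the MCs, and a fixed one-hop cost (`CROSSBAR_HOPS`) through the crossbar. The names `mesh_hops`, `average_mesh_hops`, `channel_of`, and `CROSSBAR_HOPS` are introduced here for illustration only. The model ignores queuing and congestion, so it shows only the distance trend, not the reported 1.02 to 1.13 speedups.

```python
"""Hypothetical hop-count sketch: mesh NoC vs. crossbar for DRAM requests.

Assumptions (illustrative, not taken from the thesis):
  * PEs sit on an n x n mesh and packets follow XY routing, so the hop
    count is the Manhattan distance between routers.
  * Addresses are interleaved evenly across memory controllers (MCs), so a
    PE's request is equally likely to target any MC.
  * In the crossbar case, every request crosses the switch in a fixed
    number of hops (CROSSBAR_HOPS), independent of mesh size.
"""

CROSSBAR_HOPS = 1  # hypothetical constant traversal cost of the crossbar


def mesh_hops(src, dst):
    """Hops between two mesh routers under XY (dimension-order) routing."""
    (x0, y0), (x1, y1) = src, dst
    return abs(x0 - x1) + abs(y0 - y1)


def average_mesh_hops(n, mc_positions):
    """Mean hops over all PEs, with each PE's requests spread evenly over the MCs."""
    total = 0
    for x in range(n):
        for y in range(n):
            total += sum(mesh_hops((x, y), mc) for mc in mc_positions)
    return total / (n * n * len(mc_positions))


def channel_of(pe_id, num_pes, num_channels):
    """Assign PEs to DRAM channels in contiguous groups of equal size."""
    if num_pes % num_channels != 0:
        raise ValueError("num_pes must be a multiple of num_channels")
    group_size = num_pes // num_channels
    return pe_id // group_size


if __name__ == "__main__":
    for n in (4, 8):
        corners = [(0, 0), (0, n - 1), (n - 1, 0), (n - 1, n - 1)]
        print(f"{n}x{n} mesh, MCs at corners: "
              f"{average_mesh_hops(n, corners):.1f} hops on average; "
              f"crossbar: {CROSSBAR_HOPS} hop")
    # 16 PEs grouped onto 4 DRAM channels: PEs 0-3 -> channel 0, 4-7 -> 1, ...
    print([channel_of(pe, 16, 4) for pe in range(16)])
```

Under these assumptions the average mesh distance works out to n - 1 hops (3.0 for the 4x4 mesh, 7.0 for the 8x8 mesh), so it grows with the network, while the assumed crossbar path stays constant. Contiguous grouping is only one possible PE-to-channel mapping; an interleaved or locality-aware assignment would be equally plausible.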

Introduction
  1 Introduction to Many-Core Platform
  2 The Challenge of Memory Wall on SoC
  3 Motivation
  4 Contribution
  5 Thesis Organization
Previous Work
  1 Background
    1.1 Open Core Protocol
    1.2 OpenRISC
    1.3 Open Virtual Platforms
  2 Overview of The ESL Many-Core Platform
    2.1 Processing Element
    2.2 Communication Units
    2.3 Network-on-Chip Timing Simulator
  3 Hardware-Independent Software Layer
    3.1 Application-Level Inter-PE Communication Protocol
    3.2 DMA Utility
The Many-Core Based 3D-Stacked Memories Architecture
  1 Evolution of Memory System
    1.1 Wide I/O Technology
    1.2 Introduction of 3D-stacked memory
  2 Overview of the Full System Architecture
    2.1 Related Work
    2.2 Memory Controller Placement Analysis
    2.3 Methodology and Feasibility
    2.4 The Architecture of Memory Sub-system
    2.5 The Discussion of Scalability
  3 Crossbar Switch Interconnect
  4 DRAMSim2
Experiment Results and Analysis
  1 Overview of Experiment Environment
  2 Random Distribution Program Analysis
  3 Applications Analysis
    3.1 Odd-Even Sort Analysis
    3.2 3D Parallel Graphics Pipeline Program
  4 The SPLASH-2 Analysis
    4.1 Introduce SPLASH-2 Programs
    4.2 Analysis SPLASH-2 Trace and Experiment Result
Conclusion and Future Work
  1 Conclusion
  2 Future Work

[1] B. Baas, Z. Yu, M. Meeuwsen, O. Sattari, R. Apperson, E. Work, J. Webb, M. Lai, T. Mohsenin, D. Truong, and J. Cheung, "AsAP: A fine-grained many-core platform for DSP applications," IEEE Micro, vol. 27, no. 2, pp. 25-35, Mar. 2007.
[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown et al., "TILE64 processor: A 64-core SoC with mesh interconnect," in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, IEEE, 2008, pp. 88-598.
[3] S. R. Vangal et al., "An 80-tile sub-100-W teraFLOPS processor in 65-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 29-41, Jan. 2008.
[4] T. G. Mattson, R. Van der Wijngaart, and M. Frumkin, "Programming the Intel 80-core network-on-a-chip terascale processor."
[5] J. Howard, S. Dighe, S. R. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V. K. De, and R. Van der Wijngaart, "A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling," IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 173-183, Jan. 2011.
[6] T. G. Mattson, R. F. van der Wijngaart, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe, "The 48-core SCC processor: The programmer's view," in Proc. Int. Conf. High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-11.
[7] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20-24, 1995.
[8] S. Manoj, K. Wang, H. Huang, and H. Yu, "Smart I/Os: A data-pattern aware 2.5D interconnect with space-time multiplexing," in Proc. ACM/IEEE Int. Workshop on System Level Interconnect Prediction (SLIP), Jun. 2015, pp. 1-6.
[9] D. Xu, N. Yu, P. D. Sai Manoj, K. Wang, H. Yu, and M. Yu, "A 2.5-D memory-logic integration with data-pattern-aware memory controller," IEEE Design & Test, vol. 32, no. 4, pp. 1-10, Jun. 2015.
[10] G. H. Loh, "3D-stacked memory architectures for multi-core processors," in ACM SIGARCH Computer Architecture News, vol. 36, no. 3, IEEE Computer Society, 2008, pp. 453-464.
[11] G. L. Loi, B. Agrawal, N. Srivastava, S.-C. Lin, T. Sherwood, and K. Banerjee, "A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy," in Proc. 43rd Annual Design Automation Conference, ACM, 2006, pp. 991-996.
[12] I. Loi and L. Benini, "An efficient distributed memory interface for many-core platform with 3D stacked DRAM," in Proc. Conf. Design, Automation and Test in Europe, European Design and Automation Association, 2010, pp. 99-104.
[13] T. Zhang, C. Xu, K. Chen, G. Sun, and Y. Xie, "3D-SWIFT: A high-performance 3D-stacked wide IO DRAM," in Proc. ACM Great Lakes Symposium on VLSI (GLSVLSI), May 2014.

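Full-text release: not authorized for public release (on-campus network)
Full-text release: not authorized for public release (off-campus network)
Full-text release: not authorized for public release (National Central Library: Taiwan Theses and Dissertations System)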