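Graduate student: | 葉國楷 Yeh, Kuo Kai |
---|---|
Thesis title: | 多核心平台上記憶體架構之設計與分析 Design and Analysis of Memory Interface Architecture for Many-Core Platforms |
Advisor: | 黃稚存 Huang, Chih Tsun |
Oral defense committee: | 劉靖家 Liou, Jing Jia; 金仲達 King, Chung Ta |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2015 |
Academic year of graduation: | 104 (ROC calendar) |
Language: | English |
Number of pages: | 63 |
Keywords (Chinese): | many-core, multi-channel memory controller, dynamic random-access memory |
Keywords (English): | Many-core, Multi-channel memory controller, DRAM |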
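Over the past few decades, system-on-chip (SoC) technology has let developers integrate more functions on a single chip. However, Moore's Law states that the number of transistors on a chip doubles every two years, so the complexity of chip design faces severe challenges. Undoubtedly, raising the complexity that design modules can handle is urgently needed. Today, single-core development has run into limits on clock frequency and problems with power consumption, so multi-core architectures were designed to replace the traditional single-core architecture. The advantages of a multi-core architecture are its computing performance, low power consumption, and suitability for multi-threaded applications. Nevertheless, the demand for memory bandwidth on multi-core architectures continues to grow. In 1994, Wulf and McKee argued that improvements in computer performance would come to a halt, and the facts bear this out: between 1986 and 2000, CPU performance grew at an average of 55% per year, far outpacing memory performance, which grew at an average of 10% per year. As a result, memory performance becomes the bottleneck for further gains in computer performance, and many engineers have therefore worked on improving performance between the memory controller and the memory. In addition, multi-core architectures that use a mesh or torus topology can suffer from long distances between cores and memory. To address this, we propose an architecture that shortens the time memory accesses spend on the NoC. The approach groups the cores and gives each group a dedicated memory channel; the other end of the channel connects to a multi-port Crossbar Switch that re-schedules memory accesses to the correct memory controller. We call this the CS-based approach. We also adopt the Standard Co-Emulation Modeling Interface (SCE-MI) to connect software and hardware and realize the complete platform. Compared with the conventional approach, the CS-based approach achieves a significant speedup of 1.18 to 1.74 times on SPLASH-2 programs, and the extra gate count required by the Crossbar Switch is about 7k.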
In past decades, system-on-chip (SoC) technology has allowed developers to add more functions to a single chip. However, Moore's Law indicates that transistor counts double approximately every two years, so design complexity faces a sharp challenge. Undoubtedly, raising the abstraction level of modeling and simulation is an urgent need. Nowadays, single-processor development has hit bottlenecks in clock frequency and energy efficiency, so the emerging many-core architecture has been designed to replace the traditional centralized single-core design. The advantages of multi-core processors are high-performance computing, low power consumption, and suitability for multi-threaded applications. However, the demand for memory bandwidth keeps increasing. In 1994, Wulf and McKee predicted that improvement in computer performance would stall. The evidence supports this: from 1986 to 2000, CPU speed improved at an annual rate of 55% while memory speed improved at only 10%. In other words, memory speed would become the bottleneck in computer performance. Therefore, many engineers are dedicated to improving the efficiency between the memory controller and DRAM.
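As a rough check on the growth rates quoted above, assume they compound annually over the 14 years from 1986 to 2000 (the compounding model and the rounding are assumptions made here, not figures from the thesis):

$$
\frac{(1+0.55)^{14}}{(1+0.10)^{14}} \approx \frac{462}{3.80} \approx 122
$$

Under that assumption, CPU speed would have grown roughly 460-fold and memory speed roughly 3.8-fold, leaving a gap of about two orders of magnitude.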
In addition, in a many-core architecture that connects cores with a mesh or torus topology, the distance from a core to DRAM may be very long. Based on this motivation, we present an architecture with more efficient memory access and a mechanism that reduces the routing time of memory accesses on the NoC. This mechanism clusters processors and assigns an exclusive memory channel to each cluster. The architecture uses a multi-port Crossbar Switch to re-schedule DRAM requests from the memory channels to DRAM. We call the architecture in which memory requests are routed by the Crossbar Switch the CS-based approach, in contrast with the Original approach, in which memory requests are routed by the NoC. To implement the architecture, we adopt SCE-MI to bridge the ESL many-core platform with the RTL memory sub-system. Experiments with SPLASH-2 applications demonstrate a remarkable speedup ranging from 1.18 to 1.74 times, and the extra Crossbar Switch costs about 7k gates.
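The clustering and crossbar routing described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the cluster size, the number of memory controllers, the 64-byte address interleaving, and every name in the code (`Request`, `channel_of`, `controller_of`, `crossbar_route`) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass

# All three parameters are assumed values, not taken from the thesis.
CLUSTER_SIZE = 4        # cores that share one dedicated memory channel
NUM_CONTROLLERS = 4     # memory controllers behind the crossbar
INTERLEAVE_BYTES = 64   # address interleaving granularity


@dataclass(frozen=True)
class Request:
    core_id: int
    address: int


def channel_of(core_id: int) -> int:
    """Each cluster owns one memory channel, so the cluster index
    doubles as the channel index."""
    return core_id // CLUSTER_SIZE


def controller_of(address: int) -> int:
    """Pick the target controller by interleaving addresses
    (an assumed mapping; the thesis may use a different one)."""
    return (address // INTERLEAVE_BYTES) % NUM_CONTROLLERS


def crossbar_route(requests):
    """Forward each request from its cluster's channel to the queue of
    the memory controller that owns the address."""
    queues = {mc: [] for mc in range(NUM_CONTROLLERS)}
    for req in requests:
        queues[controller_of(req.address)].append((channel_of(req.core_id), req))
    return queues


if __name__ == "__main__":
    demo = [Request(0, 0), Request(5, 64), Request(10, 256), Request(15, 192)]
    for mc, queue in crossbar_route(demo).items():
        print(mc, queue)
```

In this sketch a request never traverses the mesh to reach a controller: it leaves through its cluster's channel and the crossbar selects the controller directly, which is the routing-time saving the CS-based approach aims for.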
[1] B. Baas, Z. Yu, M. Meeuwsen, O. Sattari, R. Apperson, E. Work, J. Webb, M. Lai, T. Mohsenin, D. Truong et al., AsAP: A fine-grained many-core platform for DSP applications, Micro, IEEE, vol. 27, no. 2, pp. 34-45, 2007.
[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown et al., Tile64-processor: A 64-core SoC with mesh interconnect, in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International. IEEE, 2008, pp. 88-598.
[3] W. A. Wulf and S. A. McKee, Hitting the memory wall: implications of the obvious, ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20-24, 1995.
[4] Random-access memory, http://en.wikipedia.org/wiki/Random-access_memory.
[5] C. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari, Bridging the processor-memory performance gap with 3D IC technology, Design & Test of Computers, IEEE, vol. 22, no. 6, pp. 556-564, 2005.
[6] I. Loi and L. Benini, An efficient distributed memory interface for many-core platform with 3D stacked DRAM, in Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association, 2010, pp. 99-104.
[7] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, Memory access scheduling, in ACM SIGARCH Computer Architecture News, vol. 28, no. 2. ACM, 2000, pp. 128-138.
[8] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, Thread cluster memory scheduling: Exploiting differences in memory access behavior, in Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM International Symposium on. IEEE, 2010, pp. 65-76.
[9] P.-Y. Chen and C.-T. Huang, RTL Realization of NoC-Based Multi-Core Platform, in Master Thesis, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, Oct. 2011.
[10] AMBA specification (Rev 2.0), http://wwwmicro.deis.unibo.it/ magagni/amba99.pdf.
[11] D. Lampret, C.-M. Chen, M. Mlinar, J. Rydberg, M. Ziv-Av, C. Ziomkowski, G. McGary, B. Gardner, R. Mathur, and M. Bolado, OpenRISC 1000 Architecture Manual rev 1.3, http://opencores.org/or1k/Main_Page, May 2006.
[12] D. Lampret and J. Baxter, OpenRISC 1200 IP Core Specification rev 0.11, http://opencores.org/or1k/Main_Page, Jan. 2011.
[13] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown, and A. Agarwal, On-chip interconnection architecture of the tile processor, Micro, IEEE, vol. 27, no. 5, pp. 15-31, 2007.
[14] Sai Manoj P. D., Kanwen Wang, Hantao Huang and Hao Yu, Smart I/Os: A Data-pattern Aware 2.5D Interconnect with Space-Time Multiplexing.
[15] O. Mutlu and T. Moscibroda, Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems, in ACM SIGARCH Computer Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008, pp. 63-74.
[16] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob et al., An 80-tile 1.28 TOPS network-on-chip in 65nm CMOS, in Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International. IEEE, 2007, pp. 98-589.
[17] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti, Achieving predictable performance through better memory controller placement in many-core CMPs, in ACM SIGARCH Computer Architecture News, vol. 37, no. 3. ACM, 2009, pp. 451-461.
[18] S. Borkar, Thousand core chips: a technology perspective, in Proceedings of the 44th annual Design Automation Conference. ACM, 2007, pp. 746-749.
[19] Churoo (Chul-Woo) Park, HoeJu Chung, Yun-Sang Lee, Jun-Ho Shin, Jin-Hyung Cho, Seunghoon Lee, Ki-Whan Song, Kyu-Hyoun Kim, Jung-Bae Lee, Changhyun Kim, and Soo-In Cho, A 512-Mb DDR3 SDRAM Prototype and Self-Calibration Techniques, IEEE Journal of Solid-State Circuits, vol. 41, no. 4, Apr. 2006.
[20] DDR3 SDRAM Specification (JESD79-3A), JEDEC Standard, JEDEC Solid State Technology Association, Sept. 2007.
[21] 7 Series FPGAs Memory Interface Solutions v2.0 User Guide, October 2, 2013.
[22] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, The SPLASH-2 programs: Characterization and methodological considerations, in ACM SIGARCH Computer Architecture News, vol. 23, no. 2. ACM, 1995, pp. 24-36.
[23] J. P. Singh, W.-D. Weber, and A. Gupta, SPLASH: Stanford parallel applications for shared-memory, in ACM SIGARCH Computer Architecture News, vol. 20, no. 1, pp. 5-44, 1992.
[24] D. H. Bailey, FFTs in external or hierarchical memory, in Proceedings of the 1989 ACM/IEEE conference on Supercomputing. ACM, 1989, pp. 234-242.
[25] L. Greengard, The rapid evaluation of potential fields in particle systems. The MIT Press, 1988.
[26] P. Hanrahan, D. Salzman, and L. Aupperle, A rapid hierarchical radiosity algorithm, in ACM SIGGRAPH Computer Graphics, vol. 25, no. 4. ACM, 1991, pp. 197-206.
[27] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, A comparison of sorting algorithms for the Connection Machine CM-2, in Proceedings of the third annual ACM symposium on Parallel algorithms and architectures. ACM, 1991, pp. 3-16.
[28] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, Simics: A full system simulation platform, Computer, vol. 35, no. 2, pp. 50-58, 2002.
[29] M. M. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset, ACM SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 92-99, 2005.
[30] C.-K. Yu, Dynamic Timing Simulation for Network-on-Chip with Parameterized Router Pipeline Architectures and Arbitration Policies, in Master Thesis, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan.