Author: | 陳煥文 Chen, Huan-Wen |
---|---|
Thesis title: | 應用於多核心平台之可堆疊記憶體存取效率改進與分析 Efficiency Improvement and Analysis of Accessing Stacked Memories on Many-Core Platforms |
Advisor: | 黃稚存 Huang, Chih-Tsun |
Oral defense committee: | 黃稚存 Huang, Chih-Tsun; 劉靖家 Liou, Jing-Jia; 黃俊達 Huang, Juinn-Dar |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2013 |
Academic year of graduation: | 102 |
Language: | English |
Number of pages: | 77 |
Keywords (Chinese, translated): | many-core, many-core single chip, stacked memory, Wide I/O, dynamic random-access memory |
Keywords (English): | Many-Core, CMP, Stacked Memories, Wide I/O, DRAM |
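Because of its simple structure and relatively low cost, dynamic random-access memory (DRAM) is commonly used as main memory in computer architecture. Historically, however, DRAM performance has improved far more slowly than processor core clock rates, which led W. Wulf and S. McKee to propose the concept of the "memory wall" as early as 1994. Meanwhile, to keep pace with Moore's law, the number of cores on a single chip keeps growing, from the original single-core designs to today's many-core systems. Compared with a single core, a many-core system relies on core-level parallelism rather than higher clock rates, yet its demand for memory throughput increases rather than decreases. Many researchers have therefore worked to improve memory access efficiency, for example by improving memory controller scheduling, widening the bus, or raising the memory access rate. In recent years, stacked memory architectures have gone some way toward meeting this throughput demand, but in many-core systems built on a network-on-chip (NoC), the distance from a core to a memory controller grows as the on-chip network grows. In this thesis, we therefore use an additional many-to-many switch network to handle accesses from cores to memory controllers. This both reduces the NoC congestion caused by heavy memory traffic and lets cores reach the memory controllers faster. Experiments with the SPLASH-2 benchmarks show that this architecture improves core-to-memory access efficiency by a factor of 1.13 to 2.57, and that it is applicable to current stacked memory architectures.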
Because of its structural simplicity, high density per unit area, and low cost, DRAM is well suited to serve as main memory in computer architecture. Historically, however, processor speed has improved faster than DRAM speed, a phenomenon that W. Wulf and S. McKee called the “memory wall”. Over the past few decades, the number of on-chip cores has grown from one to several, and the upcoming NoC-based (mostly mesh) many-core architectures no longer simply raise single-processor performance but exploit parallelism to meet throughput requirements cost-effectively. Unfortunately, the demand for memory bandwidth and throughput keeps increasing. Many engineers have therefore tried to improve the efficiency between the memory controller and the DRAM devices by proposing better memory scheduling policies, increasing bandwidth, and improving access speed. Recently, the emergence of 3D-stacked DRAM (Wide I/O) has slightly reduced the speed gap between the processor and the memory system. However, in an architecture that uses a Network-on-Chip to connect processors and memory controllers, some DRAM requests must travel a long distance to reach a memory controller. Motivated by this, in this thesis we present an architecture that improves the efficiency of accessing stacked memories on many-core platforms. The architecture uses an extra switch network to transport packets from the processors to the DRAM subsystem and assigns each small group of processors to a specific DRAM channel. This alleviates traffic contention between DRAM requests and inter-processor communication. As a baseline, we use the traditional approach in which all DRAM requests are routed through the NoC. Experimental results on SPLASH-2 applications show speedups ranging from 1.13 to 2.57 times, with an affordable crossbar switch network that is also applicable to the Wide I/O DRAM interface.
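The distance argument in the abstract can be illustrated with a small sketch. It is not the thesis's simulation platform; it only counts hops under assumptions that are not stated in the abstract: an n x n mesh with dimension-ordered (XY) routing, one memory controller at each of the four corners, and a hypothetical static mapping of contiguous core groups to DRAM channels (`channel_for_core`, `cores_per_group`).

```python
from itertools import product


def corner_controllers(n):
    """Assumed placement: one memory controller at each corner of an n x n mesh."""
    return [(0, 0), (0, n - 1), (n - 1, 0), (n - 1, n - 1)]


def mean_hops_to_nearest_controller(n, controllers):
    """Average Manhattan distance (hop count under XY routing) from every
    tile of an n x n mesh to its nearest memory controller."""
    total = 0
    for x, y in product(range(n), repeat=2):
        total += min(abs(x - cx) + abs(y - cy) for cx, cy in controllers)
    return total / (n * n)


def channel_for_core(core_id, cores_per_group):
    """Hypothetical grouping: contiguous blocks of cores share one DRAM channel."""
    return core_id // cores_per_group


if __name__ == "__main__":
    for n in (4, 8, 16):
        hops = mean_hops_to_nearest_controller(n, corner_controllers(n))
        print(f"{n}x{n} mesh: mean hops to nearest controller = {hops}")
    # 16 cores split over 4 channels (4 cores per group, an assumed ratio)
    print([channel_for_core(c, 4) for c in range(16)])
```

Under these assumptions the mean hop count grows from 1.0 on a 4x4 mesh to 3.0 on 8x8 and 7.0 on 16x16, which is the scaling problem a dedicated switch network between core groups and DRAM channels is meant to sidestep.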
[1] J.-S. Kim, C. S. Oh, H. Lee, D. Lee, H.-R. Hwang, S. Hwang, B. Na, J. Moon, J.-G. Kim,
H. Park et al., “A 1.2 v 12.8 gb/s 2gb mobile wide-i/o dram with 4×128 i/os using tsv-based
stacking,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE
International. IEEE, 2011, pp. 496–498.
[2] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson,
J.-W. Lee, W. Lee et al., “The raw microprocessor: A computational fabric for software circuits
and general-purpose programs,” Micro, IEEE, vol. 22, no. 2, pp. 25–35, 2002.
[3] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch,
R. Barua et al., “Baring it all to software: Raw machines,” Computer, vol. 30, no. 9, pp. 86–93,
1997.
[4] B. Baas, Z. Yu, M. Meeuwsen, O. Sattari, R. Apperson, E. Work, J. Webb, M. Lai, T. Mohsenin,
D. Truong et al., “Asap: A fine-grained many-core platform for dsp applications,” Micro, IEEE,
vol. 27, no. 2, pp. 34–45, 2007.
[5] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao,
J. Brown et al., “Tile64-processor: A 64-core soc with mesh interconnect,” in Solid-State Circuits
Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International. IEEE,
2008, pp. 88–598.
[6] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob,
S. Jain et al., “An 80-tile sub-100-w teraflops processor in 65-nm cmos,” Solid-State Circuits,
IEEE Journal of, vol. 43, no. 1, pp. 29–41, 2008.
[7] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar,
G. Schrom et al., “A 48-core ia-32 message-passing processor with dvfs in 45nm cmos,” in
Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International.
IEEE, 2010, pp. 108–109.
[8] W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications of the obvious,” ACM
SIGARCH computer architecture news, vol. 23, no. 1, pp. 20–24, 1995.
[9] S. Borkar, “Thousand core chips: a technology perspective,” in Proceedings of the 44th annual
Design Automation Conference. ACM, 2007, pp. 746–749.
[10] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory access scheduling,”
in ACM SIGARCH Computer Architecture News, vol. 28, no. 2. ACM, 2000, pp. 128–138.
[11] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith, “Fair queuing memory systems,” in Microarchitecture,
2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on. IEEE,
2006, pp. 208–222.
[12] O. Mutlu and T. Moscibroda, “Parallelism-aware batch scheduling: Enhancing both performance
and fairness of shared dram systems,” in ACM SIGARCH Computer Architecture News,
vol. 36, no. 3. IEEE Computer Society, 2008, pp. 63–74.
[13] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread cluster memory scheduling:
Exploiting differences in memory access behavior,” in Microarchitecture (MICRO), 2010 43rd
Annual IEEE/ACM International Symposium on. IEEE, 2010, pp. 65–76.
[14] R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu, “Staged
memory scheduling: Achieving high performance and scalability in heterogeneous systems,”
in Proceedings of the 39th International Symposium on Computer Architecture. IEEE Press,
2012, pp. 416–427.
[15] G. L. Yuan, A. Bakhoda, and T. M. Aamodt, “Complexity effective memory access scheduling
for many-core accelerator architectures,” in Proceedings of the 42nd Annual IEEE/ACM International
Symposium on Microarchitecture. ACM, 2009, pp. 34–44.
[16] C. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari, “Bridging the processor-memory performance
gap with 3d ic technology,” Design & Test of Computers, IEEE, vol. 22, no. 6, pp. 556–564, 2005.
[17] G. H. Loh, “3d-stacked memory architectures for multi-core processors,” in ACM SIGARCH
Computer Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008, pp. 453–464.
[18] G. L. Loi, B. Agrawal, N. Srivastava, S.-C. Lin, T. Sherwood, and K. Banerjee, “A thermally-aware
performance analysis of vertically integrated (3-d) processor-memory hierarchy,” in Proceedings
of the 43rd annual Design Automation Conference. ACM, 2006, pp. 991–996.
[19] I. Loi and L. Benini, “An efficient distributed memory interface for many-core platform with
3d stacked dram,” in Proceedings of the Conference on Design, Automation and Test in Europe.
European Design and Automation Association, 2010, pp. 99–104.
[20] T.-S. Hsu and J.-J. Liou, “A DVFS Many-core ESL Simulation Platform with Software Communication
API,” in Master Thesis, Department of Electrical Engineering, National Tsing Hua
University, Hsinchu, Taiwan, Nov. 2011.
[21] OCP International Partnership, “Open Core Protocol specification, release 2.0,” 2003.
[22] D. Lampret, C.-M. Chen, M. Mlinar, J. Rydberg, M. Ziv-Av, C. Ziomkowski, G. McGary,
B. Gardner, R. Mathur, and M. Bolado, “Openrisc 1000 architecture manual,” Description of
assembler mnemonics and other for OR1200, 2003.
[23] S. Rigo, G. Araujo, M. Bartholomeu, and R. Azevedo, “Archc: A systemc-based architecture description
language,” in Computer Architecture and High Performance Computing, 2004. SBAC-PAD
2004. 16th Symposium on. IEEE, 2004, pp. 66–73.
[24] J.-Y. Lai, P.-Y. Chen, T.-S. Hsu, C.-T. Huang, and J.-J. Liou, “Design and analysis of a many-core
processor architecture for multimedia applications,” in Signal & Information Processing
Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific. IEEE, 2012,
pp. 1–6.
[25] J. Bennett, “Building a loosely timed soc model with osci tlm 2.0,” 2008.
[26] E. Pekkarinen, L. Lehtonen, E. Salminen, and T. Hamalainen, “A set of traffic models for
network-on-chip benchmarking,” in System on Chip (SoC), 2011 International Symposium on.
IEEE, 2011, pp. 78–81.
[27] J. Aynsley, “Osci tlm-2.0 language reference manual,” Open SystemC Initiative (OSCI), p. 15,
2009.
[28] L. Lehtonen, E. Salminen, and T. Hamalainen, “Analysis of modeling styles on network-on-chip
simulation,” in NORCHIP, 2010. IEEE, 2010, pp. 1–4.
[29] J. Zhu, P. Liu, and D. Zhou, “An sdram controller optimized for high definition video coding
application,” in Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on.
IEEE, 2008, pp. 3518–3521.
[30] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “Atlas: A scalable and high-performance
scheduling algorithm for multiple memory controllers,” in High Performance Computer Architecture
(HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010, pp. 1–12.
[31] A. Sharifi, E. Kultursay, M. Kandemir, and C. R. Das, “Addressing end-to-end memory access
latency in noc-based multicores,” in Proceedings of the 2012 45th Annual IEEE/ACM International
Symposium on Microarchitecture. IEEE Computer Society, 2012, pp. 294–304.
[32] M. M. Lee, J. Kim, D. Abts, M. Marty, and J. W. Lee, “Approximating age-based arbitration in
on-chip networks,” in Proceedings of the 19th international conference on Parallel architectures
and compilation techniques. ACM, 2010, pp. 575–576.
[33] R. Das, O. Mutlu, T. Moscibroda, and C. R. Das, “Application-aware prioritization mechanisms
for on-chip networks,” in Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International
Symposium on. IEEE, 2009, pp. 280–291.
[34] ——, “Aérgia: exploiting packet latency slack in on-chip networks,” in ACM SIGARCH Computer
Architecture News, vol. 38, no. 3. ACM, 2010, pp. 106–116.
[35] A. Kumar, P. Kundu, A. Singh, L.-S. Peh, and N. Jha, “A 4.6 tbits/s 3.6 ghz single-cycle
noc router with a novel switch allocator in 65nm cmos,” in Computer Design, 2007. ICCD 2007.
25th International Conference on. IEEE, 2007, pp. 63–70.
[36] R. Mullins, A. West, and S. Moore, “Low-latency virtual-channel routers for on-chip networks,”
ACM SIGARCH Computer Architecture News, vol. 32, no. 2, p. 188, 2004.
[37] L.-S. Peh and W. J. Dally, “A delay model and speculative architecture for pipelined routers,” in
High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium
on. IEEE, 2001, pp. 255–266.
[38] A. Kumar, L.-S. Peh, and N. K. Jha, “Token flow control,” in Proceedings of the 41st annual
IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2008, pp.
342–353.
[39] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, “Express virtual channels: towards the ideal
interconnection fabric,” in ACM SIGARCH Computer Architecture News, vol. 35, no. 2. ACM,
2007, pp. 150–161.
[40] Y. Kim, H. Lee, and J. Kim, “An alternative memory access scheduling in manycore accelerators,”
in Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference
on. IEEE, 2011, pp. 195–196.
[41] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao,
J. F. Brown, and A. Agarwal, “On-chip interconnection architecture of the tile processor,” Micro,
IEEE, vol. 27, no. 5, pp. 15–31, 2007.
[42] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh,
T. Jacob et al., “An 80-tile 1.28 tflops network-on-chip in 65nm cmos,” in Solid-State Circuits
Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International. IEEE, 2007,
pp. 98–589.
[43] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti, “Achieving predictable performance
through better memory controller placement in many-core cmps,” in ACM SIGARCH
Computer Architecture News, vol. 37, no. 3. ACM, 2009, pp. 451–461.
[44] G. Katti, M. Stucchi, K. De Meyer, and W. Dehaene, “Electrical modeling and characterization
of through silicon via for three-dimensional ics,” Electron Devices, IEEE Transactions on,
vol. 57, no. 1, pp. 256–262, 2010.
[45] J. Standard, “Wide i/o single data rate,” JESD229, December, 2011.
[46] D. H. Woo, N. H. Seong, D. L. Lewis, and H.-H. Lee, “An optimized 3d-stacked memory architecture
by exploiting excessive, high-density tsv bandwidth,” in High Performance Computer
Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010, pp. 1–12.
[47] R. Ho, “On-chip wires: scaling and efficiency,” Ph.D. dissertation, Citeseer, 2003.
[48] P. Bai, C. Auth, S. Balakrishnan, M. Bost, R. Brain, V. Chikarmane, R. Heussner, M. Hussein,
J. Hwang, D. Ingerly et al., “A 65nm logic technology featuring 35nm gate lengths, enhanced
channel strain, 8 cu interconnect layers, low-k ild and 0.57 μm² sram cell,” in Electron Devices
Meeting, 2004. IEDM Technical Digest. IEEE International. IEEE, 2004, pp. 657–660.
[49] P.-Y. Chen and C.-T. Huang, “RTL Realization of NoC-Based Multi-Core Platform,” in Master
Thesis, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, Oct.
2011.
[50] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “Dramsim2: A cycle accurate memory system
simulator,” Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, 2011.
[51] B. Wilkinson and C. M. Allen, Parallel programming. Prentice hall New Jersey, 1999, vol.
999.
[52] J. R. Jensen et al., Introductory digital image processing: a remote sensing perspective.
Prentice-Hall Inc., 1996, no. Ed. 2.
[53] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The splash-2 programs: Characterization
and methodological considerations,” in ACM SIGARCH Computer Architecture News,
vol. 23, no. 2. ACM, 1995, pp. 24–36.
[54] J. P. Singh, W.-D. Weber, and A. Gupta, “Splash: Stanford parallel applications for shared-memory,”
ACM SIGARCH Computer Architecture News, vol. 20, no. 1, pp. 5–44, 1992.
[55] D. H. Bailey, “Ffts in external or hierarchical memory,” in Proceedings of the 1989 ACM/IEEE
conference on Supercomputing. ACM, 1989, pp. 234–242.
[56] L. Greengard, The rapid evaluation of potential fields in particle systems. the MIT Press, 1988.
[57] P. Hanrahan, D. Salzman, and L. Aupperle, “A rapid hierarchical radiosity algorithm,” in ACM
SIGGRAPH Computer Graphics, vol. 25, no. 4. ACM, 1991, pp. 197–206.
[58] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, “A
comparison of sorting algorithms for the connection machine cm-2,” in Proceedings of the third
annual ACM symposium on Parallel algorithms and architectures. ACM, 1991, pp. 3–16.
[59] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson,
A. Moestedt, and B. Werner, “Simics: A full system simulation platform,” Computer, vol. 35,
no. 2, pp. 50–58, 2002.
[60] M. M. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E.
Moore, M. D. Hill, and D. A. Wood, “Multifacet’s general execution-driven multiprocessor
simulator (gems) toolset,” ACM SIGARCH Computer Architecture News, vol. 33, no. 4, pp.
92–99, 2005.
[61] C. Weis, I. Loi, L. Benini, and N. Wehn, “An energy efficient dram subsystem for 3d integrated
socs,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012. IEEE,
2012, pp. 1138–1141.