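Graduate student: | 陳柏宇 Chen, Po-Yu |
---|---|
Thesis title: | RTL Realization of NOC-Based Multi-Core Platform 基於晶片網路的多核心平台之暫存器轉換層級實作 |
Advisor: | 黃稚存 Huang, Chih-Tsun |
Oral defense committee: | 劉靖家 Liou, Jing-Jia; 陳添福 Chen, Tien-Fu; 黃俊達 Huang, Juinn-Dar |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Computer Science |
Year of publication: | 2011 |
Academic year of graduation: | 100 (ROC calendar) |
Language: | English |
Number of pages: | 104 |
Chinese keywords: | 多核心 (multi-core), 晶片網路 (network-on-chip) |
Foreign-language keywords: | multi-core, Network-on-Chip (NoC) |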
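With advances in System-on-Chip (SoC) technology, multi-core processors are becoming increasingly important. Efficient and robust data transfer is a key issue today, especially in the design of multi-core processors.
This thesis uses a register-transfer-level (RTL) design approach to implement a mesh-based multi-core architecture containing 16 processing elements (PEs). Our system provides robust transmission between processing elements, the ability to execute parallel programs on the platform, methods for evaluating system performance bottlenecks, cross-verification with the Electronic System Level (ESL), and collection of hardware parameters such as power, energy consumption and operating frequency as feedback to the ESL.
Each processing element contains a processor and a Transmission Unit. The Transmission Unit consists of our proposed PE-to-PE Core and a modified DMA. To keep the system stable, data transfer between processing elements is handled by the PE-to-PE Core, rather than letting a processing element directly access the internal memory of other processing elements. The DMA moves data between a processing element and external memory. The Transmission Unit is driven by software, so a Low-Level Communication library (LLC library) was designed to drive the PE-to-PE Core. The LLC library provides control of the Transmission Unit and supports a software-side protocol that prevents data from arriving out of order. Based on the LLC software-side protocol, message-passing libraries such as the iLib library can be implemented on our platform to evaluate system performance while parallel programs run. Evaluating complex, realistic parallel programs such as JPEG and Odd-Even Sort typically takes millions of clock cycles. Program characteristics such as total clock cycle count, memory behavior and the number of cycles spent on communication can also be collected. From these test data we identified the performance bottlenecks of the multi-core system, and the findings can be fed back to the Electronic System Level, for example a SystemC platform. Exploration of the multi-core architecture can start on the SystemC platform, and the architectural changes obtained there can then be applied to the RTL platform. The RTL platform yields accurate clock cycle information.
Synthesized with the TSMC 0.13μm CMOS process, our proposed multi-core platform reaches an operating frequency of 100MHz. The proposed PE-to-PE Core occupies only 3.28% of the area of a processing element (19.2k logic gates). In addition, the aggregate throughput of the platform reaches 952.64 Mbps.
Our future work includes improving the performance of the LLC library, completing a field-programmable gate array (FPGA) prototype, changing the platform architecture to cluster-based processing elements, and finally collecting hardware parameters such as power, energy consumption and operating frequency as feedback to the Electronic System Level. We can try using the DMA to move data between the internal memory and the PE-to-PE Core, and redesign the LLC library in assembly language. For cluster-based processing elements, the most challenging part is solving cache coherence among them.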
With advances in System-on-Chip (SoC) technology, multi-core processors are becoming more and more important. Efficient and robust data transfer is one of the most critical and complex issues to consider, especially when designing multi-core systems. This thesis presents an RTL implementation of a mesh-based multi-core architecture containing 16 processing elements (PEs). Our platform provides 1) robust data transmission between PEs, 2) the ability to execute realistic parallel programs, 3) approaches to profiling system bottlenecks, 4) cross-verification with ESL (Electronic System Level) design, and 5) physical characteristics such as power, energy and frequency as feedback to ESL design. Each PE has a single processor and a Transmission Unit (TU). The TU is composed of a proposed PE-to-PE Core and a modified DMA core. For the sake of system stability, data transmission between PEs is handled by the proposed PE-to-PE Core rather than by direct access to remote memory. The DMA core moves data between the local memory in a PE and the external memory controller with an OCP interface. In addition, the Transmission Unit is software driven, so a Low-Level Communication (LLC) library is designed and proposed. The LLC library provides control of the Transmission Unit. Furthermore, the LLC library supports a software protocol that protects software developers from unexpected sequence errors. Based on the LLC software protocol, message passing libraries such as the iLib library can be implemented, and system performance can be evaluated by porting realistic parallel programs. Evaluating system performance usually takes millions of cycles. Complicated and realistic parallel programs such as Odd-Even Sort and JPEG encoding are ported to this platform, and application features such as total cycle count, memory behavior and communication cycle count can be collected.
Through these test cases, we can find bottlenecks in multi-core platforms, and this feedback benefits platforms that work at a high abstraction level, the so-called Electronic System Level (ESL), such as SystemC. Architecture exploration can proceed on the SystemC platform first, and the corresponding adjustments are then applied to the RTL platform. By taking advantage of the RTL implementation, our multi-core platform provides the exact cycle count of the system.
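To make the Odd-Even Sort workload named above more concrete, here is a minimal Python sketch of odd-even transposition sort with one value per simulated PE. It is not the thesis's code: the function name `odd_even_transposition_sort` and the one-value-per-PE layout are assumptions made for illustration, and the in-list swap only stands in for the neighbor-to-neighbor messages that, on the real platform, would travel through the PE-to-PE Core under the LLC or iLib libraries.

```python
def odd_even_transposition_sort(values):
    """Sort one value per simulated PE using odd-even transposition.

    Each list slot stands for the local memory of one PE. A compare-exchange
    between slots i and i+1 models two neighbouring PEs exchanging messages;
    it is NOT the thesis's LLC or iLib API (hypothetical illustration only).
    """
    pes = list(values)           # copy so the caller's data is untouched
    n = len(pes)
    for phase in range(n):       # n phases are sufficient for n elements
        start = phase % 2        # even phase: pairs (0,1),(2,3)...; odd: (1,2),(3,4)...
        for i in range(start, n - 1, 2):
            # Pairs within one phase are disjoint, so on parallel hardware
            # these compare-exchanges could all happen at the same time.
            if pes[i] > pes[i + 1]:
                pes[i], pes[i + 1] = pes[i + 1], pes[i]
    return pes


if __name__ == "__main__":
    # 16 values, matching the 16 PEs of the platform.
    data = [9, 3, 14, 0, 7, 12, 1, 15, 5, 10, 2, 8, 13, 4, 11, 6]
    print(odd_even_transposition_sort(data))
```

Because every phase only touches adjacent pairs, the communication pattern is strictly nearest-neighbor, which is one reason this kind of sort is a natural stress test for PE-to-PE transfers.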
We adopted TSMC 0.13μm CMOS technology to synthesize the proposed multi-core platform with an operating frequency of 100MHz. The area overhead of the proposed PE-to-PE Core is only 3.28% (19.2k gates) in a PE. Furthermore, the aggregate throughput of this platform is 952.64 Mbps.
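The synthesis figures above allow a quick back-of-the-envelope check, sketched here in Python. The derived numbers are not reported in the abstract; they assume that "19.2k gates" means 19,200 gates, that the 3.28% figure is a ratio of gate counts within one PE, and that the aggregate throughput was measured at the 100MHz clock. The helper names are hypothetical.

```python
# Figures quoted in the abstract.
PE2PE_GATES = 19_200             # "19.2k gates" (assumed to mean 19,200)
PE2PE_AREA_FRACTION = 0.0328     # "3.28%" of one PE
AGG_THROUGHPUT_BPS = 952.64e6    # "952.64 Mbps"
CLOCK_HZ = 100e6                 # "100MHz"


def implied_pe_gates(core_gates, fraction):
    """Gate count of a whole PE implied by the core's size and its area share."""
    return core_gates / fraction


def bits_per_cycle(throughput_bps, clock_hz):
    """Aggregate bits moved per clock cycle at the given clock rate."""
    return throughput_bps / clock_hz


if __name__ == "__main__":
    pe_gates = implied_pe_gates(PE2PE_GATES, PE2PE_AREA_FRACTION)
    per_cycle = bits_per_cycle(AGG_THROUGHPUT_BPS, CLOCK_HZ)
    print(f"Implied size of one PE: about {pe_gates / 1000:.0f}k gates")
    print(f"Aggregate throughput:   about {per_cycle:.2f} bits per cycle")
```

Under these assumptions a single PE works out to roughly 585k gates, and the platform as a whole moves about 9.5 bits per clock cycle.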
Our future work includes 1) optimizing the throughput and latency of the LLC library, 2) finishing the FPGA prototype, 3) changing the platform architecture to cluster-based processing elements, and 4) extracting more characteristics such as power, energy and memory behavior for ESL design. We can try to use the DMA to move data between the local memory and the PE-to-PE Core. Re-coding the LLC library in assembly language may also be useful. The most challenging issue, cache coherence, must be solved for cluster-based processing elements.