A Register Allocation Algorithm for Shared and Clustered Hybrid Register Files Organization


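Graduate student:	許哲宇 Jer-Yu Hsu
Thesis title:	A Register Allocation Algorithm for Shared and Clustered Hybrid Register Files Organization
Advisor:	鍾葉青 Yeh-Ching Chung
Committee members:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2006
Academic year of graduation:	94 (ROC calendar)
Language:	English
Number of pages:	31
Keywords (translated from Chinese):	VLIW processor, clustered processor architecture, register file architecture, register allocation algorithm
Keywords (English):	VLIW Architecture, Clustered Processor Architecture, Register File Architecture, Register Allocation Algorithm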

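For VLIW processor design, a clustered register file architecture offers better hardware efficiency, but it adds inter-cluster communication overhead to the execution cycles. We propose a shared and clustered hybrid register file (SCRF) architecture and an SCRF register allocation algorithm to reduce this overhead. The SCRF architecture consists of one shared register file and multiple clustered register files. Placing heavily used variables in the shared register file effectively reduces the inter-cluster communication overhead. The SCRF register allocation algorithm exploits the properties of the SCRF architecture to reduce both inter-cluster communication and register spill overhead. We implemented the SCRF architecture and the SCRF register allocation algorithm in Trimaran, a compilation and simulation environment, and used the mediabench benchmark programs to analyze the effect of the SCRF architecture on execution cycles and code size. According to the experimental results, the SCRF architecture outperforms the clustered register file architecture on every metric: execution cycles, inter-cluster communication overhead, register spill overhead, and code size are reduced by 11.6%, 55.6%, 52.7%, and 18.2% on average, respectively.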

In VLIW processor design, clustered architecture has become a popular way to improve hardware efficiency, but inter-cluster communication (ICC) adds overhead to the execution cycles. In this thesis, we propose a shared cluster register file (SCRF) architecture and an SCRF register allocation algorithm to reduce the ICC overhead. The SCRF architecture is a hybrid register file (RF) organization composed of a shared RF (SRF) and clustered RFs (CRFs). By placing frequently used variables that need ICCs in the SRF, we reduce the number of data transfers between clusters and thus the ICC overhead. The SCRF register allocation algorithm, a heuristic based on graph coloring, exploits this architectural feature to reduce ICC and balance spill code. To evaluate the proposed architecture and algorithm, we simulate the commonly used two-cluster architecture with and without the SRF scheme on Trimaran, a compiler framework, using a set of multimedia programs from mediabench as benchmarks. The simulation results show that the SCRF architecture performs better than the clustered RF architecture for all test programs on all measured metrics. With macro registers defined in the SRF, the SCRF architecture reduces the execution cycles, the ICC overhead, the spill code overhead, and the code size by 11.6%, 55.6%, 52.7%, and 18.2% on average, respectively.
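The abstract gives only the outline of the allocation idea, so the following is a minimal sketch of that outline, not the thesis's four-phase algorithm. It assumes a simple greedy coloring: variables used by more than one cluster compete for the shared RF in order of use count, and everything else is colored in the clustered RF of the cluster that uses it most, with uncolorable variables spilled. All names (`greedy_color`, `allocate`, `uses`, `interference`, `srf_size`, `crf_size`) and the example data are hypothetical.

```python
def greedy_color(order, interference, num_regs, assigned):
    """Color variables in `order` within one register file.

    Each variable gets the lowest register not held by an interfering
    variable already placed in the same file (`assigned`). Variables
    that cannot be colored are returned to the caller.
    """
    failed = []
    for var in order:
        taken = {assigned[n] for n in interference.get(var, ()) if n in assigned}
        free = [r for r in range(num_regs) if r not in taken]
        if free:
            assigned[var] = free[0]
        else:
            failed.append(var)
    return failed


def allocate(uses, interference, srf_size, crf_size):
    """Hypothetical two-step SRF/CRF allocation.

    uses: {var: {cluster_id: use_count}}
    interference: {var: set of vars live at the same time}
    Returns (srf, crfs, spilled).
    """
    def total(var):
        return sum(uses[var].values())

    def by_heat(var):
        # Most-used first; name breaks ties so the result is deterministic.
        return (-total(var), var)

    # Step 1: a variable used by several clusters would need ICC copies,
    # so it is a candidate for the shared register file.
    candidates = sorted((v for v in uses if len(uses[v]) > 1), key=by_heat)
    srf = {}
    greedy_color(candidates, interference, srf_size, srf)

    # Step 2: everything not in the SRF goes to the cluster that uses it most.
    per_cluster = {}
    for var in sorted((v for v in uses if v not in srf), key=by_heat):
        home = max(sorted(uses[var]), key=lambda c: uses[var][c])
        per_cluster.setdefault(home, []).append(var)

    crfs, spilled = {}, []
    for cluster, variables in per_cluster.items():
        crfs[cluster] = {}
        spilled += greedy_color(variables, interference, crf_size, crfs[cluster])
    return srf, crfs, spilled


# Made-up example: two clusters, five variables.
example_uses = {
    "a": {0: 5, 1: 4},
    "b": {0: 3, 1: 1},
    "c": {0: 2},
    "d": {1: 6},
    "e": {0: 1},
}
example_interference = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "e"},
    "d": {"a"},
    "e": {"c"},
}
print(allocate(example_uses, example_interference, srf_size=1, crf_size=1))
```

With one shared register and one register per cluster, the most heavily shared variable `a` takes the shared register, `b` falls back to cluster 0, and `c` is spilled because it interferes with `b`. Raising `crf_size` to 2 removes the spill. A real allocator would also have to account for the ICC copies that `b` still needs on cluster 1.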

TABLE OF CONTENTS
ABSTRACT (CHINESE)
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1  INTRODUCTION
CHAPTER 2  RELATED WORK
CHAPTER 3  THE ARCHITECTURE MODELS
3.1 THE CLUSTERED RF ARCHITECTURE
3.2 THE SCRF ARCHITECTURE
CHAPTER 4  THE SCRF REGISTER ALLOCATION ALGORITHM
4.1 PHASE 1
4.2 PHASE 2
4.3 PHASE 3
4.4 PHASE 4
4.5 AN EXAMPLE TO ILLUSTRATE THE SCRF REGISTER ALLOCATION
4.6 MACRO REGISTER ALLOCATION
CHAPTER 5  PERFORMANCE COMPARISONS
5.1 COMPARISONS OF THE EXECUTION CYCLES OF BENCHMARKS
5.2 COMPARISONS OF THE ICC OVERHEAD OF BENCHMARKS
5.3 COMPARISONS OF THE SPILL CODES OVERHEAD OF BENCHMARKS
5.4 COMPARISONS OF THE CODE SIZE OF BENCHMARKS
CHAPTER 6  CONCLUSIONS
REFERENCES

                                


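Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)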
