Graduate student: | 吳中如 Wu, Chung-Ju |
---|---|
Thesis title: | Global Optimizations in Compilers for VLIW DSP Processors with Distributed Register Files (應用於具有分散式暫存器組的超長指令集數位訊號處理器之全域性編譯器最佳化) |
Advisor: | 李政崑 Lee, Jenq Kuen |
Oral defense committee: | 李政崑, 賴尚宏, 黃慶育, 許雅三, 楊武, 游逸平, 陳鵬升, 陳呈瑋 |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2015 |
Academic year of graduation: | 103 |
Language: | English |
Number of pages: | 62 |
Keywords: | compiler, VLIW architecture, global optimization, distributed register files, register spilling, instruction scheduling |
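Chinese Abstract (translated)
Digital signal processors (DSPs) with very long instruction word (VLIW) architectures are increasingly used in embedded systems with multimedia requirements. When developing a new VLIW DSP, engineers typically have to weigh design complexity, die size, and power consumption. Conventional, widely used designs are therefore often a poor fit for embedded systems, and distributed, multi-bank register file designs are increasingly adopted instead in order to reduce the number of read/write ports. Although such varied register file architectures and irregular designs can meet high-performance and low-power requirements, they pose a major challenge for compiler optimization.
The goal of compiler optimization is to generate more efficient code, and optimizations can be broadly divided into local and global ones. Local optimizations usually operate only on small blocks of code, so irregular hardware designs affect them relatively little. Global optimizations, by contrast, usually scan the entire program and must make good use of limited hardware resources, so irregular hardware designs and constraints often keep them from achieving the expected effect.
The contribution of this dissertation is a discussion of how global compiler optimizations can be appropriately applied and implemented on VLIW DSPs with distributed register files. We take a real processor, the PAC DSP, as our example; its register file design is highly distributed and has strict read/write port constraints. By sharing our experience of developing a compiler for this VLIW DSP, we hope to offer a reference for those developing compilers for other VLIW DSPs.
The experiments use the Open64 compiler as the basis for developing a compiler for the PAC DSP. With our proposed methods incorporated, results on benchmarks such as EEMBC and MiBench show that, compared with conventional optimization methods, we do improve the effectiveness of global optimizations under this kind of register file design.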
Abstract
Digital signal processors (DSPs) with very long instruction word (VLIW) data-path architectures are increasingly being deployed on embedded devices for multimedia processing applications. When developing new VLIW DSP processors, engineers must take complexity, die size, and power dissipation into consideration, so some popular and traditional designs may not be feasible for embedded systems. Instead, distributed register files and multi-bank register architectures are being adopted to reduce the number of read/write ports associated with register files. Although this wide variety of register file architectures and irregular designs can meet high-performance and low-power criteria, it also presents challenges for devising compiler optimization schemes.
Compiler optimizations, which aim to make the generated code more efficient, can be conceptually classified into local and global optimizations. Local optimizations take place only within a small code fragment, so the impact of irregular designs on them is minor. Global optimizations, by contrast, usually traverse an entire procedure and try to utilize resources as effectively as possible, so irregular designs and distributed register files make it difficult for them to deliver the expected improvement.
This dissertation contributes to the development and discussion of global optimizations in compilers for a novel VLIW DSP with distributed register files. The target architecture, the PAC DSP core, is designed with distinctively banked register files and highly restricted port access. Our experience of developing global optimizations in compilers for the PAC DSP may also be of interest to those developing compilers for similar architectures.
Experiments were performed on the PAC VLIW DSP with distributed register files by incorporating our proposed optimization schemes into an Open64-based compiler. Several benchmarks, including EEMBC and MiBench, were used to evaluate the improvement gained from exploiting the features of this specific register file architecture. The results show that a VLIW DSP compiler using our global optimization schemes outperforms traditional strategies.