Graduate student: |
林永嘉 Yung-Chia Lin |
---|---|
Thesis title: |
具分散式及非正規設計之超長指令集數位訊號處理器架構之編譯器設計與最佳化研究 Compilers for VLIW DSP Architectures with Distributed and Irregular Designs |
Advisor: |
李政崑 Jenq Kuen Lee |
Oral defense committee: | |
Degree: |
Doctor |
Department: |
College of Electrical Engineering and Computer Science, Department of Computer Science |
Year of publication: | 2007 |
Academic year of graduation: | 95 |
Language: | English |
Number of pages: | 92 |
Keywords (Chinese): | 編譯器 、數位訊號處理器 、超長指令集架構 、分散式架構 、非正規設計 、平行架構核心 |
Keywords (English): | compiler, DSP, VLIW, distributed architecture, irregular design, PAC |
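VLIW architectures have in recent years become the mainstream design for modern high-end processors that offer greater instruction-level parallelism and performance. Thanks to advances in VLSI technology, chips that are more powerful and faster than ever can now be designed, but complexity, die size, and power consumption have become additional issues to consider when designing a new VLIW processor. For the embedded-system market, a successful processor design must combine high performance, low power consumption, low cost, and a shorter time-to-market than its competitors. Consequently, some popular, elaborate, and fancy designs that can boost the performance of a general-purpose VLIW processor are not suitable for an embedded processor that also needs high performance.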
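In recent years, numerous register file architectures and irregular designs have been developed for embedded processors in order to reduce complexity, power consumption, and die size relative to traditional high-performance processor architectures. Because the compiler is generally regarded as the most important system-software component in determining whether a processor design succeeds, developing code generation and optimization techniques that effectively support architectures with such irregular designs is of great interest. Moreover, effective compiler support is needed on such VLIW architectures to raise programming efficiency.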
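In this dissertation, we present the results of our research on compiler design and optimization that effectively supports a novel VLIW digital signal processor with irregular designs. The target processor, called the Parallel Architecture Core DSP (PAC DSP), is designed with partitioned register files whose port access is highly restricted. In addition, the PAC DSP uses a heterogeneous distributed data-path architecture to achieve an efficient design with low complexity, small size, and potentially low power consumption. We believe the PAC DSP offers a promising architecture model that can, in practice, deliver the high parallelism that applications require while mitigating the growing problems of complexity, die size, and power consumption. The compiler techniques and experience we have developed for the PAC DSP may also benefit developers building compilers for similar architectures.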
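We describe how we used the Open Research Compiler infrastructure to build a concrete code-generation design for the irregular register file architecture of a novel VLIW DSP. We also introduce a new register allocation framework that effectively supports the generation of high-quality code on this architecture. We propose several register allocation methods that make effective use of the irregular register file architecture. In addition, we present other compiler techniques that support optimization on the PAC DSP.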
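All experimental results obtained with the compiler we developed for the PAC DSP show that our compiler techniques for this architecture significantly improve the performance of the generated code. Furthermore, our compiler makes more efficient use of the PAC DSP's special register file architecture and irregular designs.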
VLIW architectures have become the mainstream design for modern high-end processors in recent years, offering more instruction-level parallelism (ILP) and potential performance than traditional single-issue CISC/RISC machines. Advances in VLSI technology make it possible to build more powerful and faster chips than ever, but they also raise additional issues for the design of a new VLIW processor: complexity, die size, and power dissipation. In the embedded-system market, a successful processor design must provide not only ample performance but also low power consumption, low cost, and a short time-to-market. Therefore, some popular, fancy, and sophisticated design techniques that enhance the performance of a general-purpose VLIW processor may not be feasible for an embedded processor, even one that also demands high performance.
A wide variety of register file architectures and irregular designs have been developed for embedded processors in recent years with the aim of reducing complexity, power dissipation, and die size relative to the traditional architectures of high-performance processors. Because the compiler is generally regarded as the most important system-software component in making a processor design succeed, there has been considerable interest in developing techniques that effectively support code generation and optimization for such irregular architectures. Adequate compiler support is also essential for VLIW architectures so that programming efficiency can be improved substantially.
This dissertation contributes to the design and development of an effective compiler for a novel VLIW DSP with irregular designs. The target architecture, known as the PAC DSP core, has distinctively partitioned register files in which port access is highly restricted. The PAC DSP also uses a heterogeneous distributed data-path architecture to attain an efficient design with low complexity, small size, and potentially low power consumption. We believe that the PAC DSP is a promising architecture model that pragmatically supports the high parallelism demanded by DSP applications while curbing the growth of chip complexity, die size, and power dissipation. Our experience in designing compiler support for the PAC DSP may also interest those developing compilers for similar architectures with such irregular designs.
Our major contributions in this dissertation are as follows:
1. We present our application of the Open64/ORC infrastructure to a novel VLIW DSP and the specific design for handling its register file architecture. To overcome the new challenges of code generation for the PAC DSP, we have developed a new register allocation framework and other retargeting optimization phases that support the effective generation of high-quality code.
2. We propose a novel heuristic algorithm, named ping-pong aware local favorable (PALF) register allocation, which is intended to produce register allocations that better utilize irregular register file architectures. We also propose an alternative register allocation scheme based on simulated annealing (SA), and a hybrid optimization procedure that integrates PALF and SA. Furthermore, we present and discuss an associated global register allocation strategy.
3. We also discuss advanced topics in generating optimized code for PAC DSP architectures and describe their preliminary implementation in our compilation infrastructure.
The results of all experiments performed with our optimizing compiler, which is based on the Open Research Compiler (Open64/ORC), show significant performance improvement over the primitive code generation. Our preliminary experimental results also indicate that the compiler can efficiently utilize the specific register file architectures and irregular designs of the PAC DSP.
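To make the simulated-annealing idea in contribution 2 more concrete, the following is a minimal sketch of annealing applied to a toy register-bank assignment problem. It is not the dissertation's algorithm: the function names, the two-bank model, the capacity limit, and the cost function (one unit per operand pair split across banks, plus a penalty for exceeding a bank's capacity) are all assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def assignment_cost(assign, uses, num_banks, capacity, overflow_penalty=3):
    """Toy cost (hypothetical model, not taken from the dissertation).

    One unit for every operand pair placed in different banks (standing in
    for an inter-bank copy), plus a penalty for every value that exceeds
    a bank's capacity.
    """
    copies = sum(1 for a, b in uses if assign[a] != assign[b])
    counts = Counter(assign)
    overflow = sum(max(0, counts[bank] - capacity) for bank in range(num_banks))
    return copies + overflow_penalty * overflow


def anneal_bank_assignment(num_values, uses, num_banks=2, capacity=2,
                           steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Search for a low-cost value-to-bank assignment by simulated annealing."""
    rng = random.Random(seed)
    assign = [rng.randrange(num_banks) for _ in range(num_values)]
    current = assignment_cost(assign, uses, num_banks, capacity)
    best, best_cost = list(assign), current

    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))

        # Propose moving one randomly chosen value to a random bank.
        v = rng.randrange(num_values)
        old_bank = assign[v]
        assign[v] = rng.randrange(num_banks)
        candidate = assignment_cost(assign, uses, num_banks, capacity)
        delta = candidate - current

        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t) (Metropolis criterion).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = candidate
            if current < best_cost:
                best, best_cost = list(assign), current
        else:
            assign[v] = old_bank  # reject the move

    return best, best_cost


if __name__ == "__main__":
    # Values 0 and 1 are used together, as are values 2 and 3.
    pairs = [(0, 1), (2, 3)]
    banks, cost = anneal_bank_assignment(4, pairs)
    print("bank per value:", banks, "cost:", cost)
```

In this toy instance the capacity limit prevents all four values from sharing one bank, so the search has to trade off copies against overflow. A production allocator for an irregular register file would need a cost model derived from the actual port and data-path constraints, which this sketch does not attempt.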