機器學習方法之複雜處理器編譯器設計｜國立清華大學博碩士論文庫

簡易檢索 / 詳目顯示

回結果列表

研究生：	唐宗麟 Chung-Lin Tang
論文名稱：	機器學習方法之複雜處理器編譯器設計 Code Generation for Complex Processors by Machine Learning
指導教授：	李政崑 Jenq-Kuen Lee 石維寬 Wei-Kuan Shih
口試委員:
學位類別：	碩士 Master
系所名稱：	電機資訊學院 - 資訊工程學系 Computer Science
論文出版年：	2005
畢業學年度：	93
語文別：	英文
論文頁數：	45
中文關鍵詞：	編譯器、超長指令集處理器、數位訊號處理器、機器學習
外文關鍵詞：	Compilers, Code Generation, VLIW, DSP Processors, Machine Learning
相關次數：	點閱：2 下載：0
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

本論文之研究主題為複雜架構處理器之編譯技術。此類複雜處理器，為達到現代嵌入式系統之高效能、低耗電的艱鉅需求，在架構設計上產生許多奇特之處：不規則之資料運算路徑設計、分散式暫存器、簡化之內部資料線路、等。這些設計反應在其指令集架構，對於編譯器之設計造成相當的挑戰。此論文中實作之部分是以Open Research Compiler (ORC) 開放原始碼編譯器為基礎。我們也將詳細介紹此論文中主題之複雜處理器，叫做Parallel Architecture Core (PAC)。
我們將檢視複雜架構之處理器的機器模型，發現在此類處理器架構之中，內部的硬體資源有呈現結構性特質，使得除了傳統的指令排程之外，運算資料的傳輸亦成為新的課題。我們將以PAC為主例，探討複雜處理器架構之程式編譯技術。我們將ORC編譯器，移植至PAC處理器架構，為其產生可執行程式碼。為了產生PAC之程式，我們將提出以機器學習做為程式最佳化之方法的演算法，並且在ORC中實作。我們將數個數位訊號處理的測試程式加以編譯，並執行於PAC架構之模擬器上，以觀察處理器架構及編譯器之效能。測試結果顯示我們的機器學習演算法，相較於早先實作之簡易演算法，有效的增加了所產生程式之處理效率，減少執行時間，幅度約35%至40%。

In this thesis, we describe a method of instruction scheduling and operand placement for complex processor architectures. Such processors possess irregular datapaths, multiple register banks, and incomplete internal connection networks, hindering the abilities of classical compilation techniques in generating efficient code. We discuss a port of the Open Research Compiler (ORC) to such a complex processor, a new VLIW DSP, called the Parallel Architecture Core (PAC). We examine PAC, observe and characterize the machine models of such architectures, and how they are different from contemporary processors. We find that for such architectures, the restrictions on operand transport,
represents a new class of machine resource models. Such resource models are due to incomplete interconnection networks in the design, making structural properties emerge in such processors. Using the PAC processor as an example,
we describe a machine learning method that formulates the problem state space as a storage mapping of data operands,
and generate code for the PAC by doing combined instruction scheduling and operand storage assignment. We evaluate our algorithm using a benchmark suite for DSP processors, and find our technique to obtain approximately 35% to 40% improvement over a prior simplistic algorithm.

Contents
Title Page
Abstract
Acknowledgments
1 Introduction
    1.1 Overview
    1.2 Outline of This Thesis
2 The Open Research Compiler
    2.1 Origins of ORC
    2.2 Overview of ORC/Open64
    2.3 Porting ORC
        2.3.1 Target Information Descriptions
        2.3.2 Code Expansion/Instruction Selection
3 The Parallel Architecture Core
    3.1 The PAC VLIW DSP Processor
    3.2 Clustered Design
    3.3 Port Constrained Register Files
    3.4 DSP Specific Features
    3.5 ORC Porting Issues
        3.5.1 Runtime Environment Registers
        3.5.2 Other Porting Issues
4 Generating Code for PAC
    4.1 Resources, Structure, and Parallelism
        4.1.1 Operand Placement and Instruction Scheduling
        4.1.2 Transport Resources
        4.1.3 Structural Processors
        4.1.4 Examination of Some Processors
    4.2 Machine Learning: Simulated Annealing
        4.2.1 Simulated Annealing
        4.2.2 Cluster Partitioning of Operations
        4.2.3 The Evaluation Function
    4.3 Reformulation of the Problem
        4.3.1 Optimizing Assignments: Transport Resources
        4.3.2 Optimizing Assignments: Operand Locations
        4.3.3 Register Allocation
    4.4 The Simulated Annealing Algorithm
    4.5 The Scheduler Algorithm
5 Evaluation
    5.1 Benchmark Environment
    5.2 DSPStone Results
6 Conclusion
    6.1 Summary
    6.2 Future Work
Bibliography

                                

[1] The GNU Compiler Collection.
http://gcc.gnu.org.
[2] Open Research Compiler for Itanium processor family.
http://ipf-orc.sourceforge.net.
[3] Open64 Compiler Tools.
http://open64.sourceforge.net.
[4] WHIRL Intermediate Language and Symbol Table Speci‾cation.
http://open64.sourceforge.net/documentation.html.
[5] Michael Chu, Kevin Fan, and Scott Mahlke. Region-based hierarchical operation
partitioning for multicluster processors. In PLDI '03: Proceedings of the ACM
SIGPLAN 2003 conference on Programming language design and implementa-
tion, pages 300{311, New York, NY, USA, 2003. ACM Press.
[6] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Ken-
neth Zadeck. E±ciently computing static single assignment form and the control
dependence graph. ACM Trans. Program. Lang. Syst., 13(4):451{490, 1991.
[7] Analog Devices. Black‾n Processor Instruction Set Reference.
http://www.analog.com/processors/processors/black‾n/.
[8] J. R. Goodman and W.C. Hsu. Code scheduling and register allocation in large
basic blocks. In ICS '88: Proceedings of the 2nd international conference on
Supercomputing, pages 442{452, New York, NY, USA, 1988. ACM Press.
[9] Texas Instruments. TMS320C6000 CPU and Instruction Set Reference Guide.
http://dspvillage.ti.com/.
[10] Krishnan Kailas, Ashok Agrawala, and Kemal Ebcioglu. CARS: A new code gen-
eration framework for clustered ILP processors. In HPCA '01: Proceedings of the
Seventh International Symposium on High-Performance Computer Architecture
(HPCA'01), page 133, Washington, DC, USA, 2001. IEEE Computer Society.
[11] Robert Kennedy, Sun Chan, Shin-Ming Liu, Raymond Lo, Peng Tu, and Fred
Chow. Partial redundancy elimination in SSA form. ACM Trans. Program. Lang.
Syst., 21(3):627{676, 1999.
[12] S. Kirkpatrick, C.D. Gerlatt Jr., and M.P. Vecchi. Optimization by simulated
annealing. Science, 220:671{680, 1983.
[13] Viktor S. Lapinskii, Margarida F. Jacome, and Gustavo A. De Veciana. Cluster
assignment for high-performance embedded VLIW processors. ACM Trans. Des.
Autom. Electron. Syst., 7(3):430{454, 2002.
[14] Rainer Leupers. Instruction scheduling for clustered VLIW DSPs. In IEEE
PACT, pages 291{300, 2000.
[15] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann.
E®ective compiler support for predicated execution using the hyperblock. In
25th Annual International Symposium on Microarchitecture, 1992.
[16] Srinivas Mantripragada, Suneel Jain, and James C. Dehnert. A new framework
for integrated global local scheduling. In IEEE PACT, pages 167{, 1998.
[17] Emre Ä Ozer, Sanjeev Banerjia, and Thomas M. Conte. Uni‾ed assign and sched-
ule: a new approach to scheduling for clustered register ‾le microarchitectures.
In MICRO 31: Proceedings of the 31st annual ACM/IEEE international sympo-
sium on Microarchitecture, pages 308{315, Los Alamitos, CA, USA, 1998. IEEE
Computer Society Press.
[18] M. Pincus. A Monte Carlo method for the approximate solution of certain types
of constrained optimization problems. Oper. Res., 18:1225{1228, 1970.
[19] B. Ramakrishna Rau. Iterative modulo scheduling: an algorithm for software
pipelining loops. In MICRO 27: Proceedings of the 27th annual international
symposium on Microarchitecture, pages 63{74, New York, NY, USA, 1994. ACM
Press.
[20] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzla®, Fae Gho-
drat, Ben Greenwald, Henry Ho®man, Paul Johnson, Jae-Wook Lee, Walter Lee,
Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen,
Matt Frank, Saman Amarasinghe, and Anant Agarwal. The Raw Microprocessor:
A Computational Fabric for Software Circuits and General-Purpose Programs.
IEEE Micro, 22(2):25{35, 2002.
[21] V. Zivojnovic, J.M. Velarde, C. SchlÄager, and H. Meyr. DSPstone: A DSP-
oriented benchmarking methodology. In Proceedings of Signal Processing Appli-
cations & Technology, 1994.

全文公開日期本全文未授權公開 (校內網路)
全文公開日期本全文未授權公開 (校外網路)

簡易檢索 / 詳目顯示

相關論文