
Graduate student: Lin, Yu-Te (林餘德)
Thesis title: An OpenCL Compiler Framework with Vector Data Flow Analysis for SIMD Optimizations on CPUs+GPUs
Advisor: Lee, Jenq-Kuen (李政崑)
Committee members: 游逸平, 黃冠寰, 許雅三, 黃慶育, 蘇泓萌, 陳鵬升
Degree: Doctor
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of publication: 2016
Graduation academic year: 104 (ROC calendar)
Language: English
Number of pages: 100
Keywords (Chinese, translated): compiler, graphics processing unit, vector computation, vector parsing, data flow analysis
Keywords (English): OpenCL compiler, GPU, SIMD optimization, vector parsing, data flow analysis
Abstract (Chinese, translated):

    Heterogeneous multi-core platforms are widely used in embedded systems and high-performance computing. Because such platforms integrate processors with different architectures on a single platform, they require advanced software development frameworks to support software development. The Open Computing Language (OpenCL) is a software development framework widely used on heterogeneous multi-core platforms today. Within OpenCL, the back-end compiler plays a crucial role: it compiles user programs into executables for processors of multiple architectures. Most common OpenCL toolchains currently use LLVM as their back-end compiler, which raises an interesting question: can OpenCL be implemented efficiently on other compilers? Implementing OpenCL on other compilers promises further innovation, both for academic research and for the development of OpenCL itself. In addition, the vector operations natively supported by OpenCL call for data flow analysis methods that differ from conventional ones.

    This dissertation presents a method for supporting OpenCL on the Open64 compiler. Open64 comes with many well-known optimization techniques, and supporting OpenCL on Open64 gives these optimizations the opportunity to be applied to OpenCL programs. The proposed approach covers the major parts of the Open64 compiler, namely the front-end, middle-end, and back-end. After implementing OpenCL support, we also implement on Open64 a data flow analysis for OpenCL vector operations and, based on this analysis, further propose compiler optimization techniques suited to OpenCL vector operations.

    Finally, we conducted a series of experiments. The results show that the OpenCL compiler built on Open64 can successfully compile and run multiple benchmarks from the AMD APP SDK, and that the proposed compiler optimizations for OpenCL vector operations deliver performance improvements of 22% on x86 CPUs and 4% on AMD GPUs. These results demonstrate that using Open64 as an OpenCL compiler is feasible and that Open64 can also be used to develop OpenCL-related compiler optimizations.


Abstract (English):

    The use of heterogeneous multi-core platforms for both embedded and high-performance computing is becoming widespread. Because these platforms integrate processors of different types, they require novel frameworks to support software development for them. Open Computing Language (OpenCL) is a commonly used framework for programming heterogeneous multi-core platforms. One of the most important parts of OpenCL is the back-end compiler, which compiles OpenCL programs for different processors. Most OpenCL compilers currently utilize LLVM as their compiler infrastructure, which presents an interesting question: can OpenCL be effectively implemented on other compiler infrastructures? Supporting OpenCL on other compiler infrastructures could provide opportunities to incorporate more academic innovations into the development of OpenCL and its applications. Supporting the single-instruction multiple-data (SIMD) vector linguistics of OpenCL also requires special compiler data flow analysis to meet the optimization requirements.
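
    To make the analysis requirement concrete, the following is a minimal, hypothetical OpenCL C fragment (illustrative only, not taken from the dissertation; all identifiers are invented). Its two component-wise stores write disjoint halves of the same vector, so an analysis that treats the vector as one indivisible value would report a false dependence between them and inhibit SIMD optimizations:

    /* Illustrative OpenCL C kernel; a sketch, not the dissertation's code. */
    __kernel void vec_example(__global float4 *out, __global const float4 *in)
    {
        size_t gid = get_global_id(0);
        float4 u = in[gid];
        float4 v;

        v.lo = u.hi * 2.0f;   /* writes v.s0, v.s1; reads u.s2, u.s3 */
        v.hi = u.lo + 1.0f;   /* writes v.s2, v.s3; reads u.s0, u.s1 */

        out[gid] = v;         /* all four components of v are defined above */
    }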

    Here we describe a method for building an OpenCL compiler on the Open64 compiler infrastructure that targets AMD graphics processing units (GPUs) and x86 CPUs. Open64 is equipped with many legacy compiler optimizations; supporting OpenCL on Open64 provides the potential for these optimizations to be applied to OpenCL programs. The required procedures are detailed herein for the front-end, middle-end, and back-end of the Open64 compiler. We then propose a calculus framework for the data flow analysis of vector constructs in OpenCL programs, which compilers can use to perform SIMD optimizations. We model OpenCL vector operations as data access functions in the style of mathematical functions and show that data flow analysis for OpenCL vector linguistics can be performed on the basis of these data access functions. Based on the information gathered by this analysis, we illustrate a set of SIMD optimizations for OpenCL programs.
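
    As a sketch of the idea (the notation below is ours for illustration; the dissertation's own notation may differ), a swizzled assignment such as a.s02 = b.s31; can be modeled by a data access function that records which component of b each written component of a reads, together with the set of components it defines:

    % Hypothetical notation, for illustration only.
    \[
      f_{a \leftarrow b} = \{\, 0 \mapsto 3,\; 2 \mapsto 1 \,\}, \qquad
      \operatorname{def}(f_{a \leftarrow b}) = \{0, 2\}.
    \]

    A later statement a.s13 = c.s02; defines components {1, 3}; because {0, 2} and {1, 3} are disjoint, it kills nothing written by the first statement, so set-style operations over such functions (intersection, union, complement, kill) can drive a component-wise data flow analysis.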

    Preliminary experimental results demonstrate that the Open64-based OpenCL compiler can successfully compile fifteen benchmarks from the AMD APP SDK, and executing the compiled programs on the AMD GPU platform produces correct results. Experiments incorporating our calculus and the proposed compiler optimizations show that the proposed SIMD optimizations provide average performance improvements of 22% on x86 CPUs and 4% on AMD GPUs. Of the fifteen selected benchmarks, eleven are improved on x86 CPUs and six on AMD GPUs. These results demonstrate the potential to adopt Open64 as an alternative OpenCL compiler as well as to develop OpenCL SIMD optimizations on Open64.

Table of contents:

    Abstract (Chinese)
    Abstract
    Acknowledgements
    1 Introduction
      1.1 OpenCL and compiler infrastructures
      1.2 Related work
      1.3 Dissertation organization
    2 Background
      2.1 OpenCL overview
        2.1.1 OpenCL platform model
        2.1.2 OpenCL execution model
        2.1.3 OpenCL memory model
        2.1.4 OpenCL programming language
      2.2 Open64 compiler infrastructure
        2.2.1 Compiler front-end of Open64
        2.2.2 WHIRL IR
        2.2.3 Code generator intermediate representation
      2.3 The AMD GPU platform
        2.3.1 AMD IL
        2.3.2 Register swizzles
        2.3.3 BIF
    3 Compiler for General-Purpose Computation on GPUs
      3.1 Compiler overview
      3.2 Program transformation methodology
      3.3 Compiler implementation
    4 OpenCL Compiler Design and Implementation
      4.1 Compiler front-end
        4.1.1 Building tree nodes for OpenCL vector data types
        4.1.2 Initializing vector variables
        4.1.3 Referencing vector components
        4.1.4 Parsing vector binary expressions
        4.1.5 Address and function qualifier handling
      4.2 Compiler middle-end
        4.2.1 Lowering flow-control statements
        4.2.2 Lowering direct load/store operations of OpenCL vectors
      4.3 Compiler back-end
        4.3.1 CGIR extension
        4.3.2 Stack layout
        4.3.3 Register allocation
        4.3.4 Basic compiler optimizations
    5 Data Flow Analysis of OpenCL Vectors
      5.1 Motivating examples
      5.2 Data access function of OpenCL vectors
      5.3 Data flow analysis of data access functions
        5.3.1 Intersection of data access functions
        5.3.2 Union of data access functions
        5.3.3 Complement of data access functions
        5.3.4 Canonicalization of data access functions
        5.3.5 Kill operation of data access functions
    6 Compiler Optimizations for OpenCL Vectors
      6.1 Dependence analysis of OpenCL vectors
      6.2 SIMD optimizations for OpenCL programs
        6.2.1 Vector aggregation
        6.2.2 Vector copy propagation
        6.2.3 Vector common sub-expression elimination
    7 Evaluations and Discussions
      7.1 Correctness and basic performance evaluations
        7.1.1 Environment
        7.1.2 Evaluation results
      7.2 Experiments of the proposed OpenCL vector optimizations
        7.2.1 Experimental setup
        7.2.2 Performance results
      7.3 Discussion
        7.3.1 Function inlining
        7.3.2 Branch divergence
    8 Conclusion
    Bibliography


    Full-text availability: not authorized for public release (campus and off-campus networks).
