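Graduate student: | 張皓鈞 Chang, Hao-Chun |
---|---|
Thesis title: | 支援 VLIW RVV 架構之 LLVM 編譯器及最佳化 LLVM Compiler for VLIW RVV Architectures and Optimizations |
Advisor: | 李政崑 Lee, Jenq-Kuen |
Oral defense committee: | 關啟邦 Kuan, Chi-Bang; 陳鵬升 Chen, Peng-Sheng |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2023 |
Academic year of graduation: | 112 |
Language: | English |
Number of pages: | 29 |
Keywords (Chinese): | Compiler, Vectorization |
Keywords (English): | Modulo Scheduling |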
The RISC-V instruction set has gained popularity in recent years. As an instruction set architecture (ISA), it not only lets developers design custom instructions but can also potentially support VLIW encoding. VLIW, which stands for Very Long Instruction Word, denotes an architecture that executes a group of instructions concurrently, exploiting instruction-level parallelism to improve performance. Applications composed largely of tensor operations, such as multimedia and computer vision, are frequently optimized by deploying them on digital signal processors (DSPs) that adopt VLIW architectures. This thesis focuses on building an LLVM compiler for RISC-V VLIW architectures to explore the potential of such configurations. We propose a 6-way RISC-V VLIW architecture comprising two integer units, two floating-point or double-precision units, and two vector units. The thesis concentrates on establishing a RISC-V VLIW target within the LLVM backend, demonstrating how to implement an LLVM backend target and the steps necessary for generating VLIW assembly code. Beyond the backend implementation, we also integrate Modulo Scheduling for our target. When generating RISC-V Vector Extension (RVV) code, we use the LMUL feature to group multiple vector registers together, thereby increasing the vector length. However, there is a trade-off between the increased vector length and potential register spilling during scheduling: as more vector registers are grouped together, fewer register resources remain available, which increases the likelihood of register spilling during the scheduling phase. To address this, we developed a heuristic equation that estimates a suitable LMUL for a given loop. Our experiments compare the performance of the proposed VLIW RVV architecture with and without Swing Modulo Scheduling enabled. Based on our experiments with the DSP Stone benchmark, we observed performance improvements ranging from 0.2% to 35%.
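The two constraints the abstract describes, issue-slot pressure on the 6-way machine and register pressure under LMUL grouping, can be illustrated with a minimal sketch. This is not the thesis's heuristic equation, which the abstract does not state. The sketch assumes the unit counts given above (two integer, two floating-point, two vector), the 32 architectural vector registers and the integer LMUL values 1, 2, 4, 8 defined by RVV, and the standard resource-constrained lower bound on the initiation interval used in modulo scheduling. The function names and the simple "live values must fit in the register groups" rule are hypothetical.

```python
from math import ceil

# Issue slots per bundle, taken from the abstract: 2 integer, 2 FP, 2 vector.
UNITS = {"int": 2, "fp": 2, "vec": 2}

# RVV architectural facts: 32 vector registers; integer LMUL values 1, 2, 4, 8.
NUM_VREGS = 32
LMUL_CHOICES = (8, 4, 2, 1)  # tried from largest to smallest


def res_mii(op_counts):
    """Resource-constrained lower bound on the initiation interval (ResMII).

    op_counts maps a unit kind ("int", "fp", "vec") to the number of
    operations of that kind in one loop iteration.
    """
    bound = max(ceil(op_counts.get(kind, 0) / slots)
                for kind, slots in UNITS.items())
    return max(1, bound)  # an iteration needs at least one cycle


def largest_lmul_without_spill(live_vector_values):
    """Largest LMUL whose register groups can hold all live vector values.

    Hypothetical stand-in for the thesis's heuristic: with LMUL = m only
    32 // m register groups exist, so more simultaneously live values
    than that would force a spill.
    """
    for lmul in LMUL_CHOICES:
        if live_vector_values <= NUM_VREGS // lmul:
            return lmul
    return None  # even LMUL = 1 cannot hold every live value
```

For a loop body with 3 integer, 1 floating-point, and 5 vector operations, `res_mii` gives 3 cycles, because the five vector operations need three bundles on two vector units. A loop that keeps 10 vector values live at once fits in the 16 register groups available at LMUL = 2 but not in the 8 groups at LMUL = 4, so `largest_lmul_without_spill(10)` returns 2.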