| 研究生 (Graduate Student) | Huang, Yong-Jyun (黃湧鈞) |
|---|---|
| 論文名稱 (Thesis Title) | Index Conversion on GPU for Butterfly Based Algorithm (應用於GPU之蝶形演算法資料排列) |
| 指導教授 (Advisor) | Hsu, Yar-Sun (許雅三) |
| 口試委員 (Committee Members) | Jong, Tai-Lang (鐘太郎); Chiu, Ching-Te (邱瀞德) |
| 學位類別 (Degree) | Master |
| 系所名稱 (Department) | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| 論文出版年 (Year of Publication) | 2017 |
| 畢業學年度 (Academic Year, ROC calendar) | 106 |
| 語文別 (Language) | English |
| 論文頁數 (Number of Pages) | 53 |
| 中文關鍵詞 (Chinese Keywords) | 資料排列 (data layout) |
| 外文關鍵詞 (Keywords) | data layout, butterfly, index permutation |
Signal processing algorithms now play an important role in many fields, such as wireless communication, image processing, and spectrum analysis. As computational demands grow, many studies have proposed fast algorithms for various signal processing transforms, and the use of GPUs has further spurred research on parallelized fast algorithms. Many of these fast algorithms, including the Fourier transform, the Walsh-Hadamard transform, the Hartley transform, and the Haar transform, share an underlying data layout problem, which most studies handle with extra processing before or after the transform. Notably, [2] implements an impressive library that uses defined operators and scheduling to merge the data layout work into the transform algorithm itself.
[1] uses a hardware approach to solve the data layout problem in heterogeneous systems, and this thesis applies the concept of [1] to the latent problem of the fast algorithms described above. During index conversion we use the ping-pong transfer mode of [1] to minimize the latency that the conversion introduces. For the data layouts, we implement three modes: bit-reversal order, natural Hadamard order, and the data order of the Fast Haar Transform, and we simulate the speedup of six algorithms on the GPU once the data layout problem is handled in hardware.
In our experiments, the speedup of a signal processing algorithm is proportional to the complexity of its data layout, but the speedup decreases when the transform portion of the algorithm is computationally heavier. Among the selected algorithms, the worst speedup is for the Fast Fourier Transform, falling between 7.12% and 11.2%; the best is for the Fast Walsh-Hadamard Transform, falling between 15.28% and 40.78%.
Signal processing algorithms play an important role in many fields nowadays, such as wireless communication, image processing, and spectrum analysis. As computational demands increase, many studies have proposed fast algorithms for these transforms, and evolving GPU technology further encourages researchers to parallelize the existing fast algorithms. However, fast algorithms such as the Fast Fourier Transform (FFT), the Fast Walsh-Hadamard Transform (FWHT), the Fast Hartley Transform, and the Fast Haar Transform suffer from an index permutation issue, which the programmer usually handles before or after the transform. Notably, [2] proposes an impressive library solution that uses defined operators and scheduling to perform the index permutation and merge it into the transform algorithm.
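As a concrete illustration of such a pre-transform permutation, the sketch below (plain C, written for this summary rather than taken from the thesis) reorders an array into the bit-reversed order that a radix-2 decimation-in-time FFT consumes; the length n is assumed to be a power of two.

```c
#include <stddef.h>

/* Reverse the lowest `bits` bits of index i, e.g. for n = 8 (bits = 3),
 * index 3 (011b) maps to 6 (110b). */
static size_t bit_reverse(size_t i, unsigned bits)
{
    size_t r = 0;
    for (unsigned b = 0; b < bits; ++b)
        r = (r << 1) | ((i >> b) & 1u);  /* shift the next low bit of i into r */
    return r;
}

/* In-place bit-reversal permutation of an array of length n = 2^bits.
 * Each pair (i, bit_reverse(i)) is swapped exactly once. */
void bit_reversal_permute(float *data, size_t n, unsigned bits)
{
    for (size_t i = 0; i < n; ++i) {
        size_t j = bit_reverse(i, bits);
        if (j > i) {                      /* swap each pair only once */
            float tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

This is the host-side, software version of the reordering; the point of the thesis is to move exactly this kind of work out of the GPU kernels and into an index converter.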
[1] adopts a hardware approach to tackle the data layout issue between heterogeneous processors, and we apply the concept of [1] to the index permutation issue of the fast algorithms above. We adopt the ping-pong transpose mechanism of [1] to hide the latency caused by index conversion. We implement three types of index permutation: bit-reversal order, natural Hadamard order, and the index order of the Fast Haar Transform. Finally, we show the speedup of six chosen algorithms on the GPU after eliminating the index permutation problem with our index converter.
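For the natural Hadamard order, one commonly used mapping between naturally ordered (Sylvester) Hadamard rows and sequency-ordered (Walsh) rows combines a bit reversal with a Gray-code conversion; the C sketch below illustrates that mapping and is offered only as an example, since the abstract does not spell out the exact permutation the thesis implements.

```c
#include <stdio.h>

/* Reverse the lowest `bits` bits of index i. */
static unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; ++b)
        r = (r << 1) | ((i >> b) & 1u);
    return r;
}

/* Convert a Gray-coded value to plain binary. */
static unsigned gray_to_binary(unsigned g)
{
    unsigned b = 0;
    for (; g; g >>= 1)
        b ^= g;
    return b;
}

/* Sequency (Walsh) index of row h of a naturally ordered (Sylvester)
 * Hadamard matrix of size 2^bits. */
unsigned natural_to_sequency(unsigned h, unsigned bits)
{
    return gray_to_binary(bit_reverse(h, bits));
}

int main(void)
{
    /* For an 8-point transform this prints: 0 7 3 4 1 6 2 5 */
    for (unsigned h = 0; h < 8; ++h)
        printf("%u ", natural_to_sequency(h, 3));
    printf("\n");
    return 0;
}
```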
In our experiments, the speedup of a signal processing algorithm is proportional to the complexity of its index permutation and inversely proportional to the execution time of the fast transform itself. Among the selected benchmarks, the worst case is the FFT, with speedups ranging from 7.12% to 11.2%, and the best case is the FWHT, with speedups ranging from 15.28% to 40.78%.
[1] Chia-Chen Hsu, Cheng-Yen Lin, Shin Kai Chen, Chih-Wei Liu, and Jenq-Kuen Lee, “Optimized Memory Access Support for Data Layout Conversion on Heterogeneous Multi-core Systems,” in 2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia), 2014.
[2] Jacobo Lobeiras, Margarita Amor, and Ramón Doallo, “Designing Efficient Index-Digit Algorithms for CUDA GPU Architectures,” in IEEE Transactions on Parallel and Distributed Systems, 2016.
[3] I-Jui Sung, Geng Daniel Liu, and Wen-Mei W. Hwu, “DL: A data layout transformation system for heterogeneous computing,” in Innovative Parallel Computing (InPar), 2012.
[4] Shuai Che, Jeremy W. Sheaffer, and Kevin Skadron, “Dymaxion: Optimizing memory access patterns for heterogeneous systems,” in SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Peter R. Roeser and M. E. Jernigan, “Fast Haar Transform Algorithms,” in IEEE Transactions on Computers, 1982.
[6] Yanwei Pang, Xuelong Li, Yuan Yuan, Dacheng Tao, and Jing Pan, “Fast Haar Transform Based Feature Extraction for Face Representation and Recognition,” in IEEE Transactions on Information Forensics and Security, 2009.
[7] Sreehari Ambuluri, “Implementations of the FFT algorithm on GPU,” 2012.
[8] Eladio Gutierrez, Sergio Romero, Maria A. Trenas, and Emilio L. Zapata, “Memory Locality Exploitation Strategies for FFT on the CUDA Architecture,” International Conference on High Performance Computing for Computational Science, 2008.
[9] Wen-Mei W. Hwu and David B. Kirk, “Programming Massively Parallel Processors: A Hands-on Approach,” Burlington, Morgan Kaufmann, 2010.
[10] Muthu Manikandan Baskaran et al., “A compiler framework for optimization of affine loop nests for gpgpus,” Proceedings of the 22nd annual international conference on Supercomputing, 2008.
[11] Jee W. Choi et al., “Model-driven autotuning of sparse matrix-vector multiply on GPUs,” Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010.
[12] Alexander Monakov et al., “Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures,” International Conference on High-Performance Embedded Architectures and Compilers, 2010.
[13] H.-J. Meckelburg and D. Lipka, “Fast Hartley transform algorithm,” Electronics Letters, 1985.
[14] Hsieh S. Hou, “The Fast Hartley Transform Algorithm,” IEEE Transactions on Computers, 1987.
[15] N. Swetha, Panyam Narahari Sastry, Y. Rajasree Rao, and Samrat L. Sabat, “Fast Sequency-Ordered Complex Hadamard Transform-Based Parzen Window Entropy Detection for Spectrum Sensing in Cognitive Radio Networks,” IEEE Communications Letters, 2016.
[16] Aye Aung, Boon Poh Ng, and Susanto Rahardja, “Conjugate Symmetric Sequency-Ordered Complex Hadamard Transform,” IEEE Transactions on Signal Processing, 2009.
[17] M. Kunt, “On Computation of the Hadamard Transform and the R Transform in Ordered Form,” IEEE Transactions on Computers, 1975.
[18] Geadah and Corinthios, “Natural, Dyadic, and Sequency Order Algorithms and Processors for the Walsh-Hadamard Transform,” IEEE Transactions on Computers, 1977.
[19] Soo-Chang Pei, Chia-Chang Wen, and Jian-Jiun Ding, “Conjugate Symmetric Discrete Orthogonal Transform,” IEEE Transactions on Circuits and Systems, 2014.
[20] Saad Bouguezel, M. Omair Ahmad, and M.N.S. Swamy, “An efficient algorithm for the conjugate symmetric sequency-ordered complex Hadamard transform,” Circuits and Systems (ISCAS), 2011.
[21] Jiasong Wu, Lu Wang, Guanyu Yang, Lotfi Senhadji, Limin Luo, and Huazhong Shu, “Sliding Conjugate Symmetric Sequency-Ordered Complex Hadamard Transform: Fast Algorithm and Applications,” IEEE Transactions on Circuits and Systems I: Regular Papers, 2012.
[22] Eduardo Colmenares and P. Andersen, “Parallel Computing and its Potential Benefits: A Complex Radix-2 FFT Case Study,” Proceedings of the 8th International Multi-Conference on Society, Cybernetics and Informatics (IMSCI), 2014.
[23] Yuh-Ming Huang, and Ja-Ling Wu, “A refined fast 2-D discrete cosine transform algorithm,” IEEE Transactions on Signal Processing, 1999.