Graduate Student: | Chen, Nan-Hsi (陳南熹) |
Thesis Title: | Performance Enhancement of Sorting Algorithms on Graphics Processing Units (在GPU上增進排序演算法的效能) |
Advisor: | Lee, Che-Rung (李哲榮) |
Committee Members: | Hung, Che-Lun (洪哲倫); Chou, Jerry (周志遠) |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of Publication: | 2012 |
Graduating Academic Year: | 100 |
Language: | Chinese |
Number of Pages: | 64 |
Chinese Keywords: | general-purpose GPU computing, sorting, overlapping computation and communication time |
Foreign Keywords: | stream, concurrency, communication time, computation time |
Sorting algorithms are an essential building block in many applications, and the Graphics Processing Unit (GPU) is no exception: various high-performance sorting algorithms have been implemented on GPUs over the years. However, as GPU computational power and data sizes grow, the performance gap between computation and communication enlarges exponentially, so the data transfer time between the host (CPU) and the device (GPU) becomes the performance bottleneck. Taking sorting algorithms as an example, when the data size exceeds 2^20, data movement accounts for more than 60% of the total execution time.
This thesis proposes a framework that uses the streams concurrency technique to overlap communication and computation time, thereby improving the performance of GPU sorting algorithms. First, the data are partitioned into several buckets of roughly equal size, such that every element in a bucket is greater than or equal to every element in the preceding bucket. Second, the data in each bucket are sorted with any GPU sorting algorithm. Finally, sorting and data output are overlapped to hide the communication time. The main challenge of this framework is in the first step: partitioning the data into ordered buckets of roughly equal size; the sample sort algorithm is used to solve this problem.
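As an illustration of how this partitioning step can be realized on the GPU, the following CUDA sketch assigns each element a bucket index by binary-searching a sorted splitter array; the splitters themselves would come from sorting a random sample of the input, as in sample sort. The kernel name and parameters are illustrative assumptions, not the thesis's actual implementation.

```cuda
// Hypothetical sketch of the bucket-assignment step (not the thesis's exact code).
// Given num_splitters sorted splitters, element v is assigned to bucket
// "number of splitters <= v", so every value in bucket b is >= every value
// in bucket b-1.
__global__ void assign_buckets(const unsigned int *data, int n,
                               const unsigned int *splitters, int num_splitters,
                               int *bucket_id)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int v = data[i];
    int lo = 0, hi = num_splitters;       // binary search: upper bound over splitters
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (splitters[mid] <= v) lo = mid + 1;
        else                     hi = mid;
    }
    bucket_id[i] = lo;                    // bucket index in 0 .. num_splitters
}
```

A prefix sum over the per-bucket counts then gives each bucket's output offset, and a scatter pass moves the elements into contiguous, ordered buckets; choosing splitters from a sufficiently large random sample keeps the bucket sizes roughly equal.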
The framework was applied to three algorithms to verify its effectiveness: radix sort, merge sort, and bitonic sort. Experimental results show that for n = 2^28, radix sort gains nearly a 25% performance improvement, merge sort 8%, and bitonic sort up to 8.32%.
Efficient implementations of various sorting algorithms on the Graphics Processing Unit (GPU) have been studied for years, owing to their importance in many applications. However, as the computational power of GPUs and the data size increase, the performance gap between computation and communication enlarges exponentially. As a result, the data movement between the host (CPU) and the device (GPU) becomes the performance bottleneck. For sorting algorithms, data movement can take over 60% of the total execution time when the data size is larger than 2^20 on an NVIDIA Fermi C2070.
In this thesis, we propose a framework to enhance the performance of GPU sorting algorithms, which utilizes the streams concurrency technique to overlap communication and computation time. First, the data are partitioned into buckets. Each bucket has roughly the same size, and the buckets are ordered: every element in a bucket is greater than or equal to every element in the preceding bucket. Second, the data in each bucket are sorted separately on the GPU using any preferred sorting algorithm. Last, sorting and data output are overlapped to hide the communication time. The major challenge of this framework lies in the first step: partitioning the data into ordered buckets of roughly equal size. The sample sort algorithm is employed to solve this problem.
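The following CUDA sketch illustrates how steps two and three can be overlapped, assuming the buckets already lie contiguously in pinned host memory and each bucket size is a power of two that is a multiple of the block size. Each bucket's input copy, sort (here a textbook global-memory bitonic network), and output copy are issued on one of two streams, so the device-to-host transfer of one bucket can overlap the sorting of the next. The function name sort_and_output, the two-stream round-robin, and the per-bucket bitonic sort are assumptions for illustration only; in the thesis any GPU sorting algorithm can be plugged in.

```cuda
#include <cuda_runtime.h>

// Classic bitonic compare-exchange step over global memory; one launch per (k, j) stage.
__global__ void bitonic_step(unsigned int *d, unsigned int j, unsigned int k)
{
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ixj = i ^ j;
    if (ixj > i) {
        bool ascending = ((i & k) == 0);
        if ((ascending && d[i] > d[ixj]) || (!ascending && d[i] < d[ixj])) {
            unsigned int t = d[i]; d[i] = d[ixj]; d[ixj] = t;
        }
    }
}

// Hypothetical pipeline: for each bucket, the H2D copy, the sort kernels, and the
// D2H copy are issued on one of NUM_STREAMS streams, so the output transfer of
// bucket b can overlap the sorting of bucket b+1 (step three of the framework).
// Assumes h_pinned is page-locked and bucket_len is a power of two >= 256.
void sort_and_output(unsigned int *h_pinned, size_t bucket_len, int num_buckets)
{
    const int NUM_STREAMS = 2;
    cudaStream_t stream[NUM_STREAMS];
    unsigned int *d_buf[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_buf[s], bucket_len * sizeof(unsigned int));
    }

    dim3 block(256), grid((unsigned int)(bucket_len / 256));
    for (int b = 0; b < num_buckets; ++b) {
        int s = b % NUM_STREAMS;
        unsigned int *h_bucket = h_pinned + (size_t)b * bucket_len;

        cudaMemcpyAsync(d_buf[s], h_bucket, bucket_len * sizeof(unsigned int),
                        cudaMemcpyHostToDevice, stream[s]);
        for (unsigned int k = 2; k <= bucket_len; k <<= 1)     // bitonic sort stages
            for (unsigned int j = k >> 1; j > 0; j >>= 1)
                bitonic_step<<<grid, block, 0, stream[s]>>>(d_buf[s], j, k);
        cudaMemcpyAsync(h_bucket, d_buf[s], bucket_len * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```

Pinned (page-locked) host memory is what allows cudaMemcpyAsync to proceed concurrently with kernels on other streams; with pageable memory the copies behave synchronously and the communication/computation overlap disappears.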
Three sorting algorithms were implemented to validate the effectiveness of this framework: radix sort, merge sort, and bitonic sort. Experiments show that a nearly 25% performance improvement can be obtained for radix sort when n = 2^28. For merge sort, the improvement is 8%, and for bitonic sort, up to 8.32% can be achieved.