Graduate Student: | Chen, Nan-Hsi (陳南熹) |
Thesis Title: | Performance Enhancement of Sorting Algorithms on Graphics Processing Units (在GPU上增進排序演算法的效能) |
Advisor: | Lee, Che-Rung (李哲榮) |
Committee Members: | Hung, Che-Lun (洪哲倫); Chou, Jerry (周志遠) |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of Publication: | 2012 |
Graduating Academic Year: | 100 |
Language: | Chinese |
Number of Pages: | 64 |
Chinese Keywords: | general-purpose GPU computing, sorting, overlapping computation and communication time |
Foreign Keywords: | stream, concurrency, communication time, computation time |
Sorting algorithms are an essential building block in many applications, and the Graphics Processing Unit (GPU) is no exception: various high-performance sorting algorithms have been implemented on GPUs over the years. However, as GPU computational power and data sizes grow, the performance gap between computation and communication enlarges exponentially, so the data transfer time between the host (CPU) and the device (GPU) becomes the performance bottleneck. Taking sorting algorithms as an example, when the data size exceeds 2^20, data movement accounts for more than 60% of the total execution time.
This thesis proposes a framework that uses the streams concurrency technique to overlap communication and computation time, thereby improving the performance of GPU sorting algorithms. First, the data are partitioned into several buckets of roughly equal size, such that every element in a bucket is greater than or equal to every element in the preceding bucket. Second, the data in each bucket are sorted with any GPU sorting algorithm. Finally, sorting and data output are overlapped to hide the communication time. The main challenge of this framework is in the first step: partitioning the data into ordered buckets of roughly equal size; the sample sort algorithm is used to solve this problem.
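As an illustration of how this partitioning step can be realized on the GPU, the following CUDA sketch assigns each element a bucket index by binary-searching a sorted splitter array; the splitters themselves would come from sorting a random sample of the input, as in sample sort. The kernel name and parameters are illustrative assumptions, not the thesis's actual implementation.

```cuda
// Hypothetical sketch of the bucket-assignment step (not the thesis's exact code).
// Given num_splitters sorted splitters, element v is assigned to bucket
// "number of splitters <= v", so every value in bucket b is >= every value
// in bucket b-1.
__global__ void assign_buckets(const unsigned int *data, int n,
                               const unsigned int *splitters, int num_splitters,
                               int *bucket_id)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int v = data[i];
    int lo = 0, hi = num_splitters;       // binary search: upper bound over splitters
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (splitters[mid] <= v) lo = mid + 1;
        else                     hi = mid;
    }
    bucket_id[i] = lo;                    // bucket index in 0 .. num_splitters
}
```

A prefix sum over the per-bucket counts then gives each bucket's output offset, and a scatter pass moves the elements into contiguous, ordered buckets; choosing splitters from a sufficiently large random sample keeps the bucket sizes roughly equal.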
The framework was applied to three algorithms to verify its effectiveness: radix sort, merge sort, and bitonic sort. Experimental results show that for n = 2^28, radix sort gains nearly a 25% performance improvement, merge sort 8%, and bitonic sort up to 8.32%.
Efficient implementations of various sorting algorithms on the Graphics Processing Unit (GPU) have been studied for years, owing to their importance in many applications. However, as the computational power of GPUs and the data size increase, the performance gap between computation and communication enlarges exponentially. As a result, the data movement between the host (CPU) and the device (GPU) becomes the performance bottleneck. For sorting algorithms, data movement can take over 60% of the total execution time when the data size is larger than 2^20 on an NVIDIA Fermi C2070.
In this thesis, we propose a framework to enhance the performance of GPU sorting algorithms, which utilizes the streams concurrency technique to overlap communication and computation time. First, the data are partitioned into buckets. Each bucket has roughly the same size, and the buckets are ordered: every element in a bucket is greater than or equal to every element in the preceding bucket. Second, the data in each bucket are sorted separately on the GPU using any preferred sorting algorithm. Last, sorting and data output are overlapped to hide the communication time. The major challenge of this framework lies in the first step: partitioning the data into ordered buckets of roughly equal size. The sample sort algorithm is employed to solve this problem.
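The following CUDA sketch illustrates how steps two and three can be overlapped, assuming the buckets already lie contiguously in pinned host memory and each bucket size is a power of two that is a multiple of the block size. Each bucket's input copy, sort (here a textbook global-memory bitonic network), and output copy are issued on one of two streams, so the device-to-host transfer of one bucket can overlap the sorting of the next. The function name sort_and_output, the two-stream round-robin, and the per-bucket bitonic sort are assumptions for illustration only; in the thesis any GPU sorting algorithm can be plugged in.

```cuda
#include <cuda_runtime.h>

// Classic bitonic compare-exchange step over global memory; one launch per (k, j) stage.
__global__ void bitonic_step(unsigned int *d, unsigned int j, unsigned int k)
{
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ixj = i ^ j;
    if (ixj > i) {
        bool ascending = ((i & k) == 0);
        if ((ascending && d[i] > d[ixj]) || (!ascending && d[i] < d[ixj])) {
            unsigned int t = d[i]; d[i] = d[ixj]; d[ixj] = t;
        }
    }
}

// Hypothetical pipeline: for each bucket, the H2D copy, the sort kernels, and the
// D2H copy are issued on one of NUM_STREAMS streams, so the output transfer of
// bucket b can overlap the sorting of bucket b+1 (step three of the framework).
// Assumes h_pinned is page-locked and bucket_len is a power of two >= 256.
void sort_and_output(unsigned int *h_pinned, size_t bucket_len, int num_buckets)
{
    const int NUM_STREAMS = 2;
    cudaStream_t stream[NUM_STREAMS];
    unsigned int *d_buf[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_buf[s], bucket_len * sizeof(unsigned int));
    }

    dim3 block(256), grid((unsigned int)(bucket_len / 256));
    for (int b = 0; b < num_buckets; ++b) {
        int s = b % NUM_STREAMS;
        unsigned int *h_bucket = h_pinned + (size_t)b * bucket_len;

        cudaMemcpyAsync(d_buf[s], h_bucket, bucket_len * sizeof(unsigned int),
                        cudaMemcpyHostToDevice, stream[s]);
        for (unsigned int k = 2; k <= bucket_len; k <<= 1)     // bitonic sort stages
            for (unsigned int j = k >> 1; j > 0; j >>= 1)
                bitonic_step<<<grid, block, 0, stream[s]>>>(d_buf[s], j, k);
        cudaMemcpyAsync(h_bucket, d_buf[s], bucket_len * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```

Pinned (page-locked) host memory is what allows cudaMemcpyAsync to proceed concurrently with kernels on other streams; with pageable memory the copies behave synchronously and the communication/computation overlap disappears.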
Three sorting algorithms were implemented to validate the effectiveness of this framework: radix sort, merge sort, and bitonic sort. Experiments show that a nearly 25% performance improvement can be obtained for radix sort when n = 2^28. For merge sort, the improvement is 8%, and for bitonic sort, up to 8.32% can be achieved.