研究生: |
李維陞 Lee, Wei-Shen |
---|---|
論文名稱: |
在多核心圖形處理器架構下實現一有效率的排序演算法 An Efficient Sorting Algorithm based on Many-core Graphics Processing Units |
指導教授: |
唐傳義
Tang, Chuan-Yi |
口試委員: | |
學位類別: |
碩士 Master |
系所名稱: |
電機資訊學院 - 資訊工程學系 Computer Science |
論文出版年: | 2010 |
畢業學年度: | 98 |
語文別: | 英文 |
論文頁數: | 46 |
中文關鍵詞: | 平行排序 、圖形處理器 、雙調排序 、謝爾排序 、平行處理 、統一計算架構 |
外文關鍵詞: | Parallel shellsort, GPU, Bitonic merge sort, Sorting, CUDA |
相關次數: | 點閱:3 下載:0 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
Sorting is one of most classic algorithmic problem in computer science and its fundamental importance has led to the design and implementation of sorting algorithms on recently evolved many-core Graphic Processing Units (GPUs). CUDPP Radix sort has been the most ecient non-comparison sort for a period of time, and recently published GPU Sample sort is considerably the best comparison-based sort. Though the sorting implementations are quite ecient, they either need extra space for data rearrangement or atomic operation for acceleration. Applications runningon GPU usually deal with huge amount of data thus memory utilization is an important consideration. Also, sorting algorithms on GPUs that do not support atomic operations can result in performance degradation or even fail to work. In this paper, we propose an implementation of a shellsort algorithm for NVIDIA's GPUs. It is an in-place sort and requires no atomic operation support. Our experimental result shows that, on average, its performance is nearly twice faster than GPU quicksort and nearly three times faster than Thurst mergesort. Its performance is almost
the same as GPU sample for sorting up to 32M data elements.
Our implementation is robust to various input distributions and should also be well suited for other many-core architectures.
在計算機科學的研究領域,排序法是一直不斷被拿出來討論的問題之一;由於排序法的基礎性和重要性,使得它現在也廣泛被移植到新興的多核心圖形處理器(GPUs)上。在CUDPP函式庫裡的 GPU radix sort 是一種非比較式的(non-comparison-based)排序法,它是目前在GPU上最快的排序法;而 GPU sample sort 是GPU上最快的比較式(comparison-based)排序。這兩個排序法,分別是兩種不同排序分類裡的領先技術(state of the art);不過它們都需要使用到額外的空間來重組資料,或是使用不可分割運算(atomic operation)來加速。GPU通常被用來處理非常大量的資料,因此,記憶空間的運用顯得特別重要;另外,排序法在沒有不可分割運算(atomic operation)支援的GPU上執行,可能導致效能下降,或甚至無法正常執行。在這個論文裡,我們提出一個在 NVIDIA GPU 上實現shell排序的方法。這個方法不用額外的記憶空間,也不需要不可分割運算的支援。我們的實驗結果也顯示,GPU shellsort 平均上有 GPU quicksort 的兩倍快,與將近快過三倍 Thrust mergesort 。在小於三千兩百萬筆資料時,效能大概和目前比較式排序的中最快的 GPU sample sort 相當。我們所提出在GPU上的 shell 排序法 ,在各種資料分佈上的表現也都非常穩健,沒有一種標準測試用的資料分佈會使這個方法的效能變的非常差。基本上這個方法也適用其它的多核心平台。
[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. MA: The MIT Press, 2001.
[2] S. G. Akl, Parallel Sorting Algorithms. Orlando: Academic Press, 1985.
[3] M. A. Weiss, Data structures and algorithm analysis in C, 2nd ed. CA:Addison-Wesley, 1997.
[4] C. Breshears, The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications, Cambridge: O'Reilly Media, 2009.
[5] D. B. Kirk and W.-M. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, MA: Morgan Kaufmann Publishers, 2010.
[6] D. E. Knuth, The Art of Computer Programming, Vol 3, Addison-Wesley, 1973.
[7] T. R. Halfhill, \Parallel Processing with CUDA," Microprocessor Report, Jan. 2008.
[8] W. A. Martin, \Sorting," ACM Comput. Surv., vol. 3, no. 4, pp. 147-174, Dec. 1971.
[9] D. L. Shell, \A high-speed sorting procedure," Commun. ACM, vol. 2, no. 7, pp. 30-32, July 1959.
[10] H. W. Lang. (2010). Shellsort [Online]. Available: http://www.inf.fh-
ensburg.de/lang/algorithmen/sortieren/shell/shellen.htm.
[11] V. R. Pratt, \Shellsort and Sorting Networks," Ph.D. dissertation. Stanford University, Standford, CA, USA, 1972.
[12] M. Ciura, \Best Increments for the Average Case of Shellsort," 13th International Symposium on Fundamentals of Computation Theory, vol. 2138, pp. 106-117, August 2001.
[13] R. Sedgewick, \Analysis of Shellsort and Related Algorithms," In Proceedings of the Fourth Annual European Symposium on Algorithms, vol. 1136, pp. 1-11, 1996.
[14] Wikipedia. (2010). WIKI: Shell sort [online]. Available: http://en.wikipedia.org/wiki/Shellsort
[15] NVIDIA CUDA Programming Guide, NVIDIA Corporation, 2009, version 2.3.
[16] NVIDIA CUDA C Programming Best Practices Guide, NVIDIA corporation, 2009, version 2.3.
[17] M. Harris, S. Sengupta, J. D. Owens, Parallel Prex Sum (Scan) with CUDA. In GPU Gems 3, Nguyen H., (Ed.), Addison Wesley, Aug. 2007, chapter 31.
[18] R. Baraglia, G. Capannini, F. M. Nardini, and F. Silvestri, Sorting using Bitonic Network with CUDA. In th 7th Workshop on Large Scale Distributed Systems for Information Retrieval, Boston, USA, July, 2009.
[19] N. Satish, M. Harris, and M. Garland, "Designing Ecient Sorting Algorithms for Manycore GPUs," IPDPS, 2009, pp. 1-10.
[20] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. "Efficient Implementationof Sorting on Multi-core SIMD CPU Architecture," PVLDB, vol. 1, no. 2, pp. 1313-1324, August 2008.
[21] N. Leischner, V. Osipov, and P. Sanders, "GPU Sample Sort," IPDPS, 2010, pp. 1-10.
[22] D. R. Helman, D. A. Bader, and J. JaJa, "A randomized parallel sorting algorithm with an experimental study," J. of Parallel and Distributed Computing, vol. 52, no. 1, pp. 1-23, 1998.
[23] S. Sengupta, M. Harris, and M. Garland, \Ecient Parallel Scan Algorithms for GPUs," NVIDIA Technical Report, 2008.
[24] S. Sengupta, M. Harris, Y. Zhang, and J.D. Owens, "Scan primitives for GPU computing." in Graphics Hardware 2007, August 2007, pp. 97-106.
[25] D. Cederman and P. Tsigas, "A practical quicksort algorithm for graphics processors," in Proc. 16th Annual European Symposium on Algorithms, Sep. 2008, pp. 246-258.
[26] S. Chen, J. Qin, Y. Xie, J. Zhao, and P.-A. Heng, \A fast and exible sorting algorithm with CUDA," Proceedings of the 9th International Conference on Algorithms and Architectures for Parallel Processing Springer-Verlag, 2009, pp. 281-290.
[27] E. Sintorn and U. Assarsson, \Fast parallel GPU-sorting using a hybrid algorithm," J. Parallel Distrib. Comput., vol. 68, no. 10, pp. 1381-1388, 2008.
[28] F. Dehne and H. Zaboli. \Deterministic Sample Sort For GPUs," arXiv:1002.4464, 2010.
[29] J. Vitter, "External memory algorithms and data structures: Dealing with massive data," ACM Computing Surveys, pages 209-271, 2001.
[30] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, "An experimental analysis of parallel sorting algorithms," Theory of Computing Systems, Vol. 31, No. 2, March/April
1998, pp. 135-167.
[31] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable Parallel Programming with CUDA," Queue, Vol. 6, No. 2, pp. 40-53, Mar/Apr 2008.
[32] V. Volkov, and Demmel, J. W, "Benchmarking GPUs to tune dense linear algebra, "ACM/IEEE Conference on Supercomputing, Austin, 2008. pp. 1-11.
[33] M. Matsumoto and T. Nishimura, "Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number genera-tor," ACM Transactions on Modeling and Computer Simulation, vol. 8, No. 1, pp. 330, 1998.
[34] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 73-82.
[35] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha, "GPUTeraSort: high performance graphics coprocessor sorting for large database management," In Proceedings of the 2006 ACM SIGMOD international Conference on Management of Data, June 2006.