Graduate Student: | 廖宥誠 Liao, Yu Cheng |
---|---|
Thesis Title: | 對於Burrows-Wheeler轉換並利用D-Critical子字串的異質後綴陣列建構法 hSA-DS: A Heterogeneous Suffix Array Construction Using D-Critical Substrings for Burrows-Wheeler Transform |
Advisor: | 許雅三 Hsu, Yarsun |
Committee Members: | 邱瀞德, 李政崑 |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
Year of Publication: | 2015 |
Academic Year of Graduation: | 104 (ROC calendar) |
Language: | English |
Number of Pages: | 70 |
Keywords (Chinese): | 圖形處理器, 平行化, 後綴陣列, Burrows-Wheeler Transform, CUDA |
Keywords (English): | GPU, Parallelization, Suffix Array, Burrows-Wheeler Transform, CUDA |
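The Burrows-Wheeler Transform (BWT) is a string transformation widely used in data compression and biotechnology; the commonly used bzip2 compressor is one example. BWT-based compressors achieve higher compression ratios but need longer compression times. Viewed mathematically, the BWT result can be derived from an already constructed suffix array. BWT can therefore benefit from the many linear-time suffix array construction algorithms that researchers have developed over recent decades.
On the other hand, graphics processing units (GPUs) have in recent years become the most cost-efficient solution for running parallel programs, and earlier linear-time suffix array construction algorithms can use GPU computing power to achieve better performance. This thesis analyzes the existing parallel suffix array algorithms and then proposes the first heterogeneous SA-DS algorithm, an improvement based on the SA-DS algorithm. The heterogeneous SA-DS algorithm uses the computing power of both the GPU and the CPU, and focuses on the string lengths of 100K to 2M characters that BWT compressors require. To achieve better performance, we also optimized the parallel radix sort originally provided with CUDA for our platform.
Finally, the complete heterogeneous SA-DS runs on an NVIDIA GPU and is compared with the original SA-DS version. In the results, our optimized radix sort takes 23% less execution time than the Thrust library version when sorting 1M characters; our heterogeneous SA-DS algorithm shows up to a 4x speedup over the original sequential version and up to a 2x speedup over the BWT in the latest CUDPP library.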
The Burrows-Wheeler Transform (BWT) is a widely used algorithm applied in bioinformatics and in data compression tools such as bzip2. BWT-based compression achieves a better compression ratio but takes longer to compress. From a mathematical point of view, the BWT can be derived from a constructed suffix array. Over the past decades, researchers have developed many suffix array construction algorithms (SACAs) that benefit the BWT algorithm. Meanwhile, graphics processing units (GPUs) have recently emerged as the most cost-efficient solution for parallel computation, and existing linear-time SACAs have begun to utilize the computational power of GPUs. In this work, we analyze the current parallel implementations of SACAs and introduce the first heterogeneous implementation of the SA-DS algorithm. The implementation leverages both the CPU and the GPU, and we focus on the typical block sizes for BWT-based compression, 100K to 2M characters. To achieve better performance, we also optimize the up-to-date radix sort on the GPU for our platform. Finally, the implementation is evaluated on a heterogeneous platform equipped with an NVIDIA GPU using the CUDA programming model. As a result, the optimized radix sort on the GPU takes up to 23% less time than the latest Thrust library when sorting millions of keys. Our heterogeneous SA-DS demonstrates up to a 4x speedup over the sequential C++ version of SA-DS and a performance gain of up to 2x over the parallel BWT provided by the CUDPP library.
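The statement that the BWT can be derived from a constructed suffix array can be illustrated with a minimal C++ sketch. It is not the thesis's implementation: the function names `build_suffix_array` and `bwt_from_suffix_array` are hypothetical, the suffix array is built with a naive comparison sort standing in for a linear-time SACA such as SA-DS, and the input is assumed to end with a unique sentinel character that is smaller than every other character.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort the suffix start positions lexicographically.
// This stands in for a linear-time SACA such as SA-DS (assumption: it is
// only an illustration and is far slower than a real construction).
std::vector<std::size_t> build_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        // Compare the suffix starting at a with the suffix starting at b.
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Derive the BWT from the suffix array: the i-th output character is the
// character that precedes the i-th smallest suffix, wrapping around to the
// last character (the sentinel) for the suffix that starts at position 0.
std::string bwt_from_suffix_array(const std::string& text,
                                  const std::vector<std::size_t>& sa) {
    assert(sa.size() == text.size());
    std::string bwt(text.size(), '\0');
    for (std::size_t i = 0; i < sa.size(); ++i) {
        bwt[i] = (sa[i] == 0) ? text.back() : text[sa[i] - 1];
    }
    return bwt;
}
```

For the input `banana$`, the suffix array is `[6, 5, 3, 1, 0, 4, 2]` and the derived BWT is `annb$aa`. Once the suffix array exists, the transform is a single linear pass, which is why the construction step is where a faster SACA pays off.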