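Graduate student: 廖宥誠 (Liao, Yu Cheng)
Thesis title (Chinese): 對於Burrows-Wheeler轉換並利用D-Critical子字串的異質後綴陣列建構法
Thesis title (English): hSA-DS: A Heterogeneous Suffix Array Construction Using D-Critical Substrings for Burrows-Wheeler Transform
Advisor: 許雅三 (Hsu, Yarsun)
Oral defense committee: 邱瀞德, 李政崑
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2015
Academic year of graduation: 104
Language: English
Number of pages: 70
Keywords (Chinese): 圖形處理器 (GPU), 平行化 (parallelization), 後綴陣列 (suffix array), Burrows-Wheeler Transform, CUDA
Keywords (English): GPU, Parallelization, Suffix Array, Burrows-Wheeler Transform, CUDA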
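  • The Burrows-Wheeler Transform (BWT) is a string transformation widely used in data compression and biotechnology; the commonly used bzip2 compressor is one example. Although BWT-based compressors achieve higher compression ratios, they also need longer compression times. Viewed mathematically, the BWT output can be derived from an already constructed suffix array. BWT can therefore benefit from the many linear-time suffix array construction algorithms that researchers have developed over recent decades.
    On the other hand, graphics processing units (GPUs) have in recent years become the most cost-efficient solution for running parallel programs, so the previously developed linear-time suffix array construction algorithms can use the computing power of GPUs to reach better performance. This thesis analyzes the existing parallelized suffix array algorithms and then proposes the first heterogeneous SA-DS algorithm, an improvement built on the SA-DS algorithm. The heterogeneous SA-DS algorithm uses the computing power of both the GPU and the CPU, and it focuses on the string lengths of 100K to 2M characters that BWT compressors require. To achieve better performance, we also optimized for our platform the parallel radix sort originally provided with CUDA.
    Finally, the complete heterogeneous SA-DS runs on an NVIDIA GPU and is compared with the original SA-DS version. The results show that our optimized radix sort needs 23% less execution time than the Thrust library version when sorting 1M characters. Our heterogeneous SA-DS algorithm achieves up to a 4x speedup over the original sequential version and up to a 2x speedup over the BWT in the latest CUDPP library.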


    The Burrows-Wheeler Transform (BWT) is a widely used algorithm in bioinformatics and in data compression tools such as bzip2. BWT-based compression achieves a better compression ratio at the cost of longer compression time. From a mathematical point of view, the BWT can be derived from a constructed suffix array. Over the past decades, researchers have developed many suffix array construction algorithms (SACAs) that benefit the BWT algorithm. Meanwhile, graphics processing units (GPUs) have recently emerged as the most cost-efficient solution for parallel computation, and earlier linear-time SACAs have begun to use the computational power of GPUs. In this work, we analyze the current parallel implementations of SACAs and introduce the first heterogeneous implementation of the SA-DS algorithm. The implementation leverages both the CPU and the GPU, and we focus on the typical block sizes for BWT-based compression, 100K to 2M characters. To achieve better performance, we also optimize the up-to-date radix sort on the GPU for our platform. Finally, the implementation is evaluated on a heterogeneous platform equipped with an NVIDIA GPU using the CUDA programming model. As a result, the optimized radix sort on the GPU takes up to 23% less time than the latest Thrust library for sorting millions of keys. Our heterogeneous SA-DS demonstrates a speedup of up to 4x over the sequential C++ version of SA-DS and up to 2x over the parallel BWT provided by the CUDPP library.
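    The abstract's statement that the BWT can be derived from a constructed suffix array can be illustrated with a minimal Python sketch. It is not the thesis's SA-DS implementation: the suffix array here is built by naively sorting all suffixes, and the function names `suffix_array` and `bwt_from_suffix_array` are hypothetical. The sketch also assumes the input ends with a unique sentinel character `$` that sorts before every other character.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions by the suffix text.

    Illustration only. This is far slower than the linear-time SACAs
    (such as SA-DS) that the abstract refers to.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """Derive the BWT from a suffix array.

    The i-th BWT character is the character that precedes suffix sa[i].
    For sa[i] == 0, the index -1 wraps around to the last character,
    which is the sentinel under the assumption stated above.
    """
    return "".join(text[i - 1] for i in sa)


# Assumption: "$" is a unique sentinel smaller than all other characters.
text = "banana$"
sa = suffix_array(text)
bwt = bwt_from_suffix_array(text, sa)
print(sa)   # [6, 5, 3, 1, 0, 4, 2]
print(bwt)  # annb$aa
```

    Once the suffix array is available, the BWT costs only one linear pass over it, which is why faster suffix array construction translates directly into faster BWT-based compression.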

    Abstract
    Contents
    1 Introduction
        1.1 Motivation
        1.2 Goal and Contribution
        1.3 Organization
    2 Related Work
        2.1 Distributed Systems
        2.2 Heterogeneous Platforms with GPGPU
    3 Background
        3.1 GPU Architecture and Programming Model
        3.2 Burrows-Wheeler Transform
        3.3 BWT and Suffix Array
        3.4 Suffix Array Construction Algorithms
            3.4.1 Skew Algorithm
            3.4.2 KA Algorithm
    4 SA-DS Algorithm
        4.1 SA-DS Algorithm Overview
        4.2 Analysis of SA-DS Algorithm
        4.3 Radix Sort on GPU
    5 Design and Implementation
        5.1 Parallelizing SA-DS
            5.1.1 Locating D-Critical Substrings
            5.1.2 Shrinking Problem
            5.1.3 Radix Sorting D-Critical Substrings
            5.1.4 Naming and Constructing SA
            5.1.5 Inducing SA Step 1
            5.1.6 Overlapping Sequential Portion
        5.2 Optimization
    6 Evaluation
        6.1 Experiment Environment
        6.2 Radix Sort
        6.3 Heterogeneous SA-DS
            6.3.1 Heterogeneous SA-DS Performance
            6.3.2 Performance of Datasets
            6.3.3 Comparisons with CUDPP Library
            6.3.4 Comparisons Between Different Algorithms
    7 Conclusions and Future Works
        7.1 Conclusions
        7.2 Future Works
    Bibliography


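    Full-text release date: this full text is not authorized for public release (on-campus network)
    Full-text release date: this full text is not authorized for public release (off-campus network)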
