
Graduate student: Lin, Yu-Shiang (林郁翔)
Thesis title: A New API Remoting Policy for General-Purpose Computing on GPU Virtualization and its Application on Biological Tools (通用計算在圖形處理器上虛擬化的API遠端處理策略以及在生物資訊工具上的應用)
Advisor: Chung, Yeh-Ching (鍾葉青)
Committee members: Hsu, Ching-Hsien (許慶賢); Lai, Kuan-Chou (賴冠州); Lin, Chun-Yuan (林俊淵); Chou, Jerry (周志遠); Hung, Che-Lun (洪哲倫); Lee, Che-Rung (李哲榮)
Degree: Doctor
Department:
Year of publication: 2018
Academic year of graduation: 106
Language: English
Number of pages: 102
Keywords (Chinese): GPU運算, 虛擬化, 雲端運算, GPU虛擬化, 生物資訊
Keywords (English): GPU computing, Virtualization, Cloud computing, GPU virtualization, Bioinformatics

    In this thesis, we design two bioinformatics tools, GPU-REMuSiC and CUDA ClustalW, that use graphics processing unit (GPU) acceleration to solve sequence alignment problems. In biological applications, sequence alignment is an important strategy for analyzing DNA and protein sequences, and multiple sequence alignment (MSA) and constrained multiple sequence alignment (CMSA) are essential methods for studying biological data. By exploiting GPU computing power, GPU-REMuSiC and CUDA ClustalW improve the performance of solving MSA and CMSA problems. However, the traditional execution environment is a barrier for biologists who want to set up and use these tools. We therefore use virtualization technology to give our tools the characteristics of a cloud service. Virtualization has become a fundamental technology in cloud computing, and the GPU is one of the hardware resources that needs to be virtualized because it is widely used in high-performance computing, especially general-purpose computing on GPUs (GPGPU). Although many GPGPU virtualization frameworks have been proposed, their performance is limited by the bandwidth of data transfers between the virtual machine (VM) and the host. Even where TCP/IP-based communication has been optimized to improve bandwidth over a high-speed interconnect, that optimization still incurs considerable latency on a low-performance network interface. Therefore, in this thesis we design a new virtualization framework, qCUDA, to improve the performance of Compute Unified Device Architecture (CUDA) programs. qCUDA builds on the virtio framework, providing a para-virtualized driver and a device module that handle API remoting and memory management. In addition, qCUDA provides a configurable policy for dynamic load balancing across multiple GPUs.
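To make the idea of API remoting more concrete, here is a minimal Python sketch of the general technique. It is not qCUDA's implementation: an in-process pair of queues stands in for the guest/host transport (qCUDA uses virtio), plain Python functions stand in for the GPU runtime, and the names `HOST_API`, `Channel`, `host_worker`, `GuestStub`, and `run_demo` are all hypothetical.

```python
import queue
import threading

# Hypothetical host-side functions standing in for a real GPU runtime library.
HOST_API = {
    "vector_add": lambda a, b: [x + y for x, y in zip(a, b)],
    "device_count": lambda: 2,
}


class Channel:
    """In-process stand-in for the guest/host transport (assumption)."""

    def __init__(self):
        self.requests = queue.Queue()
        self.responses = queue.Queue()


def host_worker(channel):
    """Host side: take a request, run the real call, send the result back."""
    while True:
        request = channel.requests.get()
        if request is None:  # shutdown sentinel
            break
        name, args = request
        try:
            channel.responses.put(("ok", HOST_API[name](*args)))
        except Exception as exc:  # report failures to the guest instead of dying
            channel.responses.put(("error", repr(exc)))


class GuestStub:
    """Guest side: accepts any call name and forwards it to the host."""

    def __init__(self, channel):
        self._channel = channel

    def __getattr__(self, name):
        # Only reached for names that are not real attributes, i.e. API calls.
        def remote_call(*args):
            self._channel.requests.put((name, args))
            status, payload = self._channel.responses.get()
            if status != "ok":
                raise RuntimeError(payload)
            return payload

        return remote_call


def run_demo():
    """Start the host worker, issue two forwarded calls, then shut down."""
    channel = Channel()
    worker = threading.Thread(target=host_worker, args=(channel,), daemon=True)
    worker.start()
    stub = GuestStub(channel)
    try:
        total = stub.vector_add([1, 2, 3], [10, 20, 30])
        count = stub.device_count()
    finally:
        channel.requests.put(None)
        worker.join()
    return total, count
```

The guest program only ever talks to `GuestStub`, so it does not need to know where the call is executed. In a real framework the cost of moving arguments and results across the channel is what bounds performance, which is why the VM-to-host data path matters so much.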
In our experiments, we select several benchmarks from the unmodified CUDA SDK (bandwidthTest, MatrixMul, vectorAdd, and simpleStreams), all of which demonstrate the essential steps of GPGPU computing. We also run two practical bioinformatics applications, GPU-REMuSiC and CUDA ClustalW, to demonstrate the practicality of qCUDA. In our test environment, qCUDA achieves more than 95% of the native (physical machine) bandwidth in most results. Compared with prior work, qCUDA also offers more flexibility and interposition: CUDA-compatible programs can run in both Linux and Windows VMs on the QEMU-KVM hypervisor.
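The 95% figure is read here as the bandwidth measured inside the VM divided by the bandwidth measured on the physical machine. The short Python sketch below shows that arithmetic. The function names and the sample figures are illustrative placeholders, not measurements from the thesis.

```python
def bandwidth_efficiency(virtual_mb_s: float, native_mb_s: float) -> float:
    """Return VM bandwidth as a percentage of native bandwidth."""
    if native_mb_s <= 0:
        raise ValueError("native bandwidth must be positive")
    return 100.0 * virtual_mb_s / native_mb_s


def fraction_meeting_target(pairs, target_percent=95.0):
    """Fraction of (virtual, native) pairs whose efficiency reaches the target."""
    if not pairs:
        raise ValueError("at least one measurement pair is required")
    hits = sum(
        1 for virtual, native in pairs
        if bandwidth_efficiency(virtual, native) >= target_percent
    )
    return hits / len(pairs)


# Placeholder figures in MB/s, chosen only to exercise the functions.
SAMPLE_PAIRS = [(5760.0, 6000.0), (5940.0, 6000.0), (5400.0, 6000.0)]
```

With these placeholder pairs, two of the three reach the 95% target, which mirrors the form of the claim ("most results") without reproducing any actual data.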

    Abstract (in Chinese)
    Abstract
    Acknowledgements
    1 Introduction
    2 Background
        2.1 Sequence Alignment in Computational Biology
            2.1.1 GPGPU Programming Model - CUDA
            2.1.2 GPU Acceleration of Sequence Alignment
        2.2 Virtualization
            2.2.1 KVM
            2.2.2 QEMU
            2.2.3 VirtIO
            2.2.4 GPGPU Virtualization
        2.3 Related Work
    3 Preliminary Concepts
        3.1 GPU-REMuSiC and CUDA-ClustalW
            3.1.1 Parallel dynamic programming algorithms on CPUs
            3.1.2 Calculation steps of RE-MuSiC and ClustalW
            3.1.3 Parallel sequence alignment algorithms on GPUs
            3.1.4 GPU-REMuSiC
            3.1.5 CUDA-ClustalW
        3.2 System Concept and Design on qCUDA
            3.2.1 System Components
            3.2.2 Library Interposition
            3.2.3 Control Channel
            3.2.4 Memory Management
            3.2.5 NCVM & CVM
            3.2.6 Pinned Host Memory
            3.2.7 Implicitly Identified Information
            3.2.8 Scheduler plug-in on Multi-GPUs
        3.3 Create a CUDA runtime API via qCUDA framework
    4 Experimental Results and Discussion
        4.1 GPU-REMuSiC on physical machine
        4.2 CUDA-ClustalW on physical machine
        4.3 qCUDA framework
            4.3.1 Bandwidth of Data Transaction
            4.3.2 Matrix Multiplication
            4.3.3 Vector Addition
            4.3.4 Multiple Streams
            4.3.5 GPU-REMuSiC and CUDA-ClustalW
            4.3.6 Multiple VMs
            4.3.7 Windows guest OS in GPGPU virtualization
            4.3.8 Pluggable scheduler for multiple GPUs
    5 Conclusion
    References

