Graduate student: | 陸仕昀 Lu, Shih-Yun |
---|---|
Thesis title: | 基於卷積強化自注意力機制之糾正標註法以及應用於乙型鏈球菌之奈米孔核苷酸序列 An Error-Correction Method Based on Convolution-Augmented Self-Attention Mechanism for Nanopore DNA Sequencing of Streptococcus agalactiae |
Advisors: | 洪健中 Hong, Chien-Chong; 劉通敏 Liou, Tong-Miin |
Oral defense committee: | 陳治平 Chen, Chie-Pein; 張淵仁 Chang, Yuan-Jen |
Degree: | 碩士 Master |
Department: | 工學院 - 動力機械工程學系 College of Engineering - Department of Power Mechanical Engineering |
Year of publication: | 2024 |
Academic year of graduation: | 112 |
Language: | Chinese |
Number of pages: | 107 |
Keywords (Chinese): | 乙型鏈球菌 、核苷酸定序 、序列糾正 、深度學習 、卷積強化自注意力機制 |
Keywords (English): | Streptococcus agalactiae, DNA sequencing, Sequencing revising, Deep learning, Convolution-augmented self-attention mechanism |
Streptococcus agalactiae, also known as group B streptococcus (GBS), is a Gram-positive, encapsulated bacterium. GBS causes severe invasive infections in newborns, pregnant women, and immunocompromised adults, and can lead to complications such as meningitis, pneumonia, and sepsis, which may be fatal. The bacterium can be detected by third-generation sequencing, specifically Oxford Nanopore sequencing, which identifies the genes of the bacterial strain. Third-generation sequencing requires neither additional reagents nor cutting the sequence into fragments. However, its average accuracy is only 86.00%, and 45.78% of its errors are caused by homopolymers, so it falls short of the accuracy above 99.97% achieved by second-generation sequencing. This study therefore aims to design base error-correction software that improves the accuracy of the basecaller, using GBS as the target organism during training.
This study extracted genetic material from the microbial repository of Mackay Memorial Hospital, collected the electrical signals with chips from Oxford Nanopore Technologies, and, for the first time, used a GBS dataset to train base error-correction software. The proposed deep learning model is based on a convolution-augmented self-attention mechanism and is divided into two modules: CASACall and CASACall-ID. The CASACall module performs basecalling on the electrical signals obtained from third-generation sequencing. For the CASACall-ID module, the sequences basecalled by CASACall are aligned with the reference sequences to generate the relationships between the basecalled sequences and their errors, and these relationships are used to train CASACall-ID to predict where errors occur. Finally, the results of the two modules are compared to output the corrected sequence, which is then reconstructed into a complete genome.
After testing different modules and loss functions, this study achieved a best sequence identity of 89.18% on the GBS dataset, and a consensus accuracy of 92.14% after assembly followed by nanopolishing.
Keywords: Streptococcus agalactiae, DNA sequencing, sequencing revising, deep learning, convolution-augmented self-attention mechanism
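The abstract describes aligning basecalled sequences with reference sequences to derive the relationship between each sequence and its errors, and it reports results as sequence identity. The following is a minimal Python sketch of that idea, not the thesis's implementation. The function names `label_errors` and `sequence_identity`, the label alphabet (M, X, I, D), the toy alignment, and the definition of identity as matching columns divided by alignment length are all assumptions made for illustration; the thesis may define its labels and metric differently.

```python
from collections import Counter
from typing import List

GAP = "-"  # gap symbol used in the (hypothetical) pre-aligned input strings


def label_errors(aligned_read: str, aligned_ref: str) -> List[str]:
    """Label every alignment column.

    M = match, X = mismatch, I = insertion (base only in the read),
    D = deletion (base only in the reference).
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for read_base, ref_base in zip(aligned_read, aligned_ref):
        if read_base == GAP and ref_base == GAP:
            raise ValueError("a column cannot contain two gaps")
        if ref_base == GAP:
            labels.append("I")
        elif read_base == GAP:
            labels.append("D")
        elif read_base == ref_base:
            labels.append("M")
        else:
            labels.append("X")
    return labels


def sequence_identity(labels: List[str]) -> float:
    """Assumed definition: matching columns divided by all alignment columns."""
    if not labels:
        return 0.0
    return Counter(labels)["M"] / len(labels)


# Toy alignment (invented data): one mismatch, one deletion, and one
# insertion inside a homopolymer run of A.
read = "ACGT-AAAGT"
ref = "ACCTTAA-GT"

labels = label_errors(read, ref)
print("".join(labels))                     # MMXMDMMIMM
print(f"{sequence_identity(labels):.2f}")  # 0.70
```

In this sketch the per-column labels play the role of training targets for an error-prediction model, while the identity function gives a single score for comparing basecalled reads against the reference. In practice the aligned strings would come from an aligner's output rather than being written by hand.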