Author: Sun, Jo-Pu (孫若璞)
Thesis Title: Nanopore DNA Sequencing with Convolution-Augmented Self-Attention Mechanism Based Deep Learning Model for Escherichia coli Identification
Original Title (Chinese): 基於卷積強化之自注意力機制深度學習應用於大腸桿菌之奈米孔核甘酸序列定序
Advisors: Hong, Chien-Chong (洪健中); Liou, Tong-Miin (劉通敏)
Oral Defense Committee: Ting, Chuan-Kang (丁川康); Chen, Chie-Pein (陳治平)
Degree: Master
Department: College of Engineering - Department of Power Mechanical Engineering
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar)
Language: English
Number of Pages: 74
Keywords: DNA sequencing, basecaller, deep learning, self-attention
    DNA sequencing determines the order and type of the nucleotides in a DNA strand. It plays an important role in medicine, biology, and forensic science. The current mainstream method is the Sanger (dideoxy chain-termination) method and the methods derived from it. These methods are slow and have difficulty sequencing long strands. Nanopore sequencing, a third-generation sequencing technology, passes a DNA strand through a nanometer-scale pore and measures the resulting change in ionic current to infer the base sequence. Oxford Nanopore Technologies has commercialized biological nanopores. Single-read accuracy still has room for improvement, and performance in single-nucleotide variant detection and species classification does not yet match that of second-generation sequencing. Improving single-read accuracy would also simplify sample preprocessing, further reducing time and cost.
    The process of converting the nanopore electrical signal into the corresponding bases is called basecalling. Because the relationship between the signal and the bases is complicated and obscured by noise, most basecallers rely on deep learning. Current basecallers mostly use recurrent neural networks to capture sequential information in the signal, rather than the self-attention mechanism that has largely replaced recurrent neural networks in natural language processing.
    This study builds a basecaller based on a convolution-augmented self-attention mechanism. After hyperparameter tuning, it achieves an identity of 94.68% on the official Bonito dataset and 95.09% on a self-prepared Escherichia coli dataset, exceeding previously published basecallers, whose identity ranges from about 87% to 90%.
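    The thesis outline lists scaled dot-product self-attention as the basis of the model. The following is a minimal, illustrative sketch of that general operation in plain Python, not the thesis implementation: the function names, the toy input values, and the use of identity projection matrices are all assumptions made for this example. A real basecaller would apply learned projections to features extracted from the raw signal.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (T, d_model).

    w_q, w_k, w_v are projection matrices (hypothetical; learned in practice).
    Returns the attended sequence and the (T, T) attention weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    # scores[i][j]: similarity between query at step i and key at step j
    scores = matmul(q, k_t)
    # Scale by sqrt(d_k), then normalize each row into a probability distribution
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: 3 time steps with 2 features each (made-up values, not real signal data).
signal_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
output, weights = self_attention(signal_features, identity, identity, identity)
```

    Each row of `weights` sums to 1, so every output time step is a weighted average of all value vectors. This lets every position in the signal attend to every other position in one step, without the step-by-step recurrence of an RNN.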

    Table of Contents:
    Abstract (Chinese)
    Abstract
    Glossary
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Sequencing
        1.1.1 Traditional Sequencing Methods
        1.1.2 Third-generation Sequencing
        1.1.3 Oxford Nanopore
      1.2 Deep Learning
        1.2.1 Basics of Deep Learning
        1.2.2 Sequence-to-Sequence Task
        1.2.3 Deep Learning in Nanopore Sequencing
      1.3 Conclusion
      1.4 Research Motivation
      1.5 Research Objectives
      1.6 Thesis Organization
    Chapter 2 Methods
      2.1 Scaled Dot-Product Self-Attention
      2.2 Modified Self-Attention Mechanism
        2.2.1 Convolution Subsampling
        2.2.2 Encoder
        2.2.3 Decoder
      2.3 Overall Workflow
        2.3.1 Preprocess
        2.3.2 Postprocess
    Chapter 3 Model Establishment
      3.1 Dataset from Bonito
      3.2 Environments Configurations
      3.3 Learning Rate Policy
      3.4 Model Variations
        3.4.1 Number of Parameters
        3.4.2 Convolution Kernel Size
        3.4.3 In-Stack Positional Encoding
        3.4.4 Dimension
      3.5 Comparison with Other Architectures
      3.6 Discussion
      3.7 Summary
    Chapter 4 Experimental Results and Discussion
      4.1 Escherichia coli Dataset Produced by Oxford Nanopore Chip
        4.1.1 Data Statics
        4.1.2 Selected Databank for Escherichia coli Genome
        4.1.3 Preprocess of Data
      4.2 Fine-Tuning with Model
      4.3 Escherichia coli Dataset Result
      4.4 Conclusion
    Chapter 5 Conclusion
      5.1 Summary
      5.2 Research Contribution
    References
    Author Biography

