
Graduate Student: Lan, Yi-Shan (藍以珊)
Thesis Title: NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks
Advisor: Wang, Ting-Chi (王廷基)
Oral Examination Committee: Ho, Tsung-Yi (何宗易); Li, Shu-Min (李淑敏)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of Publication: 2024
Academic Year of Graduation: 112
Language: English
Number of Pages: 39
Keywords: Protein Dynamic Simulation, Protein Representation Learning, Protein Functional Prediction


    Protein classification tasks play a pivotal role in drug discovery, where the dynamic nature of real-world protein structures determines their properties. Existing machine learning methods, such as ProNet [2], are limited in that they capture only a narrow range of conformational characteristics and protein side-chain features. This limitation often results in unrealistic protein structures and inaccurate predicted protein classes. This paper presents innovative semantic data augmentation methods, namely Novel Augmentation of New Node Attributes (NaNa) and Molecular Interactions and Geometric Upgrading (MiGu), which incorporate both backbone chemical and side-chain biophysical information into protein classification tasks within a co-embedding residual learning framework. Specifically, we leverage molecular biophysical data, secondary structure information, chemical bonds, and ionic features of proteins to improve protein classification. Our semantic augmentation techniques, coupled with the co-embedding residual learning framework, improve the performance of GIN [3] on the EC and Fold datasets [4, 5] by 16.41% and 11.33%, respectively. The code for implementing these methods is available at https://github.com/r08b46009/Code_for_MIGU_NANA.
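    As a rough illustration of the semantic node-attribute augmentation idea described in the abstract, the sketch below concatenates per-residue biophysical descriptors onto existing node feature vectors of a protein graph. All feature names and values here are hypothetical placeholders, not the thesis's actual descriptors or implementation:

    ```python
    # Sketch of semantic node-attribute augmentation for a protein graph.
    # The lookup table and values are illustrative placeholders only.

    # Hypothetical per-residue biophysical descriptors:
    # [hydrophobicity, formal charge] (values are examples, not real data).
    BIOPHYS = {
        "ALA": [1.8, 0.0],
        "LYS": [-3.9, 1.0],
        "ASP": [-3.5, -1.0],
    }

    def augment_node_features(residues, base_features):
        """Concatenate semantic (biophysical) descriptors onto each node's
        existing feature vector; residues missing from the table get zeros."""
        augmented = []
        for res, feats in zip(residues, base_features):
            extra = BIOPHYS.get(res, [0.0, 0.0])
            augmented.append(list(feats) + extra)
        return augmented

    residues = ["ALA", "LYS", "GLY"]
    base = [[0.1], [0.2], [0.3]]  # e.g. one-hot or learned embeddings per node
    aug = augment_node_features(residues, base)
    # Each node vector grows from 1 to 3 dimensions; GLY falls back to zeros.
    ```

    In a GNN pipeline, the augmented vectors would replace the original node features before message passing; edge attributes (e.g. chemical-bond features) could be extended the same way.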

    1 Introduction
    2 Related Work
      2.1 Graph Neural Network
      2.2 Protein Structure Representation
      2.3 Harnessing Graph Neural Networks for Protein Structure Classification
      2.4 The Interplay between Chemical Information and Protein Classification
      2.5 Traditional Data Augmentation for Graph Neural Network
      2.6 Semantic Augmentation for Other Tasks
    3 Methodology
      3.1 Procedure Overview
      3.2 Semantic Data Augmentation
        3.2.1 Novel Node Attributes
        3.2.2 Novel Edge Attributes
        3.2.3 Our Method: MiGu & NaNa Data Augmentation
      3.3 Co-Embedding Residual Learning Framework
    4 Experiments
      4.1 Implementation Details
      4.2 Datasets
        4.2.1 SCOPe Classification Dataset
        4.2.2 EC Dataset
        4.2.3 Enhanced SCOPe Classification Dataset
      4.3 Experimental Results
        4.3.1 The Effectiveness of Residual Learning Framework
        4.3.2 Impact of Node and Edge Attributes
        4.3.3 Leave-One-Out Feature Analysis
        4.3.4 Influence of Node Features
    5 Discussion
      5.1 General Discussion
      5.2 Potential Impacts
      5.3 Limitations
    6 Conclusions
    7 Appendix

    [1] Y.-S. Lan, P.-Y. Chen, and T.-Y. Ho, “Nana and migu: Semantic data augmentation techniques to enhance protein classification in graph neural networks,” arXiv, 2024.
    [2] L. Wang, H. Liu, Y. Liu, J. Kurtin, and S. Ji, “Learning hierarchical protein representations via complete 3d graph networks,” in The Eleventh International Conference on Learning Representations, 2022.
    [3] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?,” International Conference on Learning Representations (ICLR), 2019.
    [4] A. Bairoch, “The enzyme database in 2000,” Nucleic acids research, 2000.
    [5] A. Andreeva, D. Howorth, J.-M. Chandonia, S. E. Brenner, T. J. Hubbard, C. Chothia, and A. G. Murzin, “Data growth and its impact on the scop database: new developments,” Nucleic acids research, 2007.
    [6] M. Hatch, T. Kagawa, and S. Craig, “Subdivision of c4-pathway species based on differing c4 acid decarboxylating systems and ultrastructural features,” Functional Plant Biology, 1975.
    [7] S. Cheng and C. L. Brooks III, “Viral capsid proteins are segregated in structural fold space,” PLoS computational biology, 2013.
    [8] D. A. Erlanson, S. W. Fesik, R. E. Hubbard, W. Jahnke, and H. Jhoti, “Twenty years on: the impact of fragments on drug discovery,” Nature reviews Drug discovery, 2016.
    [9] L. Wang, Y. Liu, Y. Lin, H. Liu, and S. Ji, “Comenet: Towards complete and efficient message passing for 3d molecular graphs,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
    [10] K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller, “Schnet–a deep learning architecture for molecules and materials,” The Journal of Chemical Physics, 2018.
    [11] D. Whitford, Proteins: structure and function. John Wiley & Sons, 2013.
    [12] Y. Wang, X. Pan, S. Song, H. Zhang, G. Huang, and C. Wu, “Implicit semantic data augmentation for deep networks,” Advances in Neural Information Processing Systems (NeurIPS), 2019.
    [13] B. Trabucco, K. Doherty, M. Gurinas, and R. Salakhutdinov, “Effective data augmentation with diffusion models,” 2023.
    [14] T. Ahmed, K. S. Pai, P. Devanbu, and E. T. Barr, “Automatic semantic augmentation of language model prompts (for code summarization),” in International Conference on Software Engineering (ICSE), 2024.
    [15] D. A. Case, T. E. Cheatham III, T. Darden, H. Gohlke, R. Luo, K. M. Merz Jr, A. Onufriev, C. Simmerling, B. Wang, and R. J. Woods, “The amber biomolecular simulation programs,” Journal of computational chemistry, 2005.
    [16] C. R. Søndergaard, M. H. Olsson, M. Rostkowski, and J. H. Jensen, “Improved treatment of ligands and coupling effects in empirical calculation and rationalization of pka values,” Journal of chemical theory and computation, 2011.
    [17] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning (ICML), 2017.
    [18] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
    [19] W. Kabsch and C. Sander, “Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features,” Biopolymers: Original Research on Biomolecules, 1983.
    [20] V. Gligorijević, P. D. Renfrew, T. Kosciolek, J. K. Leman, D. Berenberg, T. Vatanen, C. Chandler, B. C. Taylor, I. M. Fisk, H. Vlamakis, et al., “Structure-based protein function prediction using graph convolutional networks,” Nature communications, 2021.
    [21] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in neural information processing systems (NeurIPS), 2017.
    [22] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in ICLR, 2018.
    [23] C. Chen, J. Zhou, F. Wang, X. Liu, and D. Dou, “Structure-aware protein self-supervised learning,” Bioinformatics, 2023.
    [24] B. Jing, S. Eismann, P. Suriana, R. J. Townshend, and R. Dror, “Learning from protein structure with geometric vector perceptrons,” International Conference on Learning Representations (ICLR), 2020.
    [25] F. Baldassarre, D. Menéndez Hurtado, A. Elofsson, and H. Azizpour, “Graphqa: protein model quality assessment using graph convolutional networks,” Bioinformatics, 2021.

    [26] Z. Liu and Y. Huang, “Advantages of proteins being disordered,” Protein Science, 2014.
    [27] E. N. Baker and R. E. Hubbard, “Hydrogen bonding in globular proteins,” Progress in biophysics and molecular biology, 1984.
    [28] E. Eyal, R. Najmanovich, B. J. Mcconkey, M. Edelman, and V. Sobolev, “Importance of solvent accessibility and contact surfaces in modeling side-chain conformations in proteins,” Journal of computational chemistry, 2004.
    [29] L. Zhang, Y. Jiang, and Y. Yang, “Gnngo3d: Protein function prediction based on 3d structure and functional hierarchy learning,” IEEE Transactions on Knowledge and Data Engineering, 2023.
    [30] D. Bashford, “Macroscopic electrostatic models for protonation states in proteins,” Frontiers in Bioscience, 2004.
    [31] A. V. Onufriev and E. Alexov, “Protonation and pk changes in protein–ligand binding,” Quarterly reviews of biophysics, 2013.
    [32] H.-X. Zhou and X. Pang, “Electrostatic interactions in protein structure, folding, binding, and condensation,” Chemical reviews, 2018.
    [33] R. E. Hubbard and M. K. Haider, “Hydrogen bonds in proteins: role and strength,” eLS, 2010.
    [34] R. A. Copeland, Enzymes: a practical introduction to structure, mechanism, and data analysis. John Wiley & Sons, 2023.
    [35] A. Pyle, “Metal ions in the structure and function of rna,” JBIC Journal of Biological Inorganic Chemistry, 2002.
    [36] E. Ueda, P. Gout, and L. Morganti, “Current and prospective applications of metal ion–protein binding,” Journal of chromatography A, 2003.
    [37] B. Silvi and A. Savin, “Classification of chemical bonds based on topological analysis of electron localization functions,” Nature, 1994.
    [38] D. Frishman and P. Argos, “Knowledge-based protein secondary structure assignment,” Proteins: Structure, Function, and Bioinformatics, 1995.
    [39] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” Advances in neural information processing systems (NeurIPS), 2020.
    [40] Z. Hou, X. Liu, Y. Cen, Y. Dong, H. Yang, C. Wang, and J. Tang, “Graphmae: Self-supervised masked graph autoencoders,” in International Conference on Knowledge Discovery and Data Mining (KDD), 2022.
    [41] Y. Wang, J. Wang, Z. Cao, and A. Barati Farimani, “Molecular contrastive learning of representations via graph neural networks,” Nature Machine Intelligence, 2022.
    [42] Y. Wang, X. Pan, S. Song, H. Zhang, C. Wu, and G. Huang, “Implicit semantic data augmentation for deep networks,” in NeurIPS, 2020.
    [43] Y. Wang, X. Pan, S. Song, H. Zhang, C. Wu, and G. Huang, “Implicit semantic data augmentation for deep networks,” in arXiv, 2020.
    [44] J. Xie, W. Li, X. Li, Z. Liu, Y. S. Ong, and C. C. Loy, “Mosaicfusion: Diffusion models as data augmenters for large vocabulary instance segmentation,” 2023.
    [45] T. Ahmed, K. S. Pai, P. Devanbu, and E. T. Barr, “Automatic semantic augmentation of language model prompts (for code summarization),” 2024.
    [46] S. M. Shivashankar, “Semantic data augmentation with generative models,” in CVPR, 2024.

    [47] S. Doerr, M. Harvey, F. Noé, and G. De Fabritiis, “Htmd: high-throughput molecular dynamics for molecular discovery,” Journal of chemical theory and computation, 2016.
    [48] E. Eyal, R. Najmanovich, B. J. Mcconkey, M. Edelman, and V. Sobolev, “Importance of solvent accessibility and contact surfaces in modeling side-chain conformations in proteins,” Journal of computational chemistry, 2004.
    [49] I. S. Moreira, P. A. Fernandes, and M. J. Ramos, “Hot spots—a review of the protein-protein interface determinant amino-acid residues,” Proteins: Structure, Function, and Bioinformatics, 2007.
    [50] J. Martin, G. Letellier, A. Marin, J.-F. Taly, A. G. de Brevern, and J.-F. Gibrat, “Protein secondary structure assignment revisited: a detailed analysis of different assignment methods,” BMC structural biology, 2005.
    [51] M. M. Yamashita, L. Wesson, G. Eisenman, and D. Eisenberg, “Where metal ions bind in proteins.,” Proceedings of the National Academy of Sciences, 1990.
    [52] R. Singh, “A review of algorithmic techniques for disulfide-bond determination,” Briefings in Functional Genomics and Proteomics, 2008.
    [53] R. Thakuria, N. K. Nath, and B. K. Saha, “The nature and applications of π–π interactions: a perspective,” Crystal Growth & Design, 2019.
    [54] J. Lu, B. Zhong, Z. Zhang, and J. Tang, “Str2str: A score-based framework for zero-shot protein conformation sampling,” in ICLR, 2024.
    [55] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” Advances in neural information processing systems (NeurIPS), 2020.
