
Graduate Student: 黃柏升 (Huang, Po-Sheng)
Thesis Title: 運用深度學習方法有效預測蛋白質的振盪尺度 (Using deep learning to establish a predictive model to effectively estimate the sizes of residue fluctuations in proteins)
Advisor: 林澤 (Lin, Che)
Committee Members: 楊立威 (Yang, Lee-Wei), 李祈均 (Lee, Chi-Chun), 阮雪芬 (Juan, Hsueh-Fen), 黃宣誠 (Huang, Hsuan-Cheng)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2018
Graduating Academic Year: 106 (2017-2018)
Language: Chinese
Number of Pages: 54
Chinese Keywords: 深度類神經網路 (deep neural network), 蛋白質交互作用 (protein-protein interaction), 蛋白質殘基絕對振盪幅度 (absolute sizes of residue fluctuations in proteins)
English Keywords: Deep neural network, Protein-protein interactions, Residue fluctuations in proteins
Many important biological reactions influence the interactions between proteins. Among the relevant factors, the intrinsic conformational changes of proteins and the sizes of the thermal fluctuations of their residues are two of the most important for studying such interactions. Molecular dynamics (MD) simulation is a method widely used to explore structural changes between proteins and to observe the interactions of proteins with other proteins or small molecules (ligands). Residue fluctuations can be observed through MD simulation, but the method requires a great deal of computation time and resources.
    Residue fluctuations can be decomposed into a fluctuation direction and a fluctuation size. Existing physical models such as the Elastic Network Model (ENM) can effectively predict the direction of residue fluctuations, but no fast method has yet been proposed that predicts fluctuation sizes from a single protein's three-dimensional structure. In this thesis, we use a deep neural network (DNN) to build a fast and accurate predictive model for the sizes of residue fluctuations. We selected 2792 protein families with known three-dimensional structures and extracted per-protein and per-residue features from protein sequences, structures, and ENM-defined protein dynamics as the basis for training our model. The prediction target is the absolute fluctuation amplitude (root-mean-square fluctuation, RMSF) of each residue. Because MD simulation is too time- and resource-consuming to compute RMSF_MD for every residue in every protein family, this thesis instead starts from three measures: Native Ensemble RMSF (RMSF_NE), Average B-factor RMSF (RMSF_B), and GNM profile RMSF (RMSF_G). We analyze in detail the prediction accuracy obtained with Shifted RMSF_NE (RMSF_SNE) and Shifted RMSF_G (RMSF_SG), produced by shifting RMSF_NE and RMSF_G to the same absolute amplitude level as RMSF_B, together with RMSF_B itself, as final targets. The experimental results show that the DNN predictive model achieves good performance on RMSF_SG and RMSF_B. Predicting the sizes of residue fluctuations with our proposed deep learning model is therefore highly feasible, and we expect this work to make a significant contribution to studies of protein-protein interactions.
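The shifting step described in the abstract can be sketched as follows. This is a minimal illustration that assumes the shift is a constant per-protein offset aligning the mean of a source RMSF profile with the mean of the B-factor-derived profile; the exact procedure is defined in the thesis itself, and all variable names and values here are hypothetical.

```python
import numpy as np

def shift_profile(rmsf_src, rmsf_b):
    """Shift a per-residue RMSF profile by a constant offset so that its
    mean matches the absolute level of the B-factor-derived profile.
    Illustrative only; the thesis defines the exact shifting procedure."""
    return rmsf_src + (np.mean(rmsf_b) - np.mean(rmsf_src))

# Toy 5-residue profiles with hypothetical values.
rmsf_ne = np.array([0.3, 0.5, 0.4, 0.8, 0.6])  # Native Ensemble RMSF
rmsf_b = np.array([0.9, 1.1, 1.0, 1.4, 1.2])   # Average B-factor RMSF

rmsf_sne = shift_profile(rmsf_ne, rmsf_b)       # shifted Native Ensemble RMSF
```

The shift changes only the absolute level of the profile, not its shape, so the per-residue pattern captured by RMSF_NE or RMSF_G is preserved while the amplitude scale is borrowed from RMSF_B.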


Protein-protein interactions are influenced by many important biological reactions. Among the relevant factors, the intrinsic conformational changes of proteins and the thermal fluctuations of their residues are two important ones for exploring such interactions. Molecular dynamics (MD) simulation has been widely used to explore structural changes between proteins and to observe the interactions of proteins with other proteins or ligands, but using MD to simulate protein fluctuations is usually costly and inefficient, especially for large protein complexes.
      Residue fluctuations in proteins can be decomposed into a fluctuation direction and a fluctuation size. Existing physical models such as the Elastic Network Model (ENM) can effectively estimate the direction of residue fluctuations; however, there has been no efficient and well-accepted method for predicting the absolute sizes of residue fluctuations. In this thesis, we use a deep neural network (DNN) to establish a fast and accurate predictive model that effectively estimates the sizes of residue fluctuations in proteins. We selected 2792 protein structural clusters with known three-dimensional structures, and then extracted 39 features from protein sequences, structures, and ENM-defined vibrational dynamics to train our DNN model. The prediction target is the root-mean-square fluctuation (RMSF) of each residue in water, but because MD simulation is too time-consuming and expensive, we could not obtain it directly from MD for all 2792 clusters. We therefore used three methods to approximate the MD result RMSF_MD: Native Ensemble RMSF (RMSF_NE), Average B-factor RMSF (RMSF_B), and GNM profile RMSF (RMSF_G). After analysis, we shifted RMSF_NE and RMSF_G to the absolute level of RMSF_B to form two new targets, shifted RMSF_NE (RMSF_SNE) and shifted RMSF_G (RMSF_SG). Our results demonstrate that the DNN predictive model achieves desirable performance on RMSF_B and RMSF_SG, indicating that the proposed model is highly feasible for predicting the sizes of residue fluctuations in proteins. We believe that such an efficient DNN predictive model can have a significant impact on the study of protein-protein interactions.
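As a rough illustration of the kind of model described above, the following sketches the forward pass of a fully connected DNN regressor that maps 39 per-residue features to one predicted RMSF value. The layer sizes and random weights are placeholders, not the architecture actually trained in the thesis (which is described in Chapter 4).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """Rectified linear unit, a common activation for DNN regressors."""
    return np.maximum(x, 0.0)

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected DNN: hidden ReLU layers followed by
    a linear output layer that emits one predicted RMSF value per residue."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                 # hidden layer
    return h @ weights[-1] + biases[-1]     # linear regression output

# Hypothetical layer sizes: 39 input features -> 64 -> 32 -> 1 output.
sizes = [39, 64, 32, 1]
weights = [rng.normal(0.0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(5, 39))                # 5 residues, 39 features each
y_pred = mlp_forward(x, weights, biases)    # shape (5, 1)
```

In practice the weights would be learned by backpropagation, with regularization techniques such as Dropout and Batch Normalization (covered in Chapter 3) applied between the hidden layers.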

Acknowledgements
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
Chapter 2  The Protein Dataset
    2.1 Selection and Classification of Protein Data
    2.2 Root-mean square fluctuation (RMSF)
        2.2.1 Native Ensemble RMSF (RMSF_NE)
        2.2.2 MD-sampled RMSF of proteins (RMSF_MD)
        2.2.3 Average B-factor RMSF (RMSF_B)
        2.2.4 Shifted Native Ensemble RMSF (RMSF_SNE)
        2.2.5 Shifted GNM profile RMSF (RMSF_SG)
    2.3 Feature Description
Chapter 3  Methods
    3.1 Deep Neural Network
    3.2 Dropout
    3.3 Batch Normalization
    3.4 Model Evaluation Methods
        3.4.1 Pearson Correlation Coefficient
        3.4.2 Spearman Correlation Coefficient
        3.4.3 Mean Absolute Error
        3.4.4 Mean Squared Error
        3.4.5 Percentage Difference
    3.5 Extended Garson's Algorithm
Chapter 4  Experimental Results and Discussion
    4.1 Correlation Analysis of RMSF_MD with RMSF_NE, RMSF_B, and RMSF_G
    4.2 Results of Shifting RMSF_NE and RMSF_G to the RMSF_B Scale
    4.3 Criteria for Evaluating Machine Learning Predictions
    4.4 Analysis and Discussion of Machine Learning Results
        4.4.1 Data Preprocessing
        4.4.2 Distribution of the Training Data
        4.4.3 Building the DNN Model
        4.4.4 Analysis of DNN Results
    4.5 Correlation Analysis of RMSF_MD, Experimentally Derived RMSF, and Predicted RMSF
    4.6 Comparison of Different Machine Learning Methods
    4.7 Feature Importance
Chapter 5  Conclusions and Future Work
References
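The evaluation metrics listed under Section 3.4 can be sketched as follows. This is a minimal illustration; the percentage-difference formula here (mean absolute relative difference, in percent) is an assumption, since the record does not spell out the thesis's exact definition.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(y_true, y_pred):
    """Compute the five Section 3.4 metrics for one predicted RMSF profile."""
    err = y_true - y_pred
    return {
        "PCC": pearsonr(y_true, y_pred)[0],    # Pearson correlation coefficient
        "SCC": spearmanr(y_true, y_pred)[0],   # Spearman correlation coefficient
        "MAE": float(np.mean(np.abs(err))),    # mean absolute error
        "MSE": float(np.mean(err ** 2)),       # mean squared error
        # Assumed definition of percentage difference.
        "PctDiff": float(np.mean(np.abs(err) / np.abs(y_true)) * 100.0),
    }

# Toy example with hypothetical RMSF values.
y_true = np.array([0.8, 1.0, 1.2, 1.5, 0.9])
y_pred = np.array([0.9, 1.0, 1.1, 1.6, 0.8])
metrics = evaluate(y_true, y_pred)
```

The two correlation coefficients assess whether the predicted profile follows the shape of the target, while MAE, MSE, and the percentage difference assess how close the predictions are in absolute amplitude.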

