
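Graduate student: 張燕華
Thesis title: 以蛋白質序列、結構及固有動態來正確地預測酵素催化位點
Traits derived from protein sequence, structure and intrinsic dynamics facilitate accurate predictions of enzyme active sites
Advisor: 楊立威
Committee members: 楊立威, 林志侯, 蘇士哲
Degree: Master
Department: College of Life Sciences and Medicine - Institute of Bioinformatics and Structural Biology
Year of publication: 2013
Academic year of graduation: 102
Language: Chinese
Number of pages: 49
Keywords (translated from Chinese): enzyme active site prediction, sequence, structure, intrinsic dynamics, conservation score, relative solvent accessible surface area, acid dissociation constant, spatial clustering score, multiple regression, catalytic site
Keywords (English): active site prediction, intrinsic dynamics, solvent accessibility, acid dissociation constant, spatial clustering score, Gaussian Network Model, GNM, partial least squares regression, PLS, catalytic propensity, catalytic site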
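  • In recent years, researchers have proposed a series of active site prediction algorithms based on enzyme sequence and structure, and the enzymes they cover are no longer restricted to specific enzyme families. In this thesis, we propose a partial least squares regression model (PLS regression model) to predict the location of active sites, and we train it on 225 non-homologous enzymes.
    We use a variety of sequence, structural, and dynamic features as inputs to the prediction model, including residue conservation scores, catalytic propensity, relative solvent accessibility (RSA), pKa changes, the intrinsic dynamics of the enzyme, the average RSA difference between adjacent residues, the distance from each residue to the domain center, and spatial clustering scores.
    When we select the top 2 candidates ranked by the PLS regression model, 1 of them contains a catalytic residue, and on average 1 of every 3 catalytic residues is predicted (specificity=0.54, sensitivity=0.35), with a Matthews correlation coefficient (MCC) of 0.38. When we select the top 7 candidates, on average 1 of every 4 candidates contains a catalytic residue, and on average 2 of every 3 catalytic residues are predicted (specificity=0.29, sensitivity=0.66), with MCC=0.35. In our model, the features that most influence the predictions are the residue conservation score, the spatial clustering score, the distance from each residue to the domain center, and the average RSA difference between adjacent residues.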


    Active site prediction methods based on sequence and structure analysis have developed steadily over the past few years, and their scope is no longer limited to specific enzyme families. In this thesis, we present a partial least squares (PLS) regression model, trained on 225 non-homologous enzymes, to predict the location of active sites. The model inputs are conservation scores, catalytic propensity, the intrinsic dynamics of enzymes, relative solvent accessibility (RSA), pKa changes, the average RSA deviation between sequentially adjacent residues, the distance between each residue and the domain center, and spatial clustering scores.
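To make the modelling step concrete, the following is a minimal sketch of single-response PLS regression (a NIPALS-style PLS1), written in plain Python. It is not the thesis's implementation: the record does not state which PLS algorithm or software was used, and the function names `fit_pls1` and `predict_pls1`, the feature layout (one row of feature values per residue), and the 0/1 catalytic label are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def fit_pls1(X, y, n_components):
    """Fit a PLS1 model (hypothetical helper, NIPALS-style with deflation).

    X: list of per-residue feature rows; y: list of responses
    (for example 1.0 for catalytic residues and 0.0 otherwise).
    """
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered features
    f = [v - y_mean for v in y]                                # centered response
    components = []
    for _ in range(n_components):
        # Weight vector: direction in feature space with maximal covariance with f.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]  # latent scores
        tt = dot(t, t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # loadings
        q = dot(f, t) / tt              # regression of the response on the scores
        # Deflate: remove what this component explains before extracting the next.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def predict_pls1(model, x):
    """Score one feature row by replaying the fitted components."""
    x_mean, y_mean, components = model
    e = [a - b for a, b in zip(x, x_mean)]
    score = y_mean
    for w, p, q in components:
        t = dot(e, w)
        score += q * t
        e = [a - t * b for a, b in zip(e, p)]
    return score
```

Under these assumptions, each residue of a query enzyme would be scored with `predict_pls1` and the residues sorted by score, so that the top 2 or top 7 become the prediction candidates described in the abstract.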
    When we select the top 2 candidates, the model achieves sensitivity=0.35, specificity=0.54, and a Matthews correlation coefficient (MCC) of 0.38. When we select the top 7 candidates, it achieves sensitivity=0.66, specificity=0.29, and MCC=0.38. The dominant features in the PLS model are residue conservation, the spatial clustering scores of prediction candidates, the distance between each residue and the domain center, and the average RSA deviation between sequentially adjacent residues.
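The top-k evaluation described above can be sketched as follows. This is an illustrative reading, not the thesis's evaluation code: the function name `top_k_metrics` and its inputs are hypothetical. The value the abstract calls "specificity" appears to track the fraction of selected candidates that are catalytic, which is more commonly called precision, TP/(TP+FP). That is an interpretation of the reported numbers, not something the record states, so the sketch reports it under the name `precision`.

```python
import math


def top_k_metrics(scores, is_catalytic, k):
    """Call the k highest-scoring residues 'catalytic' and evaluate the calls.

    scores: one predicted score per residue (hypothetical model output).
    is_catalytic: one boolean per residue (annotated catalytic residues).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = set(order[:k])
    tp = sum(1 for i in predicted if is_catalytic[i])
    fp = len(predicted) - tp
    fn = sum(1 for i, c in enumerate(is_catalytic) if c and i not in predicted)
    tn = len(scores) - tp - fp - fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Matthews correlation coefficient from the 2x2 confusion matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": sensitivity, "precision": precision, "mcc": mcc,
    }
```

Raising k trades precision for sensitivity, which matches the direction of the change between the top-2 and top-7 figures in the abstract.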

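    Abstract (Chinese)
    Abstract
    Acknowledgements
    1 Introduction
    2 Methods
    2.1 Enzyme sources
    2.2 Extraction of enzyme features
    2.2.1 Conservation scores of residues in enzymes
    2.2.2 Intrinsic dynamics of enzymes
    2.2.3 Solvent accessibility
    2.2.4 Acid dissociation constants of enzyme active sites
    2.2.5 Distance from residues to the enzyme domain center
    2.2.6 Spatial clustering scores of catalytic residues in enzymes
    2.3 Theory and statistical methods
    2.3.1 Gaussian Network Model (GNM)
    2.3.2 Partial least squares regression (PLS)
    3 Statistics and scoring of residue features in enzymes
    3.1 Behavior of each feature across residues in enzymes, and its scoring scheme
    3.1.1 Statistics and scoring rule for residue conservation scores
    3.1.2 Statistics and scoring rule for residue relative solvent accessible surface
    3.1.3 Statistics and scoring rule for the distance from residues to the enzyme domain center
    3.1.4 Statistics and scoring rule for residue acid dissociation constants
    3.1.5 Statistics and scoring rule for the catalytic propensity of residue types in enzymes
    3.2 Correlations among the features
    3.3 Reasons for choosing PLS
    4 Prediction algorithm for catalytic residues
    4.1 Building the first PLS regression model
    4.2 Two new features added for the second PLS regression prediction model
    4.2.1 Difference in relative solvent accessible surface area between adjacent residues
    4.2.2 Weighted score of the first model's prediction results
    4.3 Building the second PLS regression model
    5 Prediction results and discussion
    5.1 Results and discussion of the first PLS regression model
    5.2 Results and discussion of the second PLS regression model
    5.3 Performance metrics of the prediction model: sensitivity, specificity, and MCC
    6 Conclusion
    References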


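    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)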
