Graduate student: | 泰 吉 Pitti, Thejkiran |
---|---|
Thesis title: | Protein association network based method with applications to N-linked glycosites and DNA binding protein predictions |
Advisors: | 宋定懿 Sung, Ting-Yi; 楊立威 Yang, Lee-Wei |
Oral defense committee: | 許聞廉 Hsu, Wen-Lian; 林仲彥 Lin, Chung-Yen; 張家銘 Chang, Jia-Ming |
Degree: | Doctor |
Department: | College of Life Sciences and Medicine - Institute of Bioinformatics and Structural Biology |
Year of publication: | 2020 |
Academic year of graduation: | 108 (ROC calendar) |
Language: | English |
Number of pages: | 43 |
Keywords (translated from Chinese): | protein association network, N-linked glycosylation sites, DNA-binding proteins |
Keywords (English): | N-linked glycosites |
Determining specific protein functions experimentally is difficult because it is time-consuming and technically challenging. It is thus essential to develop new computational approaches to predict a specific function of proteins. Conventional sequence similarity search tools are usually applied to predict a specific function of a protein when a set of proteins with known function annotations that share high sequence similarity with it can be found. However, when the search is restricted to a non-redundant (NR) protein sequence dataset, the prediction problem becomes harder because it is difficult to find homologous sequences.
In this study, we therefore present the concept of a protein association network to predict a specific function of a query protein (QP) based on a large NR dataset with known functional annotations. To achieve this, we use a sequence similarity search tool, e.g., HHblits, to find proteins similar to the QP and to each protein in the NR dataset. The protein association network of the QP is a star graph in which the center vertex corresponds to the QP and the other vertices are proteins in the NR dataset that have similar proteins in common with the QP. Each edge in the network carries a weight. If the QP shares one or more similar proteins with any protein in the NR dataset, we use these NR proteins to predict the function annotation of the QP.
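The star-graph construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how edge weights or the final prediction are computed, so the Jaccard index used as the weight, the weighted vote, the `threshold` parameter, and all function and variable names here are hypothetical. The hit sets stand in for the output of a similarity search tool such as HHblits.

```python
from typing import Dict, Set, Tuple


def build_star_network(
    qp_hits: Set[str],
    nr_hits: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Return {nr_protein_id: edge_weight} for the star graph centred on the QP.

    An edge exists only when the QP and an NR protein share at least one
    similar protein. Using the Jaccard index as the weight is an assumption.
    """
    edges: Dict[str, float] = {}
    for protein_id, hits in nr_hits.items():
        shared = qp_hits & hits
        if shared:
            edges[protein_id] = len(shared) / len(qp_hits | hits)
    return edges


def predict_from_network(
    edges: Dict[str, float],
    labels: Dict[str, bool],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Weighted vote over the satellite annotations (hypothetical scoring rule)."""
    total = sum(edges.values())
    if total == 0:
        return 0.0, False  # no neighbours, so no evidence for the function
    score = sum(w for pid, w in edges.items() if labels[pid]) / total
    return score, score >= threshold


# Toy hit lists, as they might come from a similarity search tool.
qp_hits = {"h1", "h2", "h3"}
nr_hits = {
    "P1": {"h1", "h2"},        # shares 2 hits, union of 3
    "P2": {"h3", "h4", "h5"},  # shares 1 hit, union of 5
    "P3": {"h9"},              # shares nothing, so no edge
}
labels = {"P1": True, "P2": False, "P3": True}

edges = build_star_network(qp_hits, nr_hits)
score, is_positive = predict_from_network(edges, labels)
```

In this toy case only `P1` and `P2` become satellite vertices, and the positively annotated `P1` dominates the vote because it shares more hits with the QP.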
To demonstrate the significance of the protein association network-based method, we selected two protein prediction problems for their biological importance: N-linked glycosylation site prediction and DNA-binding protein prediction. First, N-linked glycosylation is a post-translational modification associated with several biological functions, such as protein folding, immune response, and cell-cell interactions. Second, DNA-binding proteins (DBPs) play a vital role in diverse functions, such as DNA replication, DNA repair, transcription, and gene regulation. Both types of proteins have significant applications in drug development for treating various diseases. Although several structure-based and sequence-based predictors are available, predictors with better performance are still needed. We thus applied protein association network-based methods to these two prediction problems. Notably, for both problems we constructed rigorous NR datasets from UniProt, and for the second problem we additionally constructed NR datasets from the Protein Data Bank (PDB).
For the first application, N-linked glycosylation site prediction in human proteins, we propose N-GlyDE, a two-stage prediction tool that uses protein association networks as the first-stage predictor and integrates them with a second-stage predictor based on support vector machines (SVMs). N-GlyDE is trained on NR protein sequence datasets rigorously constructed from UniProt. For the QP, the first-stage predictor determines a prediction score. The second-stage SVM predictor uses gapped dipeptides, predicted secondary structure, and predicted surface accessibility as features to predict glycosites on asparagines in N-X-S/T sequons. A pattern-based approach is used to encode the latter two types of features, reducing the feature dimensions to suit the relatively small NR datasets. To derive the final predictions of N-GlyDE, the second-stage prediction results are weight-adjusted using the first-stage prediction score obtained from the protein association network of the QP. We confine the performance evaluation to N-X-S/T sequons only, rather than evaluating every asparagine as most existing predictors do. On an independent dataset of 53 glycoprotein and 33 non-glycoprotein sequences, N-GlyDE outperforms the compared tools, achieving a Matthews correlation coefficient (MCC) of 0.499 and an accuracy of 0.740.
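To make the candidate sites and one of the feature types concrete, the following Python sketch scans a sequence for N-X-S/T sequons and counts gapped dipeptides in a short window. It is not the N-GlyDE implementation. The exclusion of proline at position X follows the conventional sequon definition and is an assumption here, since the abstract only says "N-X-S/T". The `max_gap` parameter, the multiplicative `adjust_score` rule, and all names are likewise hypothetical.

```python
import re
from collections import Counter
from typing import List

# Conventional sequon: N, then any residue except proline, then S or T.
# The proline exclusion is an assumption; the abstract only says "N-X-S/T".
# The lookahead makes overlapping sequons (e.g. "NNTS") all visible.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> List[int]:
    """Return 0-based positions of asparagines that start an N-X-S/T sequon."""
    return [m.start() for m in SEQUON.finditer(sequence)]


def gapped_dipeptides(window: str, max_gap: int = 2) -> Counter:
    """Count residue pairs (a, gap, b) where b follows a after `gap` residues."""
    counts: Counter = Counter()
    for gap in range(max_gap + 1):
        for i in range(len(window) - gap - 1):
            counts[(window[i], gap, window[i + gap + 1])] += 1
    return counts


def adjust_score(site_score: float, protein_score: float) -> float:
    """Scale a second-stage site score by the first-stage protein score.

    A hypothetical multiplicative rule, not the formula used in the thesis.
    """
    return site_score * protein_score


seq = "MKNGSANPTLLNNTSA"
sites = find_sequons(seq)                       # N at 6 is skipped: X is proline
features = gapped_dipeptides("NGS", max_gap=1)  # pairs with gap 0 and gap 1
final = adjust_score(0.8, 0.5)
```

Restricting prediction and evaluation to the positions returned by `find_sequons` mirrors the choice described above of scoring only asparagines inside sequons rather than every asparagine.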
In the second application, DNA-binding protein prediction, we propose two prediction methods, PANet-DNAseq and PANet-DNAchn, for mammalian protein sequences and protein chains, respectively. Both predictors use protein association networks, without machine learning, to predict whether a QP is a DBP. Evaluated on an independent dataset of 31 DBP and 93 non-DNA-binding protein (nDBP) sequences from UniProt, PANet-DNAseq attains an MCC of 0.731 and an accuracy of 0.895. Similarly, on an independent dataset of 25 DBP and 75 nDBP chain sequences from PDB, PANet-DNAchn obtains an MCC of 0.428 and an accuracy of 0.770. Both PANet-DNAseq and PANet-DNAchn outperform their respective compared predictors in MCC, precision, and accuracy. The performance of both predictors decreases when PANet-DNAseq is evaluated on the independent dataset of protein chains and PANet-DNAchn is evaluated on the independent dataset of protein sequences. These results underline the importance of using a consistent data type for the training and testing datasets to achieve better prediction performance.
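Because the test sets above are imbalanced (roughly one positive for every three negatives), accuracy and MCC can tell different stories. The sketch below computes both from confusion-matrix counts using their standard definitions. The counts are invented for illustration and are not the confusion matrices behind the reported results.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only (30 positives, 70 negatives), not thesis data.
acc_a = accuracy(tp=20, tn=60, fp=10, fn=10)
mcc_a = mcc(tp=20, tn=60, fp=10, fn=10)

# A predictor that labels everything negative still looks accurate,
# but its MCC shows that it carries no information.
acc_b = accuracy(tp=0, tn=70, fp=0, fn=30)
mcc_b = mcc(tp=0, tn=70, fp=0, fn=30)
```

The second case is why MCC is a useful companion to accuracy on imbalanced data: the all-negative predictor scores 0.7 on accuracy but 0.0 on MCC.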