Predicting RNA-binding sites of proteins using support vector machines and evolutionary information

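Graduate student:	鄭成偉 Cheng-Wei Cheng
Thesis title:	Predicting RNA-binding sites of proteins using support vector machines and evolutionary information 使用支援向量機與演化資訊預測蛋白質核醣核酸結合位
Advisor:	許聞廉 Wen-Lian Hsu
Oral examination committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication:	2008
Academic year of graduation:	96 (ROC calendar)
Language:	English
Number of pages:	62
Keywords (Chinese, translated):	RNA, protein, interaction, smoothed position-specific scoring matrix, bioinformatics, computational biology
Keywords (English):	RNA-protein interaction, RNA-binding sites, Smoothed PSSM, Bioinformatics, Computational biology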

RNA-protein interaction plays an essential role in several biological processes, such as protein synthesis, gene expression, and post-transcriptional regulation, and it is an important target in antiviral drug discovery. Identifying the RNA-binding sites in proteins can therefore provide valuable insights for biologists. However, experimental determination of RNA-protein interactions remains time-consuming and labor-intensive, so computational approaches that predict RNA-binding sites from protein sequences have become highly desirable. In this thesis, we propose a method, RNAProB, that predicts RNA-binding sites using support vector machines and a new encoding scheme for the smoothed position-specific scoring matrix (smoothed PSSM). Evaluated by five-fold cross-validation on three benchmark data sets, our method achieves Matthews correlation coefficient (MCC) values of 0.68, 0.58, and 0.42, compared with 0.45, 0.35, and 0.32, respectively, for the state-of-the-art systems. Moreover, to guard against overfitting, we also estimate predictive performance with a three-way data split procedure, under which our approach obtains MCC values of 0.67, 0.56, and 0.40, respectively. In conclusion, our method significantly improves the performance of RNA-binding site prediction. The proposed encoding scheme for smoothed PSSM can also be applied to other research problems, such as DNA-protein interaction, protein-protein interaction, and prediction of post-translational modification.
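The abstract does not spell out how the smoothed PSSM is built or encoded, so the following is only a minimal sketch under stated assumptions: each smoothed row is taken to be the sum of the PSSM rows inside a centered smoothing window, and the feature vector for a residue is taken to be the smoothed rows inside a second centered window, flattened, with zero padding at the sequence termini. The function names `smooth_pssm` and `encode_residue`, the zero-padding rule, and the two-column toy matrix (a real PSSM has 20 columns) are all hypothetical choices made for illustration, not details confirmed by this record.

```python
from typing import List

Matrix = List[List[float]]


def smooth_pssm(pssm: Matrix, smooth_window: int) -> Matrix:
    """Sum the PSSM rows inside a centered window around each residue.

    Assumption of this sketch: rows that would fall outside the
    sequence contribute nothing (equivalent to zero padding).
    """
    if smooth_window < 1 or smooth_window % 2 == 0:
        raise ValueError("smooth_window must be a positive odd integer")
    half = smooth_window // 2
    n_rows = len(pssm)
    n_cols = len(pssm[0]) if pssm else 0
    smoothed: Matrix = []
    for i in range(n_rows):
        row = [0.0] * n_cols
        for j in range(max(0, i - half), min(n_rows, i + half + 1)):
            for k in range(n_cols):
                row[k] += pssm[j][k]
        smoothed.append(row)
    return smoothed


def encode_residue(matrix: Matrix, index: int, window: int) -> List[float]:
    """Flatten the rows in a centered window into one feature vector.

    Positions before the first or after the last residue are filled
    with zeros (again an assumption of this sketch).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n_cols = len(matrix[0])
    features: List[float] = []
    for j in range(index - half, index + half + 1):
        if 0 <= j < len(matrix):
            features.extend(matrix[j])
        else:
            features.extend([0.0] * n_cols)
    return features


# Toy 4-residue "PSSM" with 2 columns, for illustration only.
toy_pssm = [[1.0, 0.0], [2.0, -1.0], [0.0, 3.0], [4.0, 1.0]]
smoothed = smooth_pssm(toy_pssm, 3)
features = encode_residue(smoothed, 0, 3)
```

In a pipeline of this kind, one such vector per residue, paired with a binding or non-binding label, would be the input to the SVM; the two window sizes would be tuned as hyperparameters.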

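Interactions between RNA and proteins play an important role in living organisms; protein synthesis, gene expression, post-transcriptional regulation, and the development of antiviral drugs are all closely related to them. Identifying the RNA-binding amino acids in a protein sequence can help biologists better understand the mechanisms of protein-RNA interaction. However, determining the structure of RNA-protein interactions with traditional experimental methods is very costly in time and labor. Consequently, many researchers have recently turned to computational methods that predict RNA-binding sites from protein sequences. In this thesis, we address the problem using support vector machines (SVM) and the smoothed position-specific scoring matrix (smoothed PSSM), and we use five-fold cross-validation to train and test the proposed method. The results show that our method achieves Matthews correlation coefficient (MCC) values of 0.68, 0.58, and 0.42 on three different data sets, whereas previous studies reached MCC values of only 0.45, 0.35, and 0.32 on the same data sets. In addition, to guard against overfitted evaluation results, we assess the proposed method with a three-way data split, which yields MCC values of 0.67, 0.56, and 0.40 on the three data sets. In summary, the proposed method improves on the prediction results of existing methods, and the encoding scheme can be applied to many other biological prediction problems, such as DNA-protein interaction, protein-protein interaction, and post-translational modification prediction.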
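The reported results are all Matthews correlation coefficients, which are computed from the confusion-matrix counts as MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A small sketch of that computation for per-residue binary labels follows; the function name `mcc`, the 1 = binding / 0 = non-binding label convention, the choice to return 0.0 when the denominator is zero, and the toy labels are assumptions made for illustration, not taken from the thesis.

```python
import math
from typing import Sequence


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = binding)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0.0 when the coefficient is undefined.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy example: 3 binding and 5 non-binding residues.
observed = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
score = mcc(observed, predicted)  # TP=2, TN=4, FP=1, FN=1
```

Because binding residues are usually a minority class, MCC is a more informative summary than plain accuracy: a predictor that labels every residue as non-binding scores 0.0 under the convention above, however high its accuracy.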

Abstract (in Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1. INTRODUCTION
1.1.    Central dogma of molecular biology
1.2.    Background
1.3.    Previous works
1.4.    Challenges
1.5.    Our method and future applications
Chapter 2. METHOD
2.1.    Data sets
2.2.    Support vector machines (SVM)
2.3.    Feature extraction and representation
2.4.    Window size and parameter optimization
2.5.    System architecture
2.6.    Performance evaluation
2.7.    Training and testing
Chapter 3. RESULTS
3.1.    Effect of smoothed PSSM
3.2.    Performance of five-fold cross-validation and three-way data split
3.3.    Comparison with other approaches
Chapter 4. DISCUSSION
4.1.    Amino acid composition of data sets
4.2.    Comparisons of the effects of smoothed PSSM and standard PSSM
Chapter 5. CONCLUSION
REFERENCES
APPENDIX A.    Experiment results of the RBP86
APPENDIX B.    Experiment results of the RBP109
APPENDIX C.    Experiment results of the RBP107

                                


Full text: not authorized for public release (on-campus or off-campus network)
