Graduate student: 黃敏慈
Huang, Min-Tzu
Thesis title: 比較貝氏二元迴歸(BBR)以及微陣列預測分析(PAM)方法於基因表現量之分類功能
Comparison of Bayesian Binary Regression (BBR) and Prediction Analysis of Microarray (PAM) in Classification Problems
Advisor: 熊昭
Hsiung, Chao
Oral examination committee:
Degree: Master
Department: College of Science - Institute of Statistics
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Number of pages: 34
Keywords: classification (分類問題), BBR, PAM, cross validation (交叉驗證法), training set (訓練樣本), testing set (測試樣本)
    Microarray gene expression data have been recognized in the literature as a useful tool for disease classification, and many classification methods have been proposed and compared. Among them, Prediction Analysis of Microarray (PAM) is widely used, while Bayesian Binary Regression (BBR) is a more recent method proposed in the field of text classification. In the first part of this study, we used BBR and PAM to analyze gene expression datasets and compared their performance in terms of the error rates on the training set and the testing set. On the leukemia and lung cancer gene expression datasets, both methods classified well and performed similarly, but PAM generally used more genes for prediction than BBR. In the second part, we repeatedly resampled the training set to investigate how its sample size and composition affect the error rate on the testing set. The training set was varied in two ways: changing the sample size while holding the class composition fixed, and changing the class proportions while holding the sample size fixed. The resampling results showed that, for a fixed testing set, a larger training set gave a lower testing error rate. The composition of the training set also matters: when the class proportions of the training set and the testing set differ, the prediction error rates estimated from the two sets may diverge.
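    The resampling design described in the second part of the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the data are synthetic Gaussian "expression profiles", the classifier is a plain nearest-centroid rule (a simplified stand-in for PAM, which additionally shrinks the centroids), and all function names and parameter values are hypothetical. It holds one testing set fixed, repeatedly draws balanced training sets of different sizes from a pool, and averages the testing error over the draws.

```python
import random
import statistics


def make_samples(n, label, n_genes, shift, rng):
    """Synthetic profiles (assumption): class 1 is shifted on the first 5 genes."""
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        if label == 1:
            for g in range(5):
                x[g] += shift
        rows.append((x, label))
    return rows


def fit_centroids(train):
    """Per-class mean profile. No shrinkage is applied, unlike PAM."""
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in train if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def error_rate(centroids, data):
    wrong = sum(1 for x, y in data if predict(centroids, x) != y)
    return wrong / len(data)


def resample_experiment(pool0, pool1, test, n_per_class, repeats, rng):
    """Mean testing error over repeated balanced training draws of a given size."""
    errors = []
    for _ in range(repeats):
        train = rng.sample(pool0, n_per_class) + rng.sample(pool1, n_per_class)
        errors.append(error_rate(fit_centroids(train), test))
    return statistics.fmean(errors)


rng = random.Random(0)  # fixed seed so the run is reproducible
n_genes = 50
pool0 = make_samples(40, 0, n_genes, 1.0, rng)
pool1 = make_samples(40, 1, n_genes, 1.0, rng)
# One fixed testing set, reused for every training draw.
test = make_samples(60, 0, n_genes, 1.0, rng) + make_samples(60, 1, n_genes, 1.0, rng)

results = {n: resample_experiment(pool0, pool1, test, n, 50, rng) for n in (3, 10, 30)}
for n, err in results.items():
    print(f"training size per class = {n:2d}, mean testing error = {err:.3f}")
```

    On synthetic data of this kind, the mean testing error would typically be expected to fall as the training size grows, in line with the finding reported above; the exact values depend on the seed and on the assumed data model. Changing the class proportions could be explored in the same way by drawing different numbers of samples from `pool0` and `pool1`.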

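    Chinese Abstract
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Preface
        1.2 Research Motivation and Structure
    Chapter 2 Materials and Methods
        2.1 Introduction to BBR (Bayesian Binary Regression)
            2.1.1 BBR: Choice of Prior Distribution
            2.1.2 BBR: Parameter Estimation Method
            2.1.3 BBR: The CLG Algorithm
        2.2 Introduction to PAM (Prediction Analysis of Microarrays)
        2.3 Parameter Selection: Cross-Validation
        2.4 Data Sources and Description
            2.4.1 Leukemia Data
            2.4.2 Lung Cancer Data
        2.5 Data Preprocessing
    Chapter 3 Results
        3.1 Comparison of BBR and PAM Analysis Results
        3.2 Comparison of Accuracy over Fifty Repeated Samplings
        3.3 Comparison under Varying Training-Set Sample Size
        3.4 Comparison under Varying Class Proportions in the Training Set
    Chapter 4 Conclusions and Discussion
    References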


    Full text: not authorized for public release (on-campus network)
    Full text: not authorized for public release (off-campus network)
