
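Graduate student: 林國勝 (Kuo-Sheng Lin)
Thesis title: 資料挖礦於生物晶片資料分析之研究 (A Study on Data Mining Analysis in Bio-chip Data)
Advisor: 簡禎富
Committee members:
Degree: Doctoral
Department: College of Engineering, Department of Industrial Engineering and Engineering Management
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar, 2007-2008)
Language: Chinese
Number of pages: 102
Keywords (Chinese): 生物晶片 (bio-chip), 資料挖礦 (data mining), 決策樹分析 (decision tree analysis), 集群分析 (cluster analysis), 基因微陣列技術 (gene microarray technology), 顯著性分析 (significance analysis)
Keywords (English): Bio-chip, Data Mining, Decision Tree Technology, Cluster Analysis, Microarray Technology, Significance Analysis of Microarrays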
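  • Research on and applications of biotechnology and bio-chips have flourished over the past decade. The most notable achievements include breakthroughs in microarray technology for bio-chips and in gene cloning, to which advances in information technology have contributed greatly. Many of the resulting data processing and analysis problems remain to be solved, however, especially the fact that bio-chip data have many variables but few samples. This study addresses the characteristics of bio-chip data by developing data mining methods and models that explore the relationships between diseases and specific genes and construct the corresponding rules. It then designs a cluster analysis algorithm for bio-chip data that builds genome-wide relationships and extracts valuable information to support medical diagnostic decisions. The approach is validated on breast cancer microarray data from the Stanford University microarray database: from more than 40,000 genes and 64 samples, Significance Analysis of Microarrays (SAM) and decision tree analysis are used to identify influential genes and diagnostic decision rules, and the breast cancer gene data are used to produce a complete genome-wide cluster analysis result.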
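The gene selection and rule building steps described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the data are synthetic, the 32/32 split of 64 samples and the planted informative genes are assumptions, the fudge constant `s0` is fixed by hand, and the permutation step that full SAM uses to estimate the false discovery rate is omitted. The one-level tree (`best_stump`) stands in for a full decision tree learner.

```python
import math
import random

def sam_like_score(group_a, group_b, s0=0.1):
    """SAM-style relative difference d = (mean_a - mean_b) / (s + s0) for one gene.

    s is the pooled standard error of the mean difference; s0 is a small
    positive constant (fixed here by assumption) that damps genes whose
    variance is close to zero.
    """
    n1, n2 = len(group_a), len(group_b)
    m1, m2 = sum(group_a) / n1, sum(group_b) / n2
    ss = sum((x - m1) ** 2 for x in group_a) + sum((x - m2) ** 2 for x in group_b)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

def best_stump(values, labels):
    """One-level decision tree: the threshold and direction with the fewest training errors."""
    best = None
    for t in sorted(set(values)):
        for class_above in (0, 1):  # class predicted when value > t
            errors = sum(
                (class_above if v > t else 1 - class_above) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, t, class_above)
    return best

# Synthetic expression matrix (hypothetical): 200 genes x 64 samples.
random.seed(0)
n_genes, n_tumor, n_normal = 200, 32, 32
labels = [1] * n_tumor + [0] * n_normal  # 1 = tumor, 0 = normal
data = []
for g in range(n_genes):
    shift = 3.0 if g < 5 else 0.0  # genes 0..4 are planted as informative
    row = [random.gauss(shift, 1.0) for _ in range(n_tumor)]
    row += [random.gauss(0.0, 1.0) for _ in range(n_normal)]
    data.append(row)

# Step 1: rank genes by absolute score and keep the top five.
scores = [sam_like_score(row[:n_tumor], row[n_tumor:]) for row in data]
ranked = sorted(range(n_genes), key=lambda g: abs(scores[g]), reverse=True)
top_genes = ranked[:5]

# Step 2: derive a simple decision rule from the highest-ranked gene.
errors, threshold, class_above = best_stump(data[top_genes[0]], labels)
high, low = ("tumor", "normal") if class_above == 1 else ("normal", "tumor")
print("selected genes:", sorted(top_genes))
print(f"IF gene_{top_genes[0]} > {threshold:.2f} THEN {high} ELSE {low} "
      f"({errors} of {len(labels)} training errors)")
```

Filtering genes before fitting the tree reflects the many-variables, few-samples setting: a tree fitted directly to tens of thousands of genes with 64 samples would find spurious splits easily.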


    Owing to a growing number of breakthroughs in microarray technology for bio-chips and in gene cloning, biotechnology is now an emerging and promising industry worldwide. Although advances in information technology enable the complex computation and large-scale data storage that biotechnology requires, a number of critical issues remain to be addressed in both practice and research. This study develops a data mining framework, together with a proposed cluster analysis algorithm, for analyzing large bio-chip data sets, which differ from the data typically handled in manufacturing and service industries. Bio-chip data are high-dimensional: they have more attributes than specimens. Feature selection and extraction are therefore critical for removing noisy features and reducing dimensionality in microarray analysis. In particular, genes that differ between normal and abnormal individuals are extracted into decision rules to clarify the relationships between genes and diseases, and the within-group and between-group relationships among genes need to be established. We validate the proposed approach on a cDNA microarray data set from breast cancer patients. We first extract significant genes from more than 44,000 genes, then use a decision tree to derive classification rules, and finally use the proposed algorithm to build cluster relationships, presented as a table, to support medical diagnosis. The results show the practical viability of the framework.
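    The record does not describe the proposed clustering algorithm itself, so the following sketch substitutes a generic technique: average-linkage agglomerative clustering with 1 - Pearson correlation as the distance between gene expression profiles. The two expression patterns, the noise level, the gene names, and the choice of two clusters are all assumptions made for illustration.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def average_linkage(profiles, k):
    """Repeatedly merge the two clusters with the smallest mean pairwise
    distance (1 - r) until only k clusters remain."""
    n = len(profiles)
    dist = [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Synthetic data (hypothetical): 12 genes measured on 20 samples.
# Genes 0..5 follow pattern_a and genes 6..11 follow pattern_b, plus noise.
random.seed(1)
pattern_a = [1.0 if s < 10 else -1.0 for s in range(20)]
pattern_b = [1.0 if s % 2 == 0 else -1.0 for s in range(20)]
profiles = []
for g in range(12):
    base = pattern_a if g < 6 else pattern_b
    profiles.append([p + random.gauss(0.0, 0.3) for p in base])

clusters = average_linkage(profiles, k=2)

# Present the result as a simple table, one row per cluster.
print("cluster | genes")
for cid, members in enumerate(clusters, start=1):
    print(f"{cid:7d} | " + ", ".join(f"gene_{g}" for g in members))
```

    Correlation-based distance groups genes by the shape of their expression pattern across samples rather than by absolute expression level, which is why it is a common default for microarray data.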

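    Chapter 1 Introduction
        1.1 Research Background, Motivation, and Significance
        1.2 Research Objectives
        1.3 Thesis Organization
    Chapter 2 Theoretical Foundations
        2.1 Bio-chips
            2.1.1 Detection Chips
            2.1.2 Processing Chips
        2.2 Biological Information
        2.3 Bio-chip Data Analysis
        2.4 Review of Breast Cancer Microarray Analysis Research
        2.5 Knowledge Discovery and Data Mining
            2.5.1 Decision Tree Analysis
            2.5.2 Cluster Analysis
        2.6 Statistical Analysis
            2.6.1 Correlation Analysis
            2.6.2 Principal Component Analysis and Factor Analysis
    Chapter 3 Bio-chip Data Mining Model
        3.1 Problem Definition and Framework
        3.2 Data Preprocessing
        3.3 Bio-chip Data Mining Analysis Model
            3.3.1 Gene Selection and Rule Construction Model
            3.3.2 Gene Clustering Model
        3.4 Model Validation
        3.5 Summary of Results and Decision Recommendations
    Chapter 4 Case Study
        4.1 Problem Definition and Framework
            4.1.1 Case Overview
            4.1.2 Data Description
        4.2 Data Preprocessing
        4.3 Bio-chip Data Mining Model
            4.3.1 Gene Selection and Rule Construction Model
            4.3.2 Gene Clustering Model
        4.4 Model Validation
        4.5 Summary of Results and Decision Recommendations
    Chapter 5 Conclusion
    References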

