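Graduate student: | 邱泰諭 Chiu, Tai-Yu |
---|---|
Thesis title: | 運用於基因表現時間序列的親和性互動式分群演算法 Affinity Propagation Based Consensus Clustering for Time-Series Gene Expression |
Advisor: | 王家祥 Wang, Jia-Shung |
Oral defense committee: | |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2009 |
Academic year of graduation: | 97 |
Language: | English |
Number of pages: | 46 |
Chinese keywords: | 基因表現、時間序列、親和性傳遞分群、分群一致性 (gene expression, time series, affinity propagation clustering, consensus clustering) |
English keywords: | gene expression, time-series, affinity propagation, consensus cluster |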
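In recent years, DNA microarray technology has played an important role in molecular biology research. As experiments on biological behavior multiply, extracting meaningful patterns of change from time series has become a key step in exploring the complex dynamics of biological systems. Because of the noise and uncertainty introduced during experiments, analyzing time series is far more difficult than analyzing ordinary data. Early clustering algorithms such as k-means, Self-organizing Maps, and hierarchical clustering ignore the high correlation between consecutive time points in a time series. By comparison, algorithms based on probabilistic models, such as dynamic Bayesian networks (DBN) and hidden Markov models (HMM), are better suited to time series but lack efficiency. In this thesis, we propose an unsupervised clustering algorithm that combines Affinity Propagation with the spirit of consensus clustering. Through the selection of time intervals, the proposed method examines the relationships between genes at different time points and mitigates the influence of noise and outliers. Evaluated on synthetic and real time-series gene expression datasets, our method shows significant clustering accuracy without requiring the number of clusters or the cluster centers to be known in advance. We also demonstrate the biological relevance of our results using the Gene Ontology database and the results of previous studies. Our research offers possible future directions for clustering gene expression time series.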
In recent years, DNA microarray technology has played a key role in molecular biology research. As the number of experiments that follow biological processes over time grows, analyzing statistical patterns in time-series data has become a crucial step in exploring the complex dynamics of biological systems. Because of noise and measurement uncertainty, analyzing time series is more complicated than analyzing ordinary data. Early clustering methods such as k-means, Self-organizing Maps, and hierarchical clustering neglect the temporal dependence between successive time points. Probabilistic model-based methods such as dynamic Bayesian networks (DBN) and hidden Markov models (HMM) are better suited to time series but are computationally inefficient. In this thesis, we propose an unsupervised clustering algorithm that combines a recently proposed clustering scheme, Affinity Propagation, with the spirit of consensus clustering over multiple clustering partitions. The proposed method investigates the relationships between genes across distinct time points by selecting intervals of time points, and mitigates the influence of noise and outliers. It produces a clustering result without a priori knowledge of the number of clusters or the exemplars, and demonstrates significant clustering accuracy on synthetic and real gene expression time-series datasets. In addition, the biological relevance of the clustering results is analyzed using Gene Ontology annotations and compared with earlier work. Our study suggests possible directions for clustering gene expression time-series data in future biological investigations.
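The abstract does not spell out how the multiple clustering partitions are merged, so the following is only a minimal sketch of one common way to form a consensus: a co-association matrix that records how often two genes land in the same cluster. The function names `co_association` and `consensus_groups`, the 0.5 threshold, and the three example partitions (imagined as results from three different time intervals) are all assumptions made for illustration. The final grouping step uses thresholded connected components as a simple stand-in, not the Affinity Propagation step the thesis describes.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of genes shares a cluster."""
    if not partitions:
        raise ValueError("at least one partition is required")
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("every partition must label the same genes")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_groups(matrix, threshold=0.5):
    """Stand-in consensus step: connected components of the graph that links
    genes whose co-association is strictly above `threshold`."""
    n = len(matrix)
    group = [None] * n
    next_id = 0
    for start in range(n):
        if group[start] is not None:
            continue
        group[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if group[j] is None and matrix[i][j] > threshold:
                    group[j] = next_id
                    stack.append(j)
        next_id += 1
    return group


# Hypothetical cluster labels for five genes, one list per time interval.
partitions = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 2],
]
matrix = co_association(partitions)
print(consensus_groups(matrix))  # -> [0, 0, 1, 1, 1]
```

Cluster label values need not agree across partitions, because only co-membership is counted. In a pipeline closer to the one outlined in the abstract, the co-association matrix would presumably serve as the similarity input to a final clustering pass instead of being thresholded.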
Androulakis, I. P., Yang, E., and Almon, R. R. (2007). Analysis of time-series gene expression data: Methods, challenges, and opportunities. Annual Review of Biomedical Engineering, 9:205-228.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25-29.
Bandyopadhyay, S., Mukhopadhyay, A., and Maulik, U. (2007). An improved algorithm for clustering gene expression data. Bioinformatics, 23(21):2859-2865.
Bar-Joseph, Z. (2004). Analyzing time series gene expression data. Bioinformatics, 20(16):2493-2503.
Bar-Joseph, Z., Gerber, G., Gifford, D. K., and Jaakkola, T. S. (2002). A new approach to analyzing gene expression time series data. In Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, pages 39-48.
Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J., and Davis, R. W. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell, 2(1):65-73.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O., and Herskowitz, I. (1998). The transcriptional program of sporulation in budding yeast. Science, 282(5389):699-705.
Ernst, J., Nau, G. J., and Bar-Joseph, Z. (2005). Clustering short time series gene expression data. Bioinformatics, 21(Suppl 1):i159-i168.
Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between data points. Science, 315(5814):972-976.
Grotkjaer, T., Winther, O., Regenberg, B., Nielsen, J., and Hansen, L. K. (2006). Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm. Bioinformatics, 22(1):58-67.
Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193-218.
Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., Goodlett, D. R., Aebersold, R., and Hood, L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292(5518):929-934.
Leone, M., Sumedha, and Weigt, M. (2007). Clustering by soft-constraint affinity propagation. Bioinformatics, 23(20):2708-2715.
Li, C.-T., Yuan, Y., and Wilson, R. (2008). An unsupervised conditional random fields approach for clustering gene expression time series. Bioinformatics, 24(21):2467-2473.
Medvedovic, M., Yeung, K., and Bumgarner, R. (2004). Bayesian mixture model based clustering of replicated microarray data. Bioinformatics, 20(8):1222-1232.
Monti, S., Tamayo, P., Mesirov, J., and Golub, T. (2003). Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52(1-2):91-118.
Ng, S. K., McLachlan, G. J., Wang, K., Jones, L. B.-T., and Ng, S.-W. (2006). A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics, 22(14):1745-1752.
Rousseeuw, P. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1):53-65.
Schliep, A., Costa, I. G., Steinhoff, C., and Schonhuth, A. (2005). Analyzing gene expression time-courses. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(3):179-193.
Strehl, A., Ghosh, J., and Cardie, C. (2002). Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583-617.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences, USA, 96(6):2907-2912.
Tjaden, B. (2006). An approach for clustering gene expression data with error information. BMC Bioinformatics, 7:17.
Xu, Y., Olman, V., and Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 18(4):536-545.
Yedidia, J. S., Freeman, W. T., and Weiss, Y. (2005). Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282-2312.
Yeung, K. Y., Medvedovic, M., and Bumgarner, R. E. (2003). Clustering gene-expression data with repeated measurements. Genome Biology, 4:R34.
Yeung, K. Y. and Ruzzo, W. L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics, 17(9):763-774.