
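Graduate student: Hsu, Ting-Chieh (徐廷倢)
Thesis title: Improved Affinity Propagation by Spline Interpolation on Time-Series Gene Expression Clustering
Chinese title: 利用Spline內插技術改進時序基因資料的AP分群方法研究
Advisor: Wang, Jia-Shung (王家祥)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: English
Number of pages: 37
Keywords (translated from Chinese): Spline interpolation, Affinity Propagation algorithm, gene expression time series
Keywords (English): Spline Interpolation, Affinity Propagation, Time-Series Gene Expression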
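  • In recent years, breakthroughs in DNA microarray technology have allowed biologists to observe the expression of large numbers of genes simultaneously in biological experiments. Extracting meaningful expression patterns from time-series data has therefore become a key step in understanding biological systems. Automatically grouping genes of known and unknown function with clustering techniques may make it possible to infer the functions of the unknown genes. However, because biological experiments involve considerable uncertainty and noise, analyzing time-series gene expression data is not easy. Early clustering algorithms such as k-means, self-organizing maps, and hierarchical clustering ignore the strong correlation between consecutive time points in a time series. By contrast, algorithms based on probabilistic models, such as dynamic Bayesian networks (DBN) and hidden Markov models (HMM), are better suited to time series but are computationally inefficient. In addition, the long sampling intervals used in biological experiments can lead to undersampling. In this thesis, we propose an unsupervised clustering algorithm that combines Spline interpolation with Affinity Propagation (AP). The proposed method examines the expression relationships among genes within each time interval and reduces the influence of noise and outliers. In an analysis of real yeast time-series gene expression datasets, our method achieves significant clustering accuracy without requiring prior knowledge of the number of clusters or the cluster centers. It offers a direction for future work on clustering gene expression time-series data.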


    DNA microarray technology has been widely used in life science research for many years. The technology allows scientists to monitor the expression levels of many genes simultaneously during biological processes. Analyzing massive time-series data is important for exploring the complex dynamics of biological systems. However, analyzing time-series gene expression data is difficult because noise levels and measurement uncertainties are high. Early clustering methods such as k-means, self-organizing maps, and hierarchical clustering disregard the temporal dependency between successive time points. Probabilistic model-based methods, such as dynamic Bayesian networks (DBN) and hidden Markov models (HMM), are more suitable for time series but suffer from computational inefficiency. In addition, real gene datasets have an undersampling problem because of the long intervals between the time points at which expression data are harvested. In this thesis, an unsupervised clustering algorithm that combines Spline interpolation and Affinity Propagation is proposed. The proposed method first uses interpolation to reduce the influence of undersampling and then, through interval selection, investigates the relationships between genes across distinct time points. We demonstrate that our method achieves significant accuracy on real gene expression time-series datasets without a priori knowledge such as the number of clusters or the exemplars. Our study provides a way of clustering gene expression time-series data for future biological investigations.
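    The two building blocks named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's algorithm: it omits the interval selection, the gene-relativity graph, and the graph partitioning steps listed in the table of contents. The function names, the toy profiles, the sampling times, the use of a natural cubic spline, the negative squared Euclidean distance as similarity, and the median similarity as the preference value are all assumptions made for this example.

```python
import bisect
from statistics import median


def natural_cubic_spline(xs, ys):
    """Return f(x), the natural cubic spline through the knots (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # second derivatives at the knots, zero at both ends
    if n >= 2:
        # Tridiagonal system for m[1..n-1], solved with the Thomas algorithm.
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        for k in range(1, n - 1):
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        m[n - 1] = rhs[n - 2] / diag[n - 2]
        for k in range(n - 3, -1, -1):
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def f(x):
        # Locate the knot interval containing x, clamped to the valid range.
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return f


def affinity_propagation(S, damping=0.5, iterations=200):
    """Message passing on similarity matrix S; S[k][k] holds the preferences."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        for i in range(n):
            v = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                rival = max(v[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            others = sum(pos) - pos[k]  # positive support from points other than k
            for i in range(n):
                new = others if i == k else min(0.0, R[k][k] + others - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], [-1] * n  # no exemplar emerged; the preferences need adjusting
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


def cluster_profiles(times, profiles, grid):
    """Resample each profile on a denser grid, then cluster the dense profiles."""
    dense = []
    for p in profiles:
        f = natural_cubic_spline(times, p)
        dense.append([f(t) for t in grid])
    n = len(dense)
    # Assumed similarity: negative squared Euclidean distance.
    S = [[-sum((u - v) ** 2 for u, v in zip(dense[i], dense[k])) for k in range(n)]
         for i in range(n)]
    pref = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        S[i][i] = pref  # assumed preference: median of the pairwise similarities
    return affinity_propagation(S)


# Hypothetical toy data: six genes sampled at uneven times, three rising, three falling.
times = [0, 2, 4, 8, 12]
rising = [0.0, 1.0, 2.0, 3.0, 4.0]
falling = [4.0, 3.0, 2.0, 1.0, 0.0]
profiles = ([[v + off for v in rising] for off in (0.0, 0.5, 1.0)]
            + [[v + off for v in falling] for off in (0.0, 0.5, 1.0)])
grid = list(range(13))  # denser, evenly spaced resampling points
exemplars, labels = cluster_profiles(times, profiles, grid)
print(exemplars, labels)
```

    Neither the number of clusters nor the exemplars are passed in; both emerge from the message passing, which is the property the abstract highlights. The preference value steers how many exemplars appear, so a different choice than the median may yield a different number of clusters.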

    List of Figures
    List of Tables
    1 Introduction
    2 Related Work
      2.1 Spline Interpolation
      2.2 Affinity Propagation Clustering Algorithm
      2.3 Consensus Clustering
    3 The Proposed Algorithms
      3.1 Spline Interpolation
      3.2 Affinity Propagation Clustering with Interval Selection
      3.3 Gene-Relativity Graph Construction
      3.4 Graph Partitioning for Class Discovery
    4 Experimental Results
      4.1 Measure of Agreement
        4.1.1 The Adjusted Rand index
        4.1.2 Silhouette index
      4.2 Time-series datasets
        4.2.1 The Yeast galactose dataset
        4.2.2 The Yeast cell-cycle dataset
        4.2.3 The Yeast sporulation dataset
      4.3 Parameters Setting
      4.4 Results and Discussions
        4.4.1 The Yeast galactose dataset
        4.4.2 The Yeast cell-cycle dataset
        4.4.3 The Yeast sporulation dataset


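    Full-text release date: full text not authorized for public release (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)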
