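This thesis studies how to cluster the subjects of longitudinal data. We view longitudinal data from the functional data perspective: the data are treated as observations, at discrete time points, of a set of mutually independent time functions or curves. As a result, the methods developed here can handle clustering when the observation times are dense or when they differ across subjects. Moreover, because measurements taken on the same subject at different times are correlated, we attempt to cluster the data according to both the mean trend over time and the covariance structure between distinct time points.

Under the assumption that clusters differ in their mean functions and covariance functions, we build a basic model for the clustered curves based on functional principal components analysis (FPCA) and, from this model, propose a clustering algorithm centered on iterative reallocation, called rFPCA (reallocation based on FPCA). The rFPCA algorithm consists of two steps, initial clustering and iterative reallocation, and its ultimate aim is to separate effectively the curves whose mean functions and covariance functions differ. The initial clustering step uses the distribution of finite-dimensional functional principal component scores to detect a preliminary grouping in the mean structure. In the reallocation step, the curve to be reallocated is fitted separately using the data of each cluster, and its new cluster is the one whose fitted curve has the smallest L2 distance to the observed curve.

We also examine the principles and properties of the rFPCA algorithm theoretically and verify the theoretical results through simulation studies. The results show that, when the mean functions or the covariance functions of the clusters differ, the proposed algorithm achieves the intended clustering in most cases. In particular, when the covariance functions of different clusters differ markedly, the iterative reallocation step substantially improves on the initial clustering. In addition, we propose a two-stage bootstrap test as a practical tool for deciding whether the reallocation step needs to be run on a given data set. Finally, we illustrate the practical use and feasibility of the proposed method with cluster analyses of two data sets: growth curves and gene expression levels measured over time.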
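The two steps described above, initial clustering on principal component scores followed by iterative reallocation by smallest L2 distance, can be illustrated with a minimal Python sketch. It is not the thesis's implementation and rests on several simplifying assumptions: all curves are observed on one common dense grid and stored as the rows of a NumPy array `Y` (the thesis also covers subject-specific observation times, which this sketch does not), there are two clusters, the number of components `n_comp` is fixed in advance, and FPCA is approximated by an SVD of the centered data matrix. The function names are hypothetical, the sign-based initial split is only a crude stand-in for the score-based initial clustering, and a curve is not removed from its current cluster before that cluster's model is fitted.

```python
import numpy as np

def fpca_fit(Y, n_comp):
    """Return the mean curve and the leading discretized eigenfunctions of Y's rows."""
    mu = Y.mean(axis=0)
    # Rows of Vt are the eigenvectors of the sample covariance matrix.
    _, _, Vt = np.linalg.svd(Y - mu, full_matrices=False)
    return mu, Vt[:n_comp]

def fpca_project(Y, mu, phi):
    """Fitted curves: mean plus projection of the centered curves onto phi."""
    return mu + (Y - mu) @ phi.T @ phi

def initial_labels(Y):
    """Crude two-cluster start (assumption): sign of the first pooled score."""
    mu, phi = fpca_fit(Y, 1)
    scores = (Y - mu) @ phi[0]
    return (scores > 0).astype(int)

def reallocate(Y, labels, n_comp=2, max_iter=20):
    """Iteratively move each curve to the cluster whose FPCA fit is closest in L2."""
    labels = labels.copy()
    K = int(labels.max()) + 1
    for _ in range(max_iter):
        # Stop if some cluster has too few curves to estimate n_comp components.
        if min(np.sum(labels == k) for k in range(K)) <= n_comp:
            break
        models = [fpca_fit(Y[labels == k], n_comp) for k in range(K)]
        # Squared L2 distance (up to the constant grid spacing) between each
        # observed curve and its fitted curve under each cluster's model.
        dist = np.stack(
            [((Y - fpca_project(Y, mu, phi)) ** 2).sum(axis=1) for mu, phi in models],
            axis=1,
        )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no curve changed cluster
        labels = new_labels
    return labels
```

A typical call would be `reallocate(Y, initial_labels(Y), n_comp=1)`. Because each cluster carries its own mean and its own eigenfunctions, the distance comparison can react to differences in covariance structure as well as in mean, which is the motivation the abstract gives for the reallocation step.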