
Author: Lin, Chien-Tong (林建同)
Thesis title: A greedy-type variable selection on high-dimensional Cox model
(高維度倖存資料下之模式選取; "Model selection for high-dimensional survival data")
Advisors: Cheng, Yu-Jen (鄭又仁); Ing, Ching-Kang (銀慶剛)
Oral defense committee: Huang, Wen-Han (黃文瀚); Huang, Hsin-Cheng (黃信誠); Yu, Shu-Hui (俞淑惠)
Degree: Doctor (博士)
Department: College of Science, Institute of Statistics
Year of publication: 2019
Graduation academic year: 107 (2018-2019)
Language: English
Number of pages: 52
Chinese keywords: survival analysis; model selection; high-dimensional data; Chebyshev greedy algorithm
Foreign keywords: Cox model screening; sure screening; variable selection consistency; sure independence screening; conditional sure independence screening
  • Abstract (Chinese, translated): Under a high-dimensional Cox model, we consider the Chebyshev Greedy Algorithm (CGA) for variable screening. By selecting variables sequentially, we establish the uniform convergence of CGA over the entire selection path and show that, once the CGA iterations stop, the procedure possesses the sure screening property. In addition, we develop a high-dimensional information criterion (HDIC) and design a three-step variable selection procedure based on it, which we prove to be variable selection consistent. Finally, we apply the proposed model selection method to the diffuse large B-cell lymphoma (DLBCL) experimental data of Rosenwald et al. (2002).


    Motivated by conditional sure independence screening [Barut et al. (2016), CSIS], we consider a greedy-type method, namely, the Chebyshev Greedy Algorithm [Temlyakov (2015), CGA]. Unlike CSIS, which improves screening performance by conditioning on prior knowledge, CGA sequentially constructs the conditioning set as new variables are included. In this dissertation, we propose using CGA as a variable screening tool for high-dimensional survival data and study its convergence rate. We show that the sure screening property can be achieved by CGA with a theoretically justified stopping rule, and we propose a greedier variant of CGA (gCGA) to enhance its finite-sample performance in terms of variable screening. We develop a three-step variable selection procedure, an ensemble of CGA and a high-dimensional information criterion (HDIC), and give conditions under which variable selection consistency can be achieved by HDIC. The utility of the proposed method is examined through extensive simulation studies and the analysis of a diffuse large B-cell lymphoma (DLBCL) dataset.
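The screening step described above can be sketched numerically. The following is a minimal illustration, not the dissertation's implementation: at each step it picks the covariate whose coordinate of the Cox partial-likelihood gradient is largest in absolute value (the Chebyshev, i.e., sup-norm, criterion) and then refits on the active set by plain gradient descent. It assumes no censoring and no tied event times, with rows of the design matrix pre-sorted by increasing event time; the function names `cox_grad` and `cga_screen` are our own.

```python
import numpy as np

def cox_grad(beta, X):
    """Gradient of the average negative Cox log partial likelihood.

    Assumes every subject experiences an event (no censoring), no tied
    event times, and rows of X sorted by increasing event time, so the
    risk set of subject i is {i, i+1, ..., n-1}.
    """
    n = X.shape[0]
    w = np.exp(X @ beta)
    S0 = np.cumsum(w[::-1])[::-1]                         # risk-set sums of w
    S1 = np.cumsum((w[:, None] * X)[::-1], axis=0)[::-1]  # risk-set sums of w * x
    return -(X - S1 / S0[:, None]).sum(axis=0) / n

def cga_screen(X, k_max, n_steps=300, lr=0.2):
    """CGA-type screening: greedily add k_max variables to the active set."""
    n, p = X.shape
    beta = np.zeros(p)
    active = []
    for _ in range(k_max):
        g = cox_grad(beta, X)
        if active:
            g[active] = 0.0                    # only candidates outside the set
        active.append(int(np.argmax(np.abs(g))))   # Chebyshev (sup-norm) pick
        for _ in range(n_steps):               # crude refit on the active set
            beta[active] -= lr * cox_grad(beta, X)[active]
    return active

# Toy example: covariates 0 and 1 drive the hazard; screening should find them.
rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -2.0]
T = rng.exponential(size=n) / np.exp(X @ beta_true)  # exponential-hazard model
selected = cga_screen(X[np.argsort(T)], k_max=4)
print(selected)
```

With strong signals as in the toy example, the two active covariates dominate the gradient at every step, so they appear early in `selected`; the refit between picks is what distinguishes this greedy scheme from marginal (one-pass) screening.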

    1 Introduction
    2 Literature Review
      2.1 Sure Independence Screening
        2.1.1 Conditional sure independence screening
        2.1.2 Iterative sure independence screening
      2.2 Greedy algorithms
        2.2.1 Incremental forward stagewise regression
        2.2.2 Chebyshev's greedy algorithm and forward regression
    3 Chebyshev Greedy Algorithm on the Cox Model
      3.1 Notation
      3.2 Component-wise Gradient Boosting and the Chebyshev Greedy Algorithm
    4 Uniform Convergence Rates for CGA
      4.1 Population version
      4.2 Sample version
    5 Theoretical Properties of the Three-Step Procedure
      5.1 Sure screening property
      5.2 Variable selection consistency
    6 Simulations
    7 Data Analysis
    8 Conclusions and Future Works
    A Additional simulation results
    B Additional lemmas

    Andersen, P. K., & Gill, R. D. (1982). Cox’s regression model for counting processes: a large sample study. The Annals of Statistics, 10 , 1100–1120.

    Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.

    Barut, E., Fan, J., & Verhasselt, A. (2016). Conditional sure independence screening. Journal of the American Statistical Association, 111 (515), 1266–1277.

    Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5 (1), 232–253.

    Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics, 34 (2), 559–583.

    Bühlmann, P., & Yu, B. (2003). Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association, 98 (462), 324–339.

    Chen, J., & Chen, Z. (2012). Extended BIC for small-n-large-p sparse GLM. Statistica Sinica, 22 (2), 555–574.

    Donoho, D. L., Tsaig, Y., Drori, I., & Starck, J.-L. (2012). Sparse solution of under-determined systems of linear equations by stagewise orthogonal matching pursuit. IEEE Transactions on Information Theory, 58 (2), 1094–1121.

    Fan, J., Feng, Y., & Wu, Y. (2010). High-dimensional variable selection for Cox's proportional hazards model. In Borrowing strength: Theory powering applications, a festschrift for Lawrence D. Brown (pp. 70–86). Institute of Mathematical Statistics.

    Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96 (456), 1348–1360.

    Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (5), 849–911.

    Fan, J., Samworth, R., & Wu, Y. (2009). Ultrahigh dimensional variable selection: beyond the linear model. Journal of Machine Learning Research, 10 (1), 32.

    Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38 (6), 3567–3604.

    Foster, D. P., & George, E. I. (1994). The Risk Inflation Criterion for Multiple Regression. The Annals of Statistics, 22 (4), 1947–1975.

    Gorst-Rasmussen, A., & Scheike, T. (2013). Independent screening for single-index hazard rate models with ultrahigh dimensional features. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75 (2), 217–245.

    Gui, J., & Li, H. (2005). Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics, 21 (13), 3001–3008.

    Hastie, T., Taylor, J., Tibshirani, R., & Walther, G. (2007). Forward stagewise regression and the monotone lasso. Electronic Journal of Statistics, 1, 1–29.

    He, K., Li, Y., Zhu, J., Liu, H., Lee, J. E., Amos, C. I., . . . Wei, Q. (2016). Component-wise gradient boosting and false discovery control in survival analysis with high-dimensional covariates. Bioinformatics, 32 (1), 50–57.

    Hong, H. G., Kang, J., & Li, Y. (2016). Conditional screening for ultra-high dimensional covariates with survival outcomes. Lifetime Data Analysis, 24 (1), 45–71.

    Huang, J., Sun, T., Ying, Z., Yu, Y., & Zhang, C.-H. (2013). Oracle inequalities for the lasso in the Cox model. The Annals of Statistics, 41 (3), 1142–1165.

    Ing, C.-K., & Lai, T. L. (2011). A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Statistica Sinica, 21 , 1473–1513.

    Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Berlin, Heidelberg: Springer Berlin Heidelberg.

    Kong, S., & Nan, B. (2014). Non-asymptotic oracle inequalities for the high-dimensional Cox regression via lasso. Statistica Sinica, 24 (1), 25–42.

    Luo, S., Xu, J., & Chen, Z. (2015). Extended Bayesian information criterion in the Cox model with a high-dimensional feature space. Annals of the Institute of Statistical Mathematics, 67 (2), 287–311.

    Mallat, S., & Zhang, Z. (1993). Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41 , 3397–3415.

    Massart, P. (2000a). About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability, 28 (2), 863–884.

    Massart, P. (2000b). Some applications of concentration inequalities to statistics. Annales de la faculté des sciences de Toulouse: Mathématiques, 9 (2), 245–303.

    Meir, R., & Zhang, T. (2003). Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4, 839–860.

    Negahban, S. N., Ravikumar, P., Wainwright, M. J., & Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27 (4), 538–557.

    Rosenwald, A., et al. (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New England Journal of Medicine, 346 (25), 1937–1947.

    Song, R., Lu, W., Ma, S., & Jessie Jeng, X. (2014). Censored rank independence screening for high-dimensional survival data. Biometrika, 101 (4), 799–814.

    Temlyakov, V. N. (2000). Weak greedy algorithms. Advances in Computational Mathematics, 12 (2-3), 213–227.

    Temlyakov, V. N. (2015). Greedy approximation in convex optimization. Constructive Approximation, 41 (2), 269–296.

    Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58 (1), 267–288.

    Tibshirani, R. (1997). The lasso method for variable selection in the cox model. Statistics in Medicine, 16 (4), 385–395.

    Tropp, J. A., & Gilbert, A. C. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53 (12), 4655–4666.

    Van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes: With applications to statistics. New York: Springer.

    Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104 (488), 1512–1524.

    Wang, X., Nan, B., Zhu, J., & Koeppe, R. (2014). Regularized 3D functional regression for brain image data via Haar wavelets. The Annals of Applied Statistics, 8 (2), 1045–1064.

    Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38 (2), 894–942.

    Zhang, T. (2011). Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57 (9), 6215–6221.

    Zhang, T., & Yu, B. (2005). Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33 , 1538–1579.

    Zhao, S. D., & Li, Y. (2012). Principled sure independence screening for cox models with ultra-high-dimensional covariates. Journal of Multivariate Analysis, 105 (1), 397–411.
