
Graduate Student: Chun-Yi Lian (連郡儀)
Thesis Title: Some Advancement in Model Selection Methods And Its Application to a Genetic Epidemiology Study (選模方法之最近發展及其在遺傳流病研究之應用)
Advisors: Chao Hsiung (熊昭); Wen-Ping Hsieh (謝文萍)
Oral Defense Committee: (not listed)
Degree: Master
Department: College of Science, Institute of Statistics
Year of Publication: 2008
Academic Year of Graduation: 96 (ROC calendar)
Language: English
Number of Pages: 33
Chinese Keywords: model selection (選模)
English Keywords: model selection, GEE, QIC, penalized likelihood
  • Model selection is an important topic in data analysis: a well-chosen model predicts well. This thesis introduces three tools for model selection: QIC (Quasi-likelihood under the Independence model Criterion), the L1-regularization path algorithm for generalized linear models, and L2-penalized logistic regression with stepwise variable selection. QIC can be used for correlated data such as family data, while the L1-regularization path and L2-penalized algorithms suit high-dimensional data such as microarray data; the L2-penalized algorithm is particularly useful when gene interactions are of interest. Our data, from the SAPPHIRe (Stanford Asian Pacific Program for Hypertension and Insulin Resistance) project, are family data and hence correlated. We apply all three methods to this data set, compare the models they select, and evaluate their predictive performance.
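To make the penalty idea concrete, here is a minimal sketch of L2-penalized (ridge) logistic regression fit by plain gradient descent. This is an illustration of the shrinkage mechanism only, not the thesis's actual stepwise procedure or the SAPPHIRe analysis; the toy data, learning rate, and lambda values are invented for the example.

```python
# Minimal sketch: L2-penalized (ridge) logistic regression via gradient descent.
# Hypothetical toy data; not the thesis's algorithm or data set.
import math
import random

def fit_ridge_logistic(X, y, lam, lr=0.1, iters=2000):
    """Minimize -(1/n)*loglik + (lam/2)*||beta||^2 (intercept unpenalized)."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            eta = beta0 + sum(b * x for b, x in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))  # fitted probability
            r = mu - yi                        # residual on probability scale
            g0 += r / n
            for j in range(p):
                g[j] += r * xi[j] / n
        beta0 -= lr * g0
        for j in range(p):
            beta[j] -= lr * (g[j] + lam * beta[j])  # ridge shrinks each slope
    return beta0, beta

random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [1 if (2 * x1 - x2 + random.gauss(0, 1)) > 0 else 0 for x1, x2 in X]

_, b_small = fit_ridge_logistic(X, y, lam=0.01)
_, b_large = fit_ridge_logistic(X, y, lam=10.0)
# Larger lambda gives smaller coefficients (stronger shrinkage).
print(sum(b * b for b in b_small), sum(b * b for b in b_large))
```

Increasing the penalty parameter shrinks the slope coefficients toward zero, the mechanism both penalized methods above rely on; the L1 variant replaces the squared penalty with an absolute-value penalty, which can set coefficients exactly to zero and thereby performs variable selection along the regularization path.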


    1. Introduction — 1
    2. Material and Methods — 3
       2.1 QIC — 3
           2.1.1 Kullback-Leibler information — 4
           2.1.2 Akaike's information criterion (AIC) — 4
           2.1.3 QIC — 6
       2.2 L1-regularization path algorithm for generalized linear models — 11
           2.2.1 Predictor-corrector algorithm — 13
       2.3 L2-penalized logistic regression with a stepwise variable selection — 18
           2.3.1 Penalized logistic regression — 19
           2.3.2 Forward stepwise procedure — 20
    3. Real Data — 21
       3.1 SAPPHIRe dataset — 22
       3.2 Model selection — 22
       3.3 Prediction — 26
    4. Conclusion and Discussion — 28
    5. References — 31

    Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the Second International Symposium on Information Theory, B. N. Petrov and F. Csaki (eds), 267-281. Budapest: Akademiai Kiado.
    Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist., 32, 407-499.
    Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics 19, 1-67.
    Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C. and Lander, E. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.
    Cui, J. and Qian, G. (2007). Selection of working correlation structure and best model in GEE analyses of longitudinal data. Communications in Statistics - Simulation and Computation, 36, 987-997.
    Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.
    Cai, J., Fan, J., Li, R. and Zhou, H. (2005). Variable selection for multivariate failure time data. Biometrika, 92, 303-316.
    Klein, R., Klein, B. E. K., Moss, S. E., Davis, M. D., and DeMets, D. L. (1984). The Wisconsin Epidemiologic Study of Diabetic Retinopathy: II. Prevalence and risk of diabetic retinopathy when age at diagnosis is less than 30 years. Archives of Ophthalmology, 102, 520-526.
    Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics 22, 79-86.
    Le Cessie, S. and Van Houwelingen, J. (1992). Ridge estimators in logistic regression. Applied Statistics, 41, 191-201.
    Lee, A. and Silvapulle, M. (1988). Ridge estimation in logistic regression. Communications in Statistics - Simulation and Computation, 17, 1231-1257.
    Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.
    Linhart, H. and Zucchini, W. (1986). Model Selection. New York: Wiley.
    Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661-675.
    Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized linear models. J. R. Statist. Soc. B, 69, 650-677.
    Park, M. Y. and Hastie, T. (2008). Penalized logistic regression for detecting gene interactions. Biostatistics, 9, 30-50.
    Rosset, S. (2004). Tracking curved regularized optimization solution paths. In Neural Information Processing Systems. Cambridge: MIT press.
    Rosset, S., Zhu, J. and Hastie, T. (2004). Boosting as a regularized path to a maximum margin classifier. J. Mach. Learn. Res., 5, 941-973.
    Schwarz, G., (1978). Estimating the dimension of a model. Annals of Statistics 6(2):461-464.
    Tibshirani, R. J. (1997). The LASSO Method for Variable Selection in the Cox Model. Statistics in Medicine, 16, 385-395.
    Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61, 439-447.
    Pan, W. (2001). Akaike's information criterion in generalized estimating equations. Biometrics, 57, 120-125.
    Wada, Y. and Kashiwagi, N. (1990). Selecting statistical models with information statistics. Journal of Dairy Science, 73, 3575-3582.
    Zou, H. and Hastie, T. (2004). On the ‘degrees of freedom’ of the lasso. Technical Report. Stanford University, Stanford.

    Full-Text Availability: not authorized for public release (campus and off-campus networks)