
Graduate student: Ku, Kuan-Ming (辜冠銘)
Thesis title: Polygenic Risk Score Analysis with Elastic Net (應用Elastic Net於多基因風險評分分析)
Advisor: Hsieh, Wen-Ping (謝文萍)
Oral defense committee: Chang, Sheng-Mao (張升懋); Chung, Ren-Hua (鍾仁華)
Degree: Master
Department: College of Science - Institute of Statistics
Year of publication: 2020
Academic year of graduation: 108
Language: English
Number of pages: 23
Keywords: Polygenic Risk Score (多基因風險評分), Linkage Disequilibrium (連鎖不平衡), Elastic Net (彈性網)
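  • Polygenic risk scores (PRS) have been applied to predict the risk of some complex diseases. In the traditional approach, a simple linear regression is fitted for each genetic variant, and the regression coefficients are then used as weights to sum the genotypes of these variants into the PRS. A typical problem is linkage disequilibrium (LD): variants that lie close together on a chromosome are collinear. Existing methods such as LDpred, lassosum, and C+T (Clumping + Thresholding) can handle this problem based on the LD structure. However, such methods cause variants that are correlated with each other but far apart to be treated as mutually independent. We therefore propose using the elastic net to address this problem. The elastic net is a regularized multiple linear regression method that combines the least absolute shrinkage and selection operator (lasso) with ridge regression. It can estimate all effects and select important variables simultaneously, and joint modeling can incorporate the information shared among variants.
    We use data from the Taiwan Biobank and demonstrate the advantage of this strategy by comparing the correlation between body mass index (BMI) and the polygenic risk score. According to the results, in some situations, estimating the effects of a large number of variants with the elastic net rather than with simple linear regression yields more accurate polygenic risk scores.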
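    The two weighting schemes described above can be written out as follows. This is a sketch in one common parametrization (the glmnet-style form with a mixing parameter), not necessarily the notation used in the thesis. The symbols $x_{ij}$ (genotype of individual $i$ at variant $j$), $y_i$ (phenotype), $\lambda$ (penalty strength), and $\alpha$ (lasso/ridge mixing weight) are assumptions introduced here for illustration.

```latex
% Traditional PRS: one simple linear regression per variant j
\hat{\beta}_j^{\mathrm{marg}}
  = \arg\min_{b}\, \min_{a} \sum_{i=1}^{n} \left( y_i - a - b\, x_{ij} \right)^2,
  \qquad j = 1, \dots, p

% Score for individual i: weighted sum of genotypes
\mathrm{PRS}_i = \sum_{j=1}^{p} \hat{\beta}_j \, x_{ij}

% Elastic net: all variants are estimated jointly
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\, \beta}\;
    \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
    + \lambda \Bigl[ \alpha \lVert \beta \rVert_1
    + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2 \Bigr],
  \qquad 0 \le \alpha \le 1
```

    Setting $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge regression. The same PRS formula applies in both schemes; only the source of the weights $\hat{\beta}_j$ changes.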


    Polygenic risk scores (PRS) have been applied to predict the risk of some complex diseases. In the standard approach, a simple linear regression is fitted on each genetic variant, and the effect alleles are then aggregated into a score, each weighted by the effect size estimated in its regression. A typical issue in constructing a PRS is linkage disequilibrium among the variants. A number of methods address this problem according to the linkage disequilibrium structure of the chromosomes, such as LDpred, lassosum, and C+T (Clumping + Thresholding). However, variants that carry independent information may be discarded simply because they lie close together, whereas highly correlated SNPs are both retained when they are far apart. Here, we propose constructing the PRS model with the elastic net, a classical penalized regression that combines the advantages of lasso and ridge. The elastic net is a multiple linear regression framework that can estimate effect sizes and select causal variants among all SNPs simultaneously. Instead of modeling one SNP at a time, joint modeling of the effects can accommodate the shared information without over-emphasizing certain groups of SNPs.
    We demonstrate the benefit of the proposed strategy on data from the Taiwan Biobank by comparing the correlation between BMI and the PRS obtained with the elastic net against that obtained with other methods. In our experiments, BMI is predicted more accurately with elastic net estimates than with simple linear regression estimates.
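    To make the comparison in the abstract concrete, the following is a minimal pure-Python sketch, not the thesis code. It computes per-SNP marginal effects and jointly estimated elastic net effects on simulated genotypes, builds a PRS from each, and correlates the score with the phenotype. Everything here is an assumption made for illustration: the function names, the toy simulation (one causal SNP, one SNP in LD with it, three noise SNPs), the values of `lam` and `alpha`, and the glmnet-style form of the penalty.

```python
import random


def soft_threshold(z, gamma):
    """Soft-thresholding operator that produces the lasso-style sparsity."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def standardize(col):
    """Center a column and scale it to unit (population) variance.
    Assumes the column is not constant (no monomorphic SNPs)."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]


def marginal_effects(X_cols, y):
    """One simple linear regression per SNP. With standardized SNPs and a
    centered phenotype, each slope reduces to x_j . y / n."""
    n = len(y)
    return [sum(x * v for x, v in zip(col, y)) / n for col in X_cols]


def elastic_net(X_cols, y, lam, alpha, n_iter=100):
    """Coordinate descent for
    (1/2n)||y - X b||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2),
    assuming standardized columns and a centered y."""
    n = len(y)
    beta = [0.0] * len(X_cols)
    resid = list(y)
    for _ in range(n_iter):
        for j, col in enumerate(X_cols):
            # Covariance of SNP j with the residual that excludes SNP j itself.
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta


def prs(X_cols, weights):
    """Score each individual as the weighted sum of their genotypes."""
    n = len(X_cols[0])
    return [sum(w * col[i] for w, col in zip(weights, X_cols)) for i in range(n)]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


def simulate(n=300, seed=0):
    """Toy data: SNP 0 is causal, SNP 1 tags it (LD), SNPs 2-4 are noise."""
    rng = random.Random(seed)
    draw = lambda: sum(rng.random() < 0.3 for _ in range(2))  # genotype 0/1/2
    snp0 = [draw() for _ in range(n)]
    snp1 = [g if rng.random() < 0.9 else draw() for g in snp0]
    noise = [[draw() for _ in range(n)] for _ in range(3)]
    y = [0.5 * g + rng.gauss(0.0, 1.0) for g in snp0]
    mean_y = sum(y) / n
    X_cols = [standardize(c) for c in [snp0, snp1] + noise]
    return X_cols, [v - mean_y for v in y]


if __name__ == "__main__":
    X_cols, y = simulate()
    for name, w in [("marginal", marginal_effects(X_cols, y)),
                    ("elastic net", elastic_net(X_cols, y, lam=0.05, alpha=0.5))]:
        print(name, [round(b, 3) for b in w], round(pearson(prs(X_cols, w), y), 3))
```

    The correlation printed here is computed on the same individuals used to estimate the weights, so it is optimistic. A fair comparison would estimate the weights on one subset and score a held-out subset, and would choose `lam` and `alpha` by cross-validation.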

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Contents
    Introduction
    Method
      Imputation – LD-kNNi
      Overview of standard methods
        Effect size estimated in GWAS
        Unadjusted PRS
        LDpred
        lassosum
        C+T
      PRS based on Elastic Net
        Elastic Net
    Result
      Materials and Data preprocessing
        Quality control
        Variant filtering
        Imputation
      Model performance
      Overfitting in constructing PRS model
    Discussion
    Reference
    Supplement

    1. Vilhjálmsson, B.J., et al., Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet, 2015. 97(4): p. 576-592.
    2. Mak, T.S.H., et al., Polygenic scores via penalized regression on summary statistics. Genet Epidemiol, 2017. 41(6): p. 469-480.
    3. Choi, S.W. and P.F. O'Reilly, PRSice-2: Polygenic Risk Score software for biobank-scale data. GigaScience, 2019. 8(7).
    4. Hoerl, A.E. and R.W. Kennard, Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 1970. 12(1): p. 55-67.
    5. Tibshirani, R., Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1996. 58(1): p. 267-288.
    6. Zou, H. and T. Hastie, Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 2005. 67(2): p. 301-320.
    7. Money, D., et al., LinkImpute: Fast and Accurate Genotype Imputation for Nonmodel Organisms. G3 (Bethesda), 2015. 5(11): p. 2383-2390.
    8. Browning, S.R. and B.L. Browning, Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet, 2007. 81(5): p. 1084-1097.
    9. Scheet, P. and M. Stephens, A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet, 2006. 78(4): p. 629-644.
    10. Privé, F., et al., Making the Most of Clumping and Thresholding for Polygenic Scores. Am J Hum Genet, 2019. 105(6): p. 1213-1221.
