Graduate student: | 李少芃 LI, SHAO-PENG |
---|---|
Thesis title: | 計數型數據的類別變數選取與水準合併分析 Categorical Variable Selection and Level Clustering in Count Data |
Advisor: | 徐南蓉 Hsu, Nan-Jung |
Oral defense committee: | 汪上曉 Wong, Shang-Hsiao; 曾勝滄 Tseng, Sheng-Tsiang |
Degree: | Master |
Department: | College of Science - Institute of Statistics |
Year of publication: | 2017 |
Academic year of graduation: | 105 |
Language: | Chinese |
Number of pages: | 55 |
Chinese keywords: | count model, categorical variable selection, level merging |
Foreign-language keywords: | Count regression, Group Lasso, CAS-ANOVA |
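This thesis studies the problem of selecting the best combination of tools in a manufacturing process (the golden path). The traditional approach first identifies the important factors affecting process yield and then, based on their estimated effects, finds the tool combination that maximizes yield. This approach produces a single best tool combination, but in production practice the more pressing question is often whether tools or production paths differ substantively at all. If tools with similar performance can be grouped at the same time as the differences among tool effects are inferred, a more flexible golden-path strategy becomes available.

Toward this goal, this thesis proposes a two-stage parameter estimation method for count data in which all explanatory variables are categorical. The first stage focuses on screening out the important factors. The second stage merges the categories (levels) that have similar effects within each important factor. Statistical inference in both stages uses a regularized likelihood approach.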
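As a reading aid, the two stages can be written as two penalized-likelihood problems. The English abstract names the group lasso and the fused lasso as the penalties; everything else in this sketch is an assumption made for illustration and is not taken from the thesis: the symbols, the factor-size scaling \sqrt{p_j}, the weights w_{j,kl}, and the choice of penalizing all pairwise level differences (as in CAS-ANOVA-type penalties for unordered levels). Here \ell(\beta) is the log-likelihood of the count model, \beta_j collects the level effects of factor j, and p_j is the number of levels of factor j.

```latex
% Stage 1: factor selection (group lasso penalty on each factor's level effects)
\hat{\beta}^{(1)} = \arg\min_{\beta}\; -\ell(\beta)
  + \lambda_1 \sum_{j=1}^{J} \sqrt{p_j}\, \lVert \beta_j \rVert_2

% Stage 2: level merging within the selected factors
% S = \{ j : \hat{\beta}^{(1)}_j \neq 0 \}, with \beta restricted to the factors in S
\hat{\beta}^{(2)} = \arg\min_{\beta}\; -\ell(\beta)
  + \lambda_2 \sum_{j \in S} \sum_{1 \le k < l \le p_j} w_{j,kl}\, \lvert \beta_{jk} - \beta_{jl} \rvert
```

In this form, a factor drops out in stage 1 when its whole coefficient block is shrunk to zero, and two levels k and l of a retained factor are merged in stage 2 when their estimated effects are shrunk to the same value.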
The count models studied in this thesis cover Poisson regression and negative binomial regression, and account for over-dispersion and zero inflation (zero-inflated models). The proposed inference method is nevertheless broadly applicable to other generalized linear models.
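For reference, the standard textbook forms of these count models are sketched below. The parameterization the thesis itself uses is not stated in the abstract, so the NB2 variance form, the zero-inflated Poisson mixture, and the symbols \mu_i, \alpha and \pi_i are assumptions made for illustration.

```latex
% Poisson regression with log link; x_{ij} is the dummy-coded level of factor j for run i
Y_i \mid x_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \sum_{j=1}^{J} x_{ij}^{\top} \beta_j, \qquad
\mathrm{Var}(Y_i \mid x_i) = \mu_i

% Negative binomial (NB2 form): same mean, extra dispersion parameter \alpha > 0
\mathrm{Var}(Y_i \mid x_i) = \mu_i + \alpha \mu_i^2

% Zero-inflated Poisson: a point mass at zero with probability \pi_i mixed with a Poisson count
P(Y_i = 0 \mid x_i) = \pi_i + (1 - \pi_i) e^{-\mu_i}, \qquad
P(Y_i = y \mid x_i) = (1 - \pi_i) \frac{e^{-\mu_i} \mu_i^{y}}{y!}, \quad y = 1, 2, \ldots
```

The negative binomial variance exceeds the Poisson variance whenever \alpha > 0, which is how it accommodates over-dispersion, and it approaches the Poisson case as \alpha tends to 0.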
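Numerical simulations and a real-data analysis of process tool combinations show that the proposed estimation method performs well both in identifying the important factors and in grouping similar levels.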
Finding the golden path in a production process is an important issue in intelligent manufacturing. This thesis addresses the problem for the specific case in which production quality is measured by failure counts and all factors relevant to production quality are categorical variables. Traditional approaches first identify the important factors (tools) affecting the yield of the process and then determine the production path that maximizes the mean yield, called the golden path. This thesis further takes into account the clustering patterns of tool effects to provide a more flexible golden-path solution in practice. To achieve this goal, a two-stage inference procedure for count data with categorical covariates is developed in a generalized linear model framework. A penalized likelihood approach is adopted for estimation and variable selection: the important factors are identified in the first stage by incorporating group lasso regularization, and tool clustering is carried out in the second stage by incorporating fused lasso regularization. The effectiveness of the proposed method is demonstrated through a simulation study for Poisson models and an application to real manufacturing data collected from a 13-stage production process. In both the simulations and the application, the proposed methodology successfully identifies the important factors and recovers the cluster patterns of effects within factors reasonably well.
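To make the two kinds of shrinkage concrete, here is a minimal Python sketch of the building blocks behind them. It is not the thesis's implementation: `group_soft_threshold` and `cluster_levels` are hypothetical helper names, and the tool labels and numbers are made up. The first function is the block soft-thresholding operator associated with a group lasso penalty, which sets a whole factor to zero when its coefficient norm is small. The second reads off merged levels from effects that a fusion-type penalty has already shrunk to common values.

```python
import math


def group_soft_threshold(coefs, threshold):
    """Block soft-thresholding: proximal operator of threshold * ||coefs||_2.

    Returns all zeros when the Euclidean norm of the block is at most
    `threshold`; otherwise shrinks the whole block toward zero.
    """
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= threshold:
        return [0.0] * len(coefs)
    scale = 1.0 - threshold / norm
    return [scale * c for c in coefs]


def cluster_levels(effects, tol=1e-8):
    """Group level names whose fitted effects agree within `tol`.

    `effects` maps a level name to its fitted effect. Levels are sorted by
    effect, and a level joins the current cluster when its effect is within
    `tol` of that cluster's first member.
    """
    clusters = []
    for level, value in sorted(effects.items(), key=lambda kv: kv[1]):
        if clusters and abs(value - clusters[-1][1]) <= tol:
            clusters[-1][0].append(level)
        else:
            clusters.append(([level], value))
    return [levels for levels, _ in clusters]


# A weak factor is removed entirely; a strong one is only shrunk.
dropped = group_soft_threshold([3.0, 4.0], 10.0)
shrunk = group_soft_threshold([3.0, 4.0], 2.5)

# Hypothetical fused effects for four tools at one production stage.
merged = cluster_levels(
    {"tool_A": 0.2, "tool_B": -0.1, "tool_C": 0.2, "tool_D": -0.1}
)
```

In a full fitting routine these operations would sit inside an iterative optimizer for the penalized count-model likelihood; the sketch only isolates the step that produces factor-level sparsity and the step that turns tied effects into tool groups.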