Graduate student: | 李泓緯 Li, Hung-Wei |
---|---|
Thesis title: | 資料縮減方法在不平衡分類及節點部署之應用 Applications of Data Reduction Methods on Classification for Imbalanced Data and Knot Placement for Prediction Problems |
Advisor: | 徐南蓉 Hsu, Nan-Jung |
Oral defense committee: | 黃信誠 Huang, Hsin-Cheng; 陳春樹 Chen, Chun-Shu |
Degree: | Master |
Department: | College of Science - Institute of Statistics |
Year of publication: | 2022 |
Academic year of graduation: | 110 (ROC calendar) |
Language: | English |
Number of pages: | 49 |
Chinese keywords: | 子抽樣、薄板樣條、空間隨機效應模型、不平衡資料、節點部署 |
English keywords: | subsampling, thin-plate spline, spatial random effect model, imbalanced data, knot placement |
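Big data brings high computational costs when fitting statistical models or training machine learning models. To overcome this issue, fitting and training can be carried out on only a subsample of the data. This thesis examines the effectiveness of two subsampling methods, SPlit (Joseph and Vakayil, 2021) and Supercompress (Joseph and Mak, 2021), in three applications: classification of imbalanced data, knot placement in smoothing prediction problems, and spatial prediction. Simulation experiments lead to the following conclusions: (1) For imbalanced data classification, SPlit draws more representative subsamples and does improve classification accuracy for the minority class; its classification performance is clearly better than that of simple random sampling. (2) In both knot placement applications, whether spatial smoothing or prediction under the spatial random effects model, Supercompress gives better overall spatial prediction performance than the traditional uniform knot placement when the spread of the data and the variation in their values are highly uneven.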
Large data sets typically bring high computational costs when fitting statistical or machine learning models. To overcome this issue, a model can be fitted to a subsample of the data instead of the entire data set. The literature offers several ways of taking representative subsamples. This thesis focuses on two recent subsampling methods, SPlit (Joseph and Vakayil, 2021) and Supercompress (Joseph and Mak, 2021), and studies their usefulness in two kinds of application: classification with imbalanced data and knot placement for prediction problems. Their prediction accuracy is studied via numerical simulations and compared with that of traditional approaches. In imbalanced data classification problems, the SPlit method is found to perform generally better than simple random sampling in terms of classification accuracy for the minority class. In knot placement, the Supercompress method performs better than uniform placement when data patterns are inhomogeneous over the spatial domain, both for thin-plate spline fitting and for fixed rank kriging in the spatial random effects model framework.
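The abstract does not say how the representativeness of a subsample is measured. One common yardstick is the energy distance of Székely and Rizzo (2013), which also underlies the support-points criterion that SPlit builds on. The following Python sketch is an illustration only, not code from the thesis: the function names, the toy imbalanced data set, and the subsample size are all assumptions. It computes the empirical energy distance between a simple random subsample and the full data, where a smaller value indicates a closer match between the two empirical distributions.

```python
import math
import random


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y) with x in xs, y in ys."""
    total = 0.0
    for x in xs:
        for y in ys:
            total += math.dist(x, y)
    return total / (len(xs) * len(ys))


def energy_distance(sample, data):
    """Empirical energy distance: 2 E|X - Y| - E|X - X'| - E|Y - Y'|.

    All pairs are included (V-statistic form), so the value is
    non-negative and equals zero when both point sets coincide.
    """
    return (2.0 * mean_pairwise_distance(sample, data)
            - mean_pairwise_distance(sample, sample)
            - mean_pairwise_distance(data, data))


# Hypothetical imbalanced 2-D data: 190 majority points around (0, 0)
# and 10 minority points around (4, 4). These numbers are made up.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(190)]
minority = [(rng.gauss(4, 0.5), rng.gauss(4, 0.5)) for _ in range(10)]
data = majority + minority

# Simple random sampling of 20 points, the baseline named in the abstract.
random_subsample = rng.sample(data, 20)
print("energy distance, random subsample vs. full data:",
      energy_distance(random_subsample, data))
```

A method that selects points to make this distance small should retain minority-class points roughly in proportion to their share of the data, whereas a simple random subsample may miss them by chance.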
Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226.
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., and Herrera, F. (2018). Learning from Imbalanced Data Sets, volume 10. Springer, Nature Switzerland AG.
Hastie, T. and Hastie, M. T. (2015). R package gam.
Huang, C., Joseph, V. R., and Huang, M. C. (2022). R package supercompress.
Joseph, V. R. and Mak, S. (2021). Supervised compression of big data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 14(3):217–229.
Joseph, V. R. and Vakayil, A. (2021). SPlit: An optimal method for data splitting. Technometrics, 64(2):1–11.
Ribeiro Jr, P. J., Diggle, P. J., Ribeiro Jr, M. P. J., and Imports, M. (2020). R package geoR.
Särndal, C.-E., Swensson, B., and Wretman, J. (1992). Model Assisted Survey Sampling, volume 8. Springer-Verlag, New York.
Székely, G. J. and Rizzo, M. L. (2013). Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272.
Tzeng, S. and Huang, H.-C. (2018). Resolution adaptive fixed rank kriging. Technometrics, 60(2):198–208.
Vakayil, A., Joseph, R., Mak, S., and Vakayil, M. A. (2021). R package SPlit.
Vakayil, A. and Joseph, V. R. (2021). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 70:1–13.
Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114.