Graduate student: | 林巍源 Lin, Wei Yuan |
---|---|
Thesis title: | 資料品質 Data Quality |
Advisor: | 張韻詩 Chang, Yun Shih |
Committee members: | 金仲達 Jin, Zhong Da; 朱宗賢 Zhu, Zong Xian |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2016 |
Academic year of graduation: | 105 |
Language: | English |
Number of pages: | 67 |
Chinese keywords: | 資料品質, 即時資料品質控管, 異常值偵測 |
English keywords: | data quality, real time quality control, outlier detection |
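Data quality is an important issue in many fields, and a successful project inevitably depends on high-quality data. This thesis proposes several methods for analyzing the quality of observational data in real time. One goal of data quality control is to find anomalous data among the observations. We focus on one kind of anomaly, the outlier: this thesis defines an observation record as an outlier if its value differs from the values of other records by more than a specified margin. The thesis provides four methods for finding outliers in observation records: two existing outlier detection algorithms, the local outlier factor (LOF) algorithm and the local distance outlier factor (LDOF) algorithm, and two other methods, the partially ordered data analysis method and the minimum bounding box method. Chapter 4 reports a number of experiments that evaluate the performance of these methods and check whether they identify outliers correctly.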
Data quality is an important issue in almost all application domains. A successful project invariably relies on high-quality data. This thesis proposes methods for analyzing the quality of observational data in real time. One objective of real-time quality control is to detect anomalies in observational data. Specifically, this thesis focuses on detecting a common type of anomaly called an outlier. We say a data item is an outlier when its value differs by a specified amount from the values of other data items recorded under similar conditions. The thesis presents an overview of outlier detection for real-time quality control, covering two existing outlier detection algorithms, the local outlier factor (LOF) algorithm and the local distance outlier factor (LDOF) algorithm, together with a partially ordered data analysis method and a minimum bounding box method. The thesis also describes several simulation experiments that evaluate the performance of these methods.
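As a rough illustration of one of the methods named in the abstract, the sketch below computes an LDOF-style score in plain Python. It assumes the usual formulation of LDOF: the average distance from a point to its k nearest neighbours, divided by the average pairwise distance among those neighbours. The function name `ldof`, the sample readings, and the choice of `k` are hypothetical and are not taken from the thesis.

```python
import math
from itertools import combinations


def ldof(points, k):
    """Return an LDOF-style score per point (higher means more outlying).

    Illustrative sketch only; not the thesis's implementation.
    """
    if not 2 <= k < len(points):
        raise ValueError("k must satisfy 2 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        # k nearest neighbours of p, excluding p itself
        others = [q for j, q in enumerate(points) if j != i]
        neighbours = sorted(others, key=lambda q: math.dist(p, q))[:k]
        # average distance from p to its neighbours
        knn_dist = sum(math.dist(p, q) for q in neighbours) / k
        # average pairwise distance among the neighbours themselves
        pairs = list(combinations(neighbours, 2))
        inner_dist = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
        if inner_dist == 0:
            # neighbours coincide: p is an outlier only if it lies apart from them
            scores.append(0.0 if knn_dist == 0 else math.inf)
        else:
            scores.append(knn_dist / inner_dist)
    return scores


# Hypothetical readings: (temperature in degrees C, relative humidity in %)
readings = [
    (25.1, 60.0), (25.3, 61.0), (24.9, 59.5),
    (25.0, 60.5), (25.2, 60.2), (40.0, 20.0),
]
scores = ldof(readings, k=3)
# index of the reading with the highest score
suspect = max(range(len(readings)), key=scores.__getitem__)
print(suspect, readings[suspect])
```

A score well above 1 means the point lies far outside the region spanned by its own neighbours. In a real-time setting the same score could be recomputed for each new reading against a sliding window of recent records, with the cutoff chosen empirically.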