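Graduate student: | 陳蓓蓓 Chen, Pei-Pei |
---|---|
Thesis title: | 基於物聯網隱私保護性資料傳輸之時間序列模擬方法 Simulating IoT-type Time Series for Privacy-Protecting Data Sharing |
Advisor: | 徐茉莉 Shmueli, Galit |
Oral defense committee: | 林福仁 Lin, Fu-ren; 李曉惠 Lee, Hsiao-Hui |
Degree: | Master |
Department: | College of Technology Management - Institute of Service Science |
Year of publication: | 2020 |
Academic year of graduation: | 108 |
Language: | English |
Number of pages: | 70 |
Chinese keywords: | time series, Internet of Things, time-series features, time-series simulation |
Foreign-language keywords: | feature-based |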
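Ethical issues around information in data science research are receiving growing attention, and governments have introduced regulations on data protection and personal privacy. Common privacy concerns include large-scale government surveillance and privacy conflicts caused by the improper use of users' personal data. For example, in the Facebook–Cambridge Analytica data exposure incident, social media users did not know that their personal data was being collected, shared, and sold. In response, the European Union was the first to issue the General Data Protection Regulation (GDPR), which restricts the collection and processing of personal data. Another approach is privacy-preserving data sharing, used for sharing data with third parties.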
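Our research objective is a time-series simulation method for privacy-preserving sharing of IoT data, so that IoT time-series data can be shared freely and analyzed without directly disclosing the data itself, minimizing privacy problems. IoT time series can involve highly sensitive user behavior: for example, whether it is sensor data collected automatically by smart home appliances or data entered by users themselves, such data helps smart appliances understand our daily preferences and assist with daily chores. Common ways to avoid privacy issues when collecting IoT data today include collecting user data that is less privacy-sensitive, or using client-side models so that user data need not be uploaded to the cloud. However, collecting sensitive data still carries privacy risks such as re-identification, reconstruction, and disaggregation of the data (Laforet et al., 2015). We focus on time-series simulation methods for privacy-preserving IoT data sharing. We study the GRATIS time-series simulation framework, which simulates time series based on the features of the original series. By studying how the GRATIS framework simulates IoT time series, we address the following three research questions: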
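1. Which set of time-series features is most suitable for simulating IoT-type time series?
2. Does performance differ across time-series periodicities (e.g. hourly or daily)?
3. How can feature-based time-series simulation be used to protect privacy?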
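To study the utility of the GRATIS simulation method, we apply two simulation approaches (entire-series simulation, piecewise simulation) to three feature sets (GRATIS, CompEngine, catch22) at two periodicities (hourly, minutely). We evaluate performance by graphically comparing the original and simulated data and by computing RMSE similarity. We validate the approach on two IoT datasets: heart-rate data from a fitness band and household power consumption data. Our research helps clarify how to share simulated IoT-type time series so as to balance privacy protection and accuracy. In this thesis we also propose approaches for privacy control and sharing strategy.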
Ethical issues in data mining have been receiving more attention, and several laws and regulations have emerged emphasizing the importance of data protection and privacy. Some common privacy concerns include big brother watching and unintended use of personal data. For instance, the Facebook–Cambridge Analytica data scandal shows that social media users are unaware that personal data is collected, shared, and sold. Hence, laws are needed for protecting personal data-related rights. The European Union came out with the first regulation called the General Data Protection Regulation (GDPR) to restrict the collection and processing of personal data. Another approach relies on privacy-preserving data sharing, such as methods employed by bureaus of statistics for sharing administrative data with various users.
This research aims to find a solution that allows sharing IoT time-series data for purposes of analysis while minimizing privacy issues, by preventing the harm to data subjects that comes from directly disclosing their data. IoT time series can be very sensitive: for example, collecting sensor data or user-entered data from smart home applications is necessary for understanding our preferences and assisting with our daily chores. Common methods for avoiding sharing sensitive IoT data include collecting less sensitive data or building models on local machines without transmitting user data to cloud services. However, collecting sensitive data still poses privacy risks such as re-identification, reconstruction, and disaggregation (Laforet et al., 2015).
We focus on a simulation approach for sharing time-series IoT data. In our research, we study the ability of the GRATIS scheme by Kang et al. (2020) to provide simulated series that are sufficiently different from the original, yet preserve the main features needed for analysis. By studying this approach, we are able to answer the following questions.
1. What is a suitable set of time series features for simulating IoT-type time series?
2. How does performance vary across different time series periodicities (e.g. hourly or daily)?
3. How can feature-based simulated time series be useful for protecting privacy?
To study the ability of the GRATIS simulation approach, we compare three feature sets (GRATIS, CompEngine, catch22) on two periodicities (hourly, minutely), for two simulation approaches (entire series simulation, piecewise simulation). We evaluate their performance by graphically comparing the original and simulated data and by computing RMSE similarity measures. Our application uses real IoT data on household power consumption and on heart rate from a fitness band.
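The RMSE comparison mentioned above can be illustrated with a minimal sketch. It assumes RMSE is taken pointwise between an original series and a simulated series of equal length, and that a piecewise evaluation applies the same measure to consecutive segments; the abstract does not state the exact computation used in the thesis. The function names `rmse` and `piecewise_rmse` and the toy numbers are hypothetical.

```python
import math
from typing import List, Sequence


def rmse(original: Sequence[float], simulated: Sequence[float]) -> float:
    """Root mean squared error between two equal-length, non-empty series."""
    if len(original) != len(simulated):
        raise ValueError("series must have the same length")
    if len(original) == 0:
        raise ValueError("series must be non-empty")
    squared_errors = [(o - s) ** 2 for o, s in zip(original, simulated)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def piecewise_rmse(
    original: Sequence[float],
    simulated: Sequence[float],
    segment_length: int,
) -> List[float]:
    """RMSE on consecutive segments; the last segment may be shorter."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    if len(original) != len(simulated):
        raise ValueError("series must have the same length")
    return [
        rmse(original[i:i + segment_length], simulated[i:i + segment_length])
        for i in range(0, len(original), segment_length)
    ]


# Toy readings (hypothetical values, not data from the thesis).
original = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
simulated = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]

overall = rmse(original, simulated)
per_segment = piecewise_rmse(original, simulated, segment_length=3)
print(overall, per_segment)
```

Reporting the per-segment values alongside the overall value shows where a simulated series departs from the original, which a single overall RMSE averages away.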
Our findings contribute to the body of knowledge on how to share IoT-type simulated series while balancing privacy protection and accuracy. Drawing on these results, we propose several approaches for privacy control and sharing strategy.
Allhoff, F., & Henschke, A. (2018). The internet of things: Foundational ethical issues. Internet of Things, 1, 55–66.
Amazon.com: Echo (3rd Gen)- Smart speaker with Alexa- Charcoal: Amazon Devices. (n.d.). Retrieved March 17, 2020, from https://www.amazon.com/all-new-Echo/dp/B07NFTVP7P/ref=sxin_0_ac_d_pm?ac_md=2-1-QmV0d2VlbiAkNTAgYW5kICQxMDA%3D-ac_d_pm&cv_ct_cx=amazon+echo&keywords=amazon+echo&pd_rd_i=B07NFTVP7P&pd_rd_r=8f222265-b2e8-4f94-96cc-0ebde9556244&pd_rd_w=onTp1&pd_rd_wg=X2Lz5&pf_rd_p=0e223c60-bcf8-4663-98f3-da892fbd4372&pf_rd_r=NPEP9CA1NCG41J804EP4&psc=1&qid=1584425825
Amazon.com: Hello MB15226/W1 Sense with Voice Sleep System—Cotton (Current Generation—2nd): Health & Personal Care. (n.d.). Retrieved March 17, 2020, from https://www.amazon.com/Hello-MB15226-W1-Sense-System/dp/B01M9F2WLE/ref=dp_ob_title_wld
Apple. (n.d.). Retrieved March 17, 2020, from https://www.apple.com/
Ashouri, M., Shmueli, G., & Sin, C.-Y. (2019). Tree-based methods for clustering time series using domain-relevant attributes. Journal of Business Analytics, 2(1), 1–23. https://doi.org/10.1080/2573234X.2019.1645574
Banerjee, S., Hemphill, T., & Longstreet, P. (2018). Wearable devices and healthcare: Data sharing and privacy. The Information Society, 34(1), 49–57.
Bertoni, S. (2014). Oscar health using misfit wearables to reward fit customers. In Forbes.
Constantinopoulos, C., Titsias, M. K., & Likas, A. (2006). Bayesian Feature and Model Selection for Gaussian Mixture Models. IEEE Computer Society. https://doi.org/10.1109/TPAMI.2006.111
Dalenius, T., & Reiss, S. P. (1982). Data-swapping: A technique for disclosure control. Journal of Statistical Planning and Inference, 6(1), 73–85. https://doi.org/10.1016/0378-3758(82)90058-1
Duncan, G. T., Fienberg, S. E., Krishnan, R., Padman, R., & Roehrig, S. F. (2001). Disclosure Limitation Methods and Information Loss for Tabular Data. 31.
Duncan, G. T., Pearson, R. W., & others. (1991). Enhancing access to microdata while protecting confidentiality: Prospects for the future. Statistical Science, 6(3), 219–232.
Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), 577–588.
Fang, X., Misra, S., Xue, G., & Yang, D. (2011). Smart grid—The new and improved power grid: A survey. IEEE Communications Surveys & Tutorials, 14(4), 944–980.
Fulcher, B. D. (2017). Feature-based time-series analysis. ArXiv:1709.08055 [Cs]. http://arxiv.org/abs/1709.08055
Fulcher, B. D., & Jones, N. S. (2014). Highly comparative feature-based time-series classification. IEEE Transactions on Knowledge and Data Engineering, 26(12), 3026–3037. https://doi.org/10.1109/TKDE.2014.2316504
Fulcher, B. D., & Jones, N. S. (2017). hctsa: A Computational Framework for Automated Time-Series Phenotyping Using Massive Feature Extraction. Cell Systems, 5(5), 527-531.e3. https://doi.org/10.1016/j.cels.2017.10.001
Fulcher, B. D., Lubba, C. H., Sethi, S. S., & Jones, N. S. (2019). CompEngine: A self-organizing, living library of time-series data. ArXiv:1905.01042 [Physics]. http://arxiv.org/abs/1905.01042
Fuller, W. (1993). Masking procedures for microdata disclosure. Journal of Official Statistics, 9(2), 383–406.
Google, T3007ES, Nest Learning Thermostat, 3rd Gen, Smart Thermostat, Stainless Steel, Works With Alexa - Amazon.com. (n.d.). Retrieved March 17, 2020, from https://www.amazon.com/Nest-T3007ES-Thermostat-Temperature-Generation/dp/B0131RG6VK/ref=sr_1_2?keywords=google+nest&qid=1584425415&sr=8-2
Greveler, U., Glosekotter, P., Justus, B., & Loehr, D. (n.d.). Multimedia Content Identification Through Smart Meter Power Usage Profiles. 8.
Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., & Blei, D. M. (2004). Hierarchical topic models and the nested chinese restaurant process. Advances in Neural Information Processing Systems, 17–24.
Kadane, J. B., Krishnan, R., & Shmueli, G. (2006). A Data Disclosure Policy for Count Data Based on the Com-Poisson Distribution. Management Science, 52(10), 1610–1617. JSTOR.
Kang, Y., Hyndman, R. J., & Li, F. (2020). GRATIS: GeneRAting TIme Series with diverse and controllable characteristics. ArXiv:1903.02787 [Cs, Stat]. http://arxiv.org/abs/1903.02787
Kang, Y., Hyndman, R. J., & Smith-Miles, K. (2017). Visualising forecasting algorithm performance using time series instance spaces. International Journal of Forecasting, 33(2), 345–358. https://doi.org/10.1016/j.ijforecast.2016.09.004
Kegel, L., Hahmann, M., & Lehner, W. (2017). Generating What-If Scenarios for Time Series Data. Proceedings of the 29th International Conference on Scientific and Statistical Database Management, 1–12. https://doi.org/10.1145/3085504.3085507
Kramer, O. (2017). Genetic Algorithm Essentials (Vol. 679). Springer International Publishing. https://doi.org/10.1007/978-3-319-52156-5
Kravets, D. (2016). Sex toys and the Internet of Things collide—What could go wrong. Ars Technica, September.
Kumar, J. S., & Patel, D. R. (2014). A survey on internet of things: Security and privacy issues. International Journal of Computer Applications, 90(11).
Laforet, F., Buchmann, E., & Böhm, K. (2015). Individual privacy constraints on time-series data. Information Systems, 54(C), 74–91. https://doi.org/10.1016/j.is.2015.06.006
Lars, N. (2014). Connected Medical Devices, Apps: Are They Leading the IoT Revolution—Or Vice Versa. Wired. http://www.wired.com/2014/06/connected-medical-devices-apps-leading-iot-revolution-vice-vers (accessed February 27, 2015).
Lin, H., & Bergmann, N. W. (2016). IoT Privacy and Security Challenges for Smart Home Environments. Information, 7(3), 44. https://doi.org/10.3390/info7030044
Lotze, T., Shmueli, G., & Yahav, I. (2010). Simulating and Evaluating Biosurveillance Datasets. In T. Kass-Hout & X. Zhang (Eds.), Biosurveillance. Chapman and Hall/CRC. https://doi.org/10.1201/b10315-3
Lubba, C. H., Sethi, S. S., Knaute, P., Schultz, S. R., Fulcher, B. D., & Jones, N. S. (2019). catch22: CAnonical Time-series CHaracteristics. ArXiv:1901.10200 [Cs, Stat]. http://arxiv.org/abs/1901.10200
Matthews, G. J., & Harel, O. (2011). Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy. Statistics Surveys, 5, 1–29. https://doi.org/10.1214/11-SS074
Molina-Markham, A., Shenoy, P., Fu, K., Cecchet, E., & Irwin, D. (2010). Private memoirs of a smart meter. Proceedings of the 2nd ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Building, 61–66. https://doi.org/10.1145/1878431.1878446
Moore, R. A. (1996). Controlled data-swapping techniques for masking public use microdata sets. 42.
Mörchen, F. (2003). Time series feature extraction for data mining using DWT and DFT.
Motiv Ring | 24/7 Smart Ring | Fitness + Sleep Tracking | Online Security Motiv Ring. (n.d.). Retrieved March 17, 2020, from https://mymotiv.com/
Nair, B., K, S., Balakrishna, R., Bharathvajan, L., & Krishna, V. (2019). Household water consumption dataset. 1. https://doi.org/10.17632/2yjwrft6nr.1
OECD. (2008). OECD Glossary of Statistical Terms. OECD Publishing.
OECD Glossary of Statistical Terms—Cell suppression Definition. (n.d.). Retrieved March 17, 2020, from https://stats.oecd.org/glossary/detail.asp?ID=6891
Povinelli, R. J., Johnson, M. T., Lindgren, A. C., & Ye, J. (2004). Time series classification using Gaussian mixture models of reconstructed phase spaces. IEEE Transactions on Knowledge and Data Engineering, 16(6), 779–783.
Raghunathan, T., Reiter, J., & Rubin, D. (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19, 1–16.
Reiter, J. P. (2002). Satisfying disclosure restrictions with synthetic data sets. Journal of Official Statistics, 18(4), 531.
Rubin, D. B. (1993). Statistical disclosure limitation. Journal of Official Statistics, 9(2), 461–468.
Rubin, D. B. (2004). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
Safavi, S., & Shukur, Z. (2014). Conceptual Privacy Framework for Health Information on Wearable Device. PLoS ONE, 9(12), e114306. https://doi.org/10.1371/journal.pone.0114306
Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl Jr, K. C. (2017). Data mining for business analytics: Concepts, techniques, and applications in R. John Wiley & Sons.
Shop Fitbit Versa 2™ Smartwatch. (n.d.). Retrieved March 17, 2020, from https://www.fitbit.com/us/products/smartwatches/versa
Sweeney, L. (2001). Computational disclosure control: A primer on data privacy protection [Thesis, Massachusetts Institute of Technology]. https://dspace.mit.edu/handle/1721.1/8589
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557–570. https://doi.org/10.1142/S0218488502001648
Villani, M., Kohn, R., & Giordani, P. (2009). Regression density estimation using smooth adaptive Gaussian mixtures. Journal of Econometrics, 153(2), 155–173.
Wachter, S. (2018). Normative challenges of identification in the Internet of Things: Privacy, profiling, discrimination, and the GDPR. Computer Law & Security Review, 34(3), 436–449.
Wang, X., Smith, K., & Hyndman, R. (2006). Characteristic-Based Clustering for Time Series Data. Data Mining and Knowledge Discovery, 13(3), 335–364. https://doi.org/10.1007/s10618-005-0039-x
Ward, D. (2007). Data and metadata reporting and presentation handbook. OECD Publishing.
Warren Liao, T. (2005). Clustering of time series data—A survey. Pattern Recognition, 38(11), 1857–1874. https://doi.org/10.1016/j.patcog.2005.01.025
Watch—Apple. (n.d.). Retrieved March 17, 2020, from https://www.apple.com/watch/
Wong, C. S., & Li, W. K. (2000). On a mixture autoregressive model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1), 95–115. https://doi.org/10.1111/1467-9868.00222
Yang, Y., & Hyndman, R. (2019). Introduction to the tsfeatures package. https://cran.r-project.org/web/packages/tsfeatures/vignettes/tsfeatures.html
Ye, L., & Keogh, E. (2009). Time series shapelets: A new primitive for data mining. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 947–956.
Zarsky, T. Z. (2016). Incompatible: The GDPR in the age of big data. Seton Hall L. Rev., 47, 995.