Graduate student: | 陶瑞 Musa Touray |
---|---|
Thesis title: | Review of Parallel Computation Strategies for Statistical Service Engines Using R (檢視使用R為統計服務核心之平行計算策略) |
Advisor: | 雷松亞 Soumya Ray |
Oral defense committee: | 林福仁 Lin, Fu-Ren; 王俊程 Wang, Jyun-Cheng |
Degree: | Master |
Department: | College of Technology Management, International Master of Business Administration (IMBA) |
Year of publication: | 2013 |
Academic year of graduation: | 101 |
Language: | English |
Number of pages: | 77 |
Keywords (Chinese): | 平行計算策略 (parallel computation strategies), 統計服務 (statistical service) |
Keywords (English): | Parallel Computation Strategies, Statistical Service, R-Engine |
ABSTRACT
In an enterprise environment, source data are stored in various forms such as files, databases, and streaming data. Currently, analysts conduct data analysis in offline mode using statistical software [5]. In a conventional sequential computer, processing is channeled through one physical location. In a parallel machine, processing can occur simultaneously at many locations, and consequently many more computational operations per second should be achievable. Due to the rapidly decreasing cost of processing, memory, and communication, it has appeared inevitable for at least two decades that parallel machines will eventually displace sequential ones in computationally demanding fields [9]. Many modern enterprises are collecting data at the most detailed level possible, creating data repositories ranging from terabytes to petabytes in size. The ability to apply sophisticated statistical analysis methods to this data is becoming essential for marketplace competitiveness. This need to perform deep analysis over huge data repositories poses a significant challenge to existing statistical software and data management systems. On the one hand, statistical software provides rich functionality for data analysis and modeling but can handle only limited amounts of data; for example, popular packages like R and SPSS operate entirely in main memory. On the other hand, data-intensive management systems, such as MapReduce-based systems, can scale to petabytes of data but provide insufficient analytical functionality [1].
We review the statistical model in Lee’s paper [5], which runs in sequential mode and processes the data given to it by the Application Server. We use the Hadoop/MapReduce model as the statistical engine at the back end of our architecture for data analysis algorithms (the statistical service engine solution). The engine has two parts: the R statistical analysis system and an implementation of the Hadoop data management system. The model consists of three components: an R driver process operated by the data analyst, a Hadoop cluster that hosts the data and runs Jaql (and possibly also some R sub-processes), and an R-Jaql bridge that connects these two components [1]. The aim is to improve the performance, scalability, and functionality of the statistical jobs sent to the engine in a cluster or distributed environment. We also use the Message Passing Interface (MPI) and parallel DBMS computation to support our model of parallel computation. In this way, we build a new system architecture for the statistical service engine solution of Lee’s paper.
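As a rough illustration of the map/reduce pattern that the Hadoop back end described above relies on, the following Python sketch computes a mean and variance from per-partition summaries. It is not taken from the thesis or from the cited Ricardo system: the function names (`partition`, `map_partial`, `combine`, `parallel_mean_variance`) are hypothetical, and a local thread pool stands in for a cluster purely to show the structure.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import math


def partition(values, n_parts):
    """Split a list into contiguous chunks (stand-ins for distributed data blocks)."""
    size = max(1, math.ceil(len(values) / n_parts))
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_partial(chunk):
    """Map step: summarize one chunk as (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def parallel_mean_variance(values, n_parts=4):
    """Compute the mean and population variance from per-chunk summaries."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = partition(values, n_parts)
    # Assumption: threads only mimic independent workers; on a real
    # cluster each chunk would be handled by a separate map task.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(map_partial, chunks))
    n, s, ss = reduce(combine, partials, (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance
```

Calling `parallel_mean_variance([1, 2, 3, 4, 5, 6, 7, 8], n_parts=3)` returns `(4.5, 5.25)`. Only the small summaries travel between the map and reduce steps, which is why this kind of statistic can scale past the main-memory limit of a single sequential process.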
REFERENCES
1. Sudipto Das (University of California, Santa Barbara, CA, USA), Yannis Sismanis, Kevin S. Beyer (IBM Almaden Research Center, San Jose, CA, USA); “Ricardo: Integrating R and Hadoop”; p. 1; 2010; ACM, New York, NY, USA
2. Sudipto Das (University of California, Santa Barbara, CA, USA), Yannis Sismanis, Kevin S. Beyer (IBM Almaden Research Center, San Jose, CA, USA); “Ricardo: Integrating R and Hadoop”; p. 1; 2010; ACM, New York, NY, USA
3. Surajit Chaudhuri, Umeshwar Dayal; “An Overview of Data Warehousing and OLAP Technology”; p. 1; March 1997; Newsletter, ACM SIGMOD, New York, NY, USA
4. Surajit Chaudhuri (Microsoft Research, Redmond), Umeshwar Dayal (Hewlett-Packard Labs, Palo Alto); “An Overview of Data Warehousing and OLAP Technology”; p. 1; March 1997; Newsletter, ACM SIGMOD, New York, NY, USA
5. Rich C. Lee; “A Novel Data Analysis Service Oriented Architecture Using R”; March 2012; Proceedings of the International Symposium on Grids and Clouds (ISGC 2012), 26 February-2 March, Taipei, Taiwan; published online at http://pos.sissa.it/cgi-bin/reader/conf.cgi?confid=153, id. 8
6. Boris Evelson; “Topic Overview: Business Intelligence”; November 2008; Forrester Research Inc.
7. B. P. Kumar, J. Selvam, V. Meenakshi, K. Kanthi, A. Suseela, and V. L. Kumar; “Business Decision Making, Management and Information Technology”; Ubiquity, vol. 2007, p. 4; 2007
8. H. Arsham; “Statistical Thinking for Managerial Decisions (9 ed.)”; 2007; retrieved July 14, 2007, from http://home.ubalt.edu/ntsbarsh/Business-stat/opre504.htm
9. Leslie G. Valiant; “A Bridging Model for Parallel Computation”; Communications of the ACM, vol. 33; August 1990; ACM, New York, NY, USA
10. Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton; “MAD skills: New analysis practices for big data”; Proceedings of the VLDB Endowment (PVLDB), 2(2):1481–1492; August 2009
11. Jeffrey Dean and Sanjay Ghemawat; “MapReduce: simplified data processing on large clusters”; pp. 137–150; October 2004; OSDI ’04: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation; USENIX Association, USA
12. Robert J. Stewart, Phil W. Trinder, and Hans-Wolfgang Loidl; “JAQL: Query Language for JavaScript Object Notation (JSON)”; retrieved 2009 from http://code.google.com/p/jaql
13. Ross Ihaka and Robert Gentleman; “R: A Language for Data Analysis and Graphics”; May 1995; published by American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of America; published online 21 February 2012
14. W. N. Venables, D. M. Smith, R. Gentleman, and R. Ihaka; “An Introduction to R”; retrieved May 2013 from http://cran.r-project.org/doc/manuals/R-intro.html
15. Wendy K. Murphrey (SAS Institute); “Connections JMP and ODBC”; April 2005; retrieved May 2013 from www.jmp.com/software/whitepapers/pdfs/102427_jmp_odbc.pd
16. Robert I. Kabacoff; “Access to Database Management Systems (DBMS)”; 2012; retrieved from www.statmethods.net/input/dbinterface.html
17. Google Project Hosting; “Jaql Query Language for JavaScript Object Notation”; retrieved May 2013 from http://code.google.com/p/jaql/wiki/JaqlOverview
18. Mike Olson; “HADOOP: Scalable, Flexible Data Storage and Analysis”; Spring 2010; IQT Quarterly
19. Grant Mackey, Saba Sehrish, John Bent, Julio Lopez, Salman Habib, Jun Wang; “Introducing Map-Reduce to High End Computing”; 2008; 3rd Petascale Data Storage Workshop (PDSW '08)
20. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden; “A Comparison of Approaches to Large-Scale Data Analysis”; 2009; Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data; ACM, New York, NY, USA
21. William Gropp, Ewing Lusk; “Goals Guiding Design: PVM and MPI”; August 2002; conference paper, 4th IEEE International Conference on Cluster Computing, San Diego, CA, USA
22. P. H. Carns, W. B. Ligon III, S. P. McMillan, and R. B. Ross; “An Evaluation of Message Passing Implementations on Beowulf Workstations”; March 1999; Proceedings of the 1999 IEEE Aerospace Conference (Volume 5)
23. Blaise Barney (Lawrence Livermore National Laboratory); “Parallel computing Tutorial”; retrieved 2012 from www.mcs.anl.gov/mpi/ (https://computing.llnl.gov/tutorials/mpi/)
24. Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, Ulrich Mansmann; “State of the Art in Parallel Computing with R”; August 2009; Journal of Statistical Software, vol. 31, no. 1
25. Michael J. Quinn; Chapter 4 (“Message-Passing Programming”) of the textbook “Parallel Programming in C with MPI and OpenMP”; June 2003; McGraw-Hill Science/Engineering/Math, 1st edition
26. David DeWitt and Jim Gray; “Parallel Database Systems: The Future of High Performance Database Systems”; June 1992; Communications of the ACM, vol. 35; ACM, New York, NY, USA
27. M. Tamer Özsu, Patrick Valduriez; “Distributed and Parallel Database Systems”; 1996; ACM Computing Surveys
28. Michael I. Gordon, William Thies, and Saman Amarasinghe; “Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs”; December 2006; ASPLOS XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 151-162; ACM, New York, NY, USA
29. M. Tamer Özsu and Patrick Valduriez; “Principles of Distributed Database Systems, Third Edition”; 2011; Springer, New York, USA
30. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin; “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads”; August 2009; Proceedings of the VLDB Endowment, vol. 2
31. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Lin (Facebook); “Hive - A petabyte scale data warehouse using Hadoop”; 2010; retrieved June 2013 from http://www.facebook.com/note.php?note id=89508453919
32. Natalie Gruska, Patrick Martin; “Integrating MapReduce and RDBMS”; pp. 212-223; 2010; Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research; ACM, New York, NY, USA
33. Fei Chen, Meichun Hsu (HP Labs); “A Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics”; pp. 613-624; 2013; EDBT: Proceedings of the 16th International Conference on Extending Database Technology; ACM, New York, NY, USA