Review of Parallel Computation Strategies for Statistical Service Engines Using R

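Graduate student:	Musa Touray (陶瑞)
Thesis title:	Review of Parallel Computation Strategies for Statistical Service Engines Using R (檢視使用R為統計服務核心之平行計算策略)
Advisor:	Soumya Ray (雷松亞)
Committee members:	Lin, Fu-Ren (林福仁); Wang, Jyun-Cheng (王俊程)
Degree:	Master
Department:	College of Technology Management - International Master of Business Administration (IMBA)
Year of publication:	2013
Academic year of graduation:	101 (ROC calendar)
Language:	English
Number of pages:	77
Keywords (Chinese):	Parallel Computation Strategies, Statistical Service
Keywords (English):	Parallel Computation Strategies, Statistical Service, R-Engine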

ABSTRACT
In an enterprise environment, source data are stored in various forms such as files, databases, and streaming data. Currently, analysts conduct data analysis offline using statistical software [5]. In a conventional sequential computer, processing is channeled through one physical location. In a parallel machine, processing can occur simultaneously at many locations, so many more computational operations per second should be achievable. Because the cost of processing, memory, and communication has been falling rapidly, it has appeared inevitable for at least two decades that parallel machines will eventually displace sequential ones in computationally demanding fields [9]. Many modern enterprises collect data at the most detailed level possible, creating data repositories ranging from terabytes to petabytes in size. The ability to apply sophisticated statistical analysis methods to this data is becoming essential for marketplace competitiveness. The need to perform deep analysis over huge data repositories poses a significant challenge to existing statistical software and data management systems. On the one hand, statistical software provides rich functionality for data analysis and modeling but can handle only limited amounts of data; for example, popular packages like R and SPSS operate entirely in main memory. On the other hand, data-intensive management systems, such as MapReduce-based systems, can scale to petabytes of data but provide insufficient analytical functionality [1].
We review the statistical model in Lee’s paper [5], which runs in sequential mode on the data passed to it by the Application Server. We use the Hadoop/MapReduce model as the statistical engine at the back end of our data analysis architecture (the statistical service engine solution). This engine has two parts: the R statistical analysis system and the Hadoop data management system. The model consists of three components: an R driver process operated by the data analyst, a Hadoop cluster that hosts the data and runs Jaql (and possibly also some R sub-processes), and an R-Jaql bridge that connects these two components [1]. The aim is to improve the scalability, performance, and functionality of the statistical jobs sent to the engine in a cluster or distributed environment. We also use the Message Passing Interface (MPI) and parallel DBMS approaches to support our model of parallel computation. The result is a new system architecture for the statistical service engine solution of Lee’s paper.
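
As an illustration only, the partition-and-combine idea behind a MapReduce-style statistical engine can be sketched as follows. This is a minimal sketch and not the thesis's implementation: the function names (`map_partition`, `combine`, `parallel_mean_variance`) and the sample data are hypothetical, and the map step runs sequentially here, whereas in a cluster each partition would be processed on a different node.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarize one data partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: merge two partial summaries element by element."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def parallel_mean_variance(partitions):
    """Compute the mean and sample variance from per-partition summaries.

    Only the small summaries travel between workers, never the raw data,
    which is what lets this style of computation scale past main memory.
    """
    n, total, total_sq = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Hypothetical data, split into three partitions as if stored on three nodes.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
mean, variance = parallel_mean_variance(partitions)
print(mean, variance)
```

Because the summaries are combined with an associative operation, the partitions can be processed in any order or grouping, which is the property that MapReduce, MPI reductions, and parallel DBMS aggregation all rely on.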

Table of Contents
Chapter 1: Introduction
Chapter 2: Literature Review
2.1: Lee’s Statistical Service Engine Solution
2.2: R-Statistical Engine
2.3: Motivation
Chapter 3: Parallel DBMS Approach of Parallel Computation
3.1: Different Architectures of Parallel DBMS
1.    Shared-memory system
2.    Shared-disk system
3.    Shared-nothing system
3.2: Types of Parallel DBMS
a)    Pipeline Parallelism
b)    Partition Parallelism
3.3: Some Major Terminology of Parallel DBMS
•    Linear Speed-Up
•    Linear Scale-Up
3.4: Advantages and Disadvantages of Parallel DBMS plus SQL Sample Code
a)    Advantages
b)    Disadvantages
Sample Code and Examples of Parallel DBMS
•    Selection / projection / aggregation
•    Sorting
3.5: The Architecture Overview of Parallel DBMS on Statistical Service Engine Solution
Chapter 4: The Approach of Hadoop-Bridge with a High-Level Query Language
4.1: Jaql Query Language
4.2: Hadoop Data Management Systems
4.2.1    Some advantages of Hadoop/MapReduce
4.2.2    Some disadvantages or limitations of Hadoop/MapReduce
4.2.3    An example of Hadoop/MapReduce plus sample code
4.3: The Architecture Overview of Hadoop-Bridge Statistical Service Engine Solution
Chapter 5: Message Passing Interface (MPI) Approach of Parallel Computation
5.1: Programming Model
a)    Shared-memory system
b)    Distributed-memory system
c)    Hybrid distributed-shared memory system
5.2: Operations for Communications
a)    Point-to-Point Operations
b)    Collective Communication
5.3: Sample Design and Examples
a)    Example
b)    The simple architecture of MPI
5.4: Advantages and Disadvantages of Message Passing Interface (MPI)
a)    Advantages of MPI
b)    Disadvantages of MPI
5.5: The Architecture Overview of MPI on Statistical Service Engine Solution
Chapter 6: Comparison of the Three Models
6.1: Scalability
6.2: Ease of Writing and Understanding Code (Programming Model)
6.3: Flexibility
6.4: Performance and Efficiency
6.5: Cost
6.6: Fault Tolerance
6.7: Brief Comparison in Tabular Form
Chapter 7: Conclusion
References

REFERENCES
1. Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, John McPherson; “Ricardo: Integrating R and Hadoop”; p. 1; 2010; ACM, New York, NY, USA
2. Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, John McPherson; “Ricardo: Integrating R and Hadoop”; p. 1; 2010; ACM, New York, NY, USA
3. Surajit Chaudhuri, Umeshwar Dayal; “An Overview of Data Warehousing and OLAP Technology”; p. 1; March 1997; ACM SIGMOD Newsletter, New York, NY, USA
4. Surajit Chaudhuri (Microsoft Research, Redmond), Umeshwar Dayal (Hewlett-Packard Labs, Palo Alto); “An Overview of Data Warehousing and OLAP Technology”; p. 1; March 1997; ACM SIGMOD Newsletter, New York, NY, USA
5. Rich C. Lee; “A Novel Data Analysis Service Oriented Architecture Using R”; March 2012; Proceedings of The International Symposium on Grids and Clouds (ICGC 2012), 26 February-2 March, Taipei, Taiwan; published online at http://pos.sissa.it/cgi-bin/reader/conf.cgi?confid=153, id.8
6. Boris Evelson; “Topic Overview: Business Intelligence”; November 2008; Forrester Research Inc.
7. B. P. Kumar, J. Selvam, V. Meenakshi, K. Kanthi, A. Suseela, and V. L. Kumar; “Business Decision Making, Management and Information Technology”; Ubiquity, vol. 2007, p. 4; 2007
8. H. Arsham; “Statistical Thinking for Managerial Decisions (9th ed.)”; 2007; retrieved July 14, 2007, from http://home.ubalt.edu/ntsbarsh/Business-stat/opre504.htm
9. Leslie G. Valiant; “A Bridging Model for Parallel Computation”; August 1990; Communications of the ACM, Vol. 33; ACM, New York, NY, USA
10. Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton; “MAD Skills: New Analysis Practices for Big Data”; August 2009; Proceedings of the VLDB Endowment (PVLDB), 2(2):1481–1492, USA
11. Jeffrey Dean and Sanjay Ghemawat; “MapReduce: Simplified Data Processing on Large Clusters”; pp. 137–150; October 2004; OSDI’04: Proceedings of the 6th Symposium on Operating Systems Design and Implementation, USENIX Association, USA
12. Robert J. Stewart, Phil W. Trinder, and Hans-Wolfgang Loidl; “JAQL: Query Language for JavaScript Object Notation (JSON)”; retrieved 2009 from http://code.google.com/p/jaql
13. Ross Ihaka and Robert Gentleman; “R: A Language for Data Analysis and Graphics”; May 1995 (published online 21 Feb 2012); American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of America
14. W. N. Venables, D. M. Smith, R. Gentleman, and R. Ihaka; “An Introduction to R”; retrieved May 2013 from http://cran.r-project.org/doc/manuals/R-intro.html
15. Wendy K. Murphrey (SAS Institute); “Connections JMP and ODBC”; April 2005; retrieved May 2013 from www.jmp.com/software/whitepapers/pdfs/102427_jmp_odbc.pd
16. Robert I. Kabacoff; “Access to Database Management Systems (DBMS)”; 2012; retrieved from www.statmethods.net/input/dbinterface.html
17. Google Project Hosting; “Jaql Query Language for JavaScript Object Notation”; retrieved May 2013 from http://code.google.com/p/jaql/wiki/JaqlOverview
18. Mike Olson; “HADOOP: Scalable, Flexible Data Storage and Analysis”; Spring 2010; IQT Quarterly
19. Grant Mackey, Saba Sehrish, John Bent, Julio Lopez, Salman Habib, Jun Wang; “Introducing Map-Reduce to High End Computing”; 2008; 3rd Petascale Data Storage Workshop (PDSW '08)
20. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden; “A Comparison of Approaches to Large-Scale Data Analysis”; 2009; Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data; ACM, New York, NY, USA
21. William Gropp, Ewing Lusk; “Goals Guiding Design: PVM and MPI”; August 2002; 4th IEEE International Conference on Cluster Computing, San Diego, CA, USA
22. P. H. Carns, W. B. Ligon III, S. P. McMillan, and R. B. Ross; “An Evaluation of Message Passing Implementations on Beowulf Workstations”; March 1999; Proceedings of the 1999 IEEE Aerospace Conference (Volume 5)
23. Blaise Barney (Lawrence Livermore National Laboratory); “Parallel Computing Tutorial”; retrieved 2012 from www.mcs.anl.gov/mpi/ (https://computing.llnl.gov/tutorials/mpi/)
24. Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, Ulrich Mansmann; “State of the Art in Parallel Computing with R”; August 2009; Journal of Statistical Software, Vol. 31, No. 1
25. Michael J. Quinn; Chapter 4 (Message-Passing Programming) of the textbook “Parallel Programming in C with MPI and OpenMP”; June 2003; McGraw-Hill Science/Engineering/Math, 1st edition
26. David DeWitt and Jim Gray; “Parallel Database Systems: The Future of High Performance Database Systems”; June 1992; Communications of the ACM, Vol. 35; ACM, New York, NY, USA
27. M. Tamer Özsu, Patrick Valduriez; “Distributed and Parallel Database Systems”; 1996; ACM Computing Surveys
28. Michael I. Gordon, William Thies, and Saman Amarasinghe; “Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs”; pp. 151-162; December 2006; ASPLOS XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems; ACM, New York, NY, USA
29. M. Tamer Özsu and Patrick Valduriez; “Principles of Distributed Database Systems, Third Edition”; 2011; Springer, New York, USA
30. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin; “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads”; August 2009; Proceedings of the VLDB Endowment, Volume 2
31. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu (Facebook); “Hive - A Petabyte Scale Data Warehouse Using Hadoop”; 2010; retrieved June 2013 from http://www.facebook.com/note.php?note id=89508453919
32. Natalie Gruska, Patrick Martin; “Integrating MapReduce and RDBMS”; pp. 212-223; 2010; Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research; ACM, New York, NY, USA
33. Fei Chen, Meichun Hsu (HP Labs); “A Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics”; pp. 613-624; 2013; EDBT: Proceedings of the 16th International Conference on Extending Database Technology; ACM, New York, NY, USA

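Full-text availability (on-campus network): not authorized for public release
Full-text availability (off-campus network): not authorized for public release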
