
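Graduate student: Musa Touray (陶瑞)
Thesis title: Review of Parallel Computation Strategies for Statistical Service Engines Using R
Chinese title: 檢視使用R為統計服務核心之平行計算策略
Advisor: Soumya Ray (雷松亞)
Oral defense committee: Lin, Fu-Ren (林福仁); Wang, Jyun-Cheng (王俊程)
Degree: Master
Department: College of Technology Management - International Master of Business Administration (IMBA)
Year of publication: 2013
Academic year of graduation: 101
Language: English
Number of pages: 77
Chinese keywords: 平行計算策略 (parallel computation strategies), 統計服務 (statistical service)
English keywords: Parallel Computation Strategies, Statistical Service, R-Engine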
  • ABSTRACT
    In an enterprise environment, source data are stored in various forms such as files, databases, and streaming data. Currently, analysts conduct data analysis offline using statistical software [5]. In a conventional sequential computer, processing is channeled through one physical location. In a parallel machine, processing can occur simultaneously at many locations, so many more computational operations per second should be achievable. Because the cost of processing, memory, and communication is falling rapidly, it has appeared inevitable for at least two decades that parallel machines will eventually displace sequential ones in computationally demanding fields [9]. Many modern enterprises collect data at the most detailed level possible, creating data repositories ranging from terabytes to petabytes in size. The ability to apply sophisticated statistical analysis methods to these data is becoming essential for marketplace competitiveness. The need to perform deep analysis over huge data repositories poses a significant challenge to existing statistical software and data management systems. On the one hand, statistical software provides rich functionality for data analysis and modeling but can handle only limited amounts of data; for example, popular packages like R and SPSS operate entirely in main memory. On the other hand, data-intensive management systems, such as MapReduce-based systems, can scale to petabytes of data but provide insufficient analytical functionality [1].
    We review the statistical model in Lee’s paper [5], which runs in sequential mode and processes the data passed to it by the Application Server. We use the Hadoop/MapReduce model as the statistical engine at the back end of our data analysis architecture (the statistical service engine solution), which has two parts: the R statistical analysis system and the Hadoop data management system. This model consists of three components: an R driver process operated by the data analyst, a Hadoop cluster that hosts the data and runs Jaql (and possibly also some R sub-processes), and an R-Jaql bridge that connects these two components [1]. The aim is to improve the performance, scalability, and functionality of the statistical jobs sent to the engine in a cluster or distributed environment. We also use the Message Passing Interface (MPI) and parallel DBMS approaches to support our model of parallel computation. Together, these yield a new system architecture for the statistical service engine solution of Lee’s paper.
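
The division of labor described above (a driver that submits a statistical job, workers that summarize the data partitions they hold, and a step that merges the partial results) can be illustrated with a minimal map/reduce sketch. It is only an illustration under stated assumptions: it is written in Python rather than R or Jaql, a local thread pool stands in for the Hadoop cluster, and the names `map_partition`, `combine`, and `parallel_mean_variance` are hypothetical and do not come from [1] or [5].

```python
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool; stands in for cluster workers


def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: merge two partial summaries component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def parallel_mean_variance(partitions, workers=4):
    """Driver: run the map step on each partition, then reduce the partials.

    Assumes at least one non-empty partition (otherwise the count is zero).
    """
    with Pool(workers) as pool:
        partials = pool.map(map_partition, partitions)
    n, total, total_sq = reduce(combine, partials, (0, 0.0, 0.0))
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance


if __name__ == "__main__":
    # Three partitions, as if the data were spread over three nodes.
    partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
    print(parallel_mean_variance(partitions))
```

Only the small per-partition summaries travel back to the driver, never the raw data, which is the property that lets this style of computation scale beyond main memory.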


    Table of Contents
    CHAPTER 1: INTRODUCTION
    CHAPTER 2: Literature Review
      2.1: Lee’s Statistical Service Engine Solution
      2.2: R-Statistical Engine
      2.3: Motivation
    CHAPTER 3: PARALLEL DBMS Approach of Parallel Computation
      3.1: Different Architectures of Parallel DBMS
        1. Shared-memory system
        2. Shared-disk system
        3. Shared-nothing system
      3.2: Types of Parallel DBMS
        a) Pipeline Parallelism
        b) Partition Parallelism
      3.3: Some Major Terminology of Parallel DBMS
        • Linear Speed-Up
        • Linear Scale-Up
      3.4: Advantages and Disadvantages of Parallel DBMS plus SQL Sample Code
        a) Advantages
        b) Disadvantages
        Sample Codes and Example of Parallel DBMS
        • Selection / projection / aggregation
        • Sorting
      3.5: The Architecture Overview of Parallel DBMS on Statistical Service Engine Solution
    CHAPTER 4: The Approach of Hadoop-Bridge with a High Query Language
      4.1: Jaql Query Language
      4.2: Hadoop Data Management Systems
        4.2.1 Some advantages of Hadoop/MapReduce
        4.2.2 Some disadvantages or limitations of Hadoop/MapReduce
        4.2.3 An example of Hadoop/MapReduce plus sample code
      4.3: The Architecture Overview of Hadoop-Bridge Statistical Service Engine Solution
    CHAPTER 5: Message Passing Interface (MPI) Approach of Parallel Computation
      5.1: Programming Model
        a) Shared-memory system
        b) Distributed-memory system
        c) Hybrid distributed-shared memory system
      5.2: Operations for Communications
        a) Point-to-Point Operations
        b) Collective Communication
      5.3: Sample Design and Examples
        a) Example
        b) The Simple Architecture of MPI
      5.4: Advantages and Disadvantages of Message Passing Interface (MPI)
        a) Advantages of MPI
        b) Disadvantages of MPI
      5.5: The Architecture Overview of MPI on Statistical Service Engine Solution
    CHAPTER 6: COMPARISON OF THE THREE MODELS
      6.1: Scalability
      6.2: Ease of Writing Code and Easy Understanding (Programming Model)
      6.3: Flexibility
      6.4: Performance and Efficiency
      6.5: Cost
      6.6: Fault Tolerance
      6.7: Brief Comparison in Tabular Form
    CHAPTER 7: CONCLUSION
    REFERENCES

    REFERENCES
    1. Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson (University of California, Santa Barbara, CA, USA, and IBM Almaden Research Center, San Jose, CA, USA); “Ricardo: Integrating R and Hadoop”; p. 1; 2010; ACM, New York, NY, USA
    2. Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson (University of California, Santa Barbara, CA, USA, and IBM Almaden Research Center, San Jose, CA, USA); “Ricardo: Integrating R and Hadoop”; p. 1; 2010; ACM, New York, NY, USA
    3. Surajit Chaudhuri and Umeshwar Dayal; “An Overview of Data Warehousing and OLAP Technology”; p. 1; March 1997; ACM SIGMOD Record, New York, NY, USA
    4. Surajit Chaudhuri (Microsoft Research, Redmond) and Umeshwar Dayal (Hewlett-Packard Labs, Palo Alto); “An Overview of Data Warehousing and OLAP Technology”; p. 1; March 1997; ACM SIGMOD Record, New York, NY, USA
    5. Rich C. Lee; “A Novel Data Analysis Service Oriented Architecture Using R”; March 2012; Proceedings of the International Symposium on Grids and Clouds (ISGC 2012), 26 February-2 March, Taipei, Taiwan. Published online at http://pos.sissa.it/cgi-bin/reader/conf.cgi?confid=153, id.8
    6. Boris Evelson; “Topic Overview: Business Intelligence”; November 2008; Forrester Research, Inc.
    7. B. P. Kumar, J. Selvam, V. Meenakshi, K. Kanthi, A. Suseela, and V. L. Kumar; “Business Decision Making, Management and Information Technology”; Ubiquity, vol. 2007, p. 4, 2007
    8. H. Arsham (2007); “Statistical Thinking for Managerial Decisions (9 ed.)”; retrieved July 14, 2007, from http://home.ubalt.edu/ntsbarsh/Business-stat/opre504.htm
    9. Leslie G. Valiant; “A Bridging Model for Parallel Computation”; August 1990; Communications of the ACM, Vol. 33; ACM, New York, NY, USA
    10. Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton; “MAD Skills: New Analysis Practices for Big Data”; August 2009; Proceedings of the VLDB Endowment (PVLDB), 2(2):1481–1492, USA
    11. Jeffrey Dean and Sanjay Ghemawat; “MapReduce: Simplified Data Processing on Large Clusters”; pp. 137–150; October 2004; in OSDI ’04: Proceedings of the 6th Symposium on Operating Systems Design and Implementation; USENIX Association, USA
    12. Robert J. Stewart, Phil W. Trinder, and Hans-Wolfgang Loidl; “JAQL: Query Language for JavaScript Object Notation (JSON)”; Retrieved 2009 from http://code.google.com/p/jaql
    13. Ross Ihaka and Robert Gentleman; “R: A Language for Data Analysis and Graphics”; May 1995 (published online 21 Feb 2012); American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of America

    14. W. N. Venables, D. M. Smith, R. Gentleman, and R. Ihaka; “An Introduction to R”; retrieved May 2013 from http://cran.r-project.org/doc/manuals/R-intro.html

    15. Wendy K. Murphrey (SAS Institute); “Connections JMP and ODBC”; April 2005; retrieved May 2013 from www.jmp.com/software/whitepapers/pdfs/102427_jmp_odbc.pd

    16. Robert I. Kabacoff; “Access to Database Management Systems (DBMS)”; 2012; retrieved from www.statmethods.net/input/dbinterface.html

    17. Powered by Google Project Hosting, “Jaql Query Language for JavaScript Object Notation”; Retrieved May 2013, from http://code.google.com/p/jaql/wiki/JaqlOverview

    18. Mike Olson; “HADOOP: Scalable, Flexible Data Storage and Analysis”; Spring 2010; IQT Quarterly

    19. Grant Mackey, Saba Sehrish, John Bent, Julio Lopez, Salman Habib, and Jun Wang; “Introducing Map-Reduce to High End Computing”; 2008; 3rd Petascale Data Storage Workshop (PDSW ’08)
    20. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, and Samuel Madden; “A Comparison of Approaches to Large-Scale Data Analysis”; 2009; Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data; ACM, New York, NY, USA
    21. William Gropp and Ewing Lusk; “Goals Guiding Design: PVM and MPI”; August 2002; 4th IEEE International Conference on Cluster Computing, San Diego, CA, USA

    22. P. H. Carns, W. B. Ligon III, S. P. McMillan, and R. B. Ross; “An Evaluation of Message Passing Implementations on Beowulf Workstations”; March 1999, Proceedings of the 1999 IEEE Aerospace Conference (Volume:5)

    23. Blaise Barney (Lawrence Livermore National Laboratory); “Parallel Computing Tutorial”; retrieved 2012 from www.mcs.anl.gov/mpi/ (https://computing.llnl.gov/tutorials/mpi/)

    24. Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, and Ulrich Mansmann; “State of the Art in Parallel Computing with R”; August 2009; Journal of Statistical Software, Vol. 31, No. 1

    25. Michael J. Quinn; Chapter 4 (Message-Passing Programming) of the textbook “Parallel Programming in C with MPI and OpenMP”; June 2003; McGraw-Hill Science/Engineering/Math, 1st Edition
    26. David DeWitt and Jim Gray; “Parallel Database Systems: The Future of High Performance Database Systems”; June 1992; Communications of the ACM, Volume 35; ACM, New York, NY, USA

    27. M. Tamer Özsu and Patrick Valduriez; “Distributed and Parallel Database Systems”; 1996; ACM Computing Surveys

    28. Michael I. Gordon, William Thies, and Saman Amarasinghe; “Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs”, December 2006; ASPLOS XII Proceedings of the 12th International Conference on Architectural support for programming languages and operating systems, pg 151-162, ACM New York, NY, USA

    29. M. Tamer Özsu and Patrick Valduriez; “Principles of Distributed Database Systems, Third Edition”; 2011, Springer New York, USA

    30. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin; “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads”; August 2009; Proceedings of the VLDB Endowment, Volume 2

    31. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, and Hao Lin (Facebook); “Hive: A Petabyte Scale Data Warehouse Using Hadoop”; 2010; retrieved June 2013 from http://www.facebook.com/note.php?note id=89508453919

    32. Natalie Gruska and Patrick Martin; “Integrating MapReduce and RDBMS”; pp. 212-223; 2010; Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research; ACM, New York, NY, USA
    33. Fei Chen, Meichun Hsu, HP Labs “A Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics”, pg 613-624, 2013, EDBT Proceedings of the 16th International Conference on Extending Database Technology. ACM New York, NY, USA

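    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)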
