
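Graduate student: 薛瑞夫
Sheriffo Ceesay
Thesis title: Moving Towards Pure ANSI SQL in NoSQL
Advisor: 鍾葉清
Oral defense committee: 徐慰中
李哲榮
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication: 2013
Academic year of graduation: 101
Language: English
Number of pages: 60
Chinese keywords: cloud computing, software framework, database, database query language
Foreign-language keywords: Hadoop, MapReduce, NoSQL, Hive, HBase, ANSI SQL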
  • Moving Towards Pure ANSI SQL in NoSQL
    The main focus of this master’s thesis is to narrow the usability gap between
    the newer, more distributed data processing platforms (HBase, Cassandra,
    MapReduce, etc.) and the traditional, less distributed data processing platforms
    (e.g. RDBMSs).
    A lot of work has been done in this area, e.g. Hive and Pig, but these are not
    pure SQL.
    Over the past few decades, RDBMSs and data warehouses were the only choice
    of data processing platform with a rich set of data processing tools, e.g. SQL.
    Recently, however, the variety, velocity and volume of data have made these
    traditional platforms less efficient at handling such data; hence
    the need for more efficient data stores and processing platforms.
    Although NoSQL data stores have lived up to expectations for storing and
    processing large datasets, the process may not be as simple and convenient
    as in traditional databases. One common drawback of NoSQL databases is the lack
    of the much-loved SQL language.
    This thesis therefore focuses on this new type of data store, known as
    NoSQL. Specifically, we focus on HBase, a column-oriented,
    BigTable-like database, as our choice of NoSQL store.
    Because NoSQL databases are becoming very popular, we propose
    data mapping methods that can make migration from relational databases to
    NoSQL databases less daunting.
    Since this migration is from RDBs, which have a rich set of procedures (i.e. SQL) to
    access and manipulate data, we extend our work to bridge the gap between
    SQL and NoSQL by providing methods for using pure ANSI SQL to manipulate
    the underlying data stored in our NoSQL store (HBase).



    1 Introduction
      1.1 Background Studies
        1.1.1 Data distribution
        1.1.2 Data locality
        1.1.3 Key-Value pair orientation
      1.2 Related Work
      1.3 Problem Definition
        1.3.1 Size of Data
        1.3.2 Heterogeneous Nature of Data
        1.3.3 Chapter Summary
    2 Data Mapping
      2.1 Data Model
      2.2 Transforming RDBMS into HBase
        2.2.1 Schema Transformation
        2.2.2 Rule of Thumb
      2.3 Chapter Summary
    3 Basic Implementation
      3.1 Architecture
        3.1.1 Life Cycle of MapReduce Query Execution
      3.2 SELECT Statement General Structure
      3.3 WHAT TO SELECT AND KEYWORDS
        3.3.1 Table and Column Aliasing
        3.3.2 ALL
        3.3.3 Projection and Filtering (IN)
        3.3.4 Projection and Filtering (LIKE)
        3.3.5 DISTINCT
      3.4 SUMMARY
    4 Advanced Implementation
      4.1 JackHare
        4.1.1 JackHare and our Work
        4.1.2 JackHare Architecture
      4.2 Advance MapReduce to SQL Implementation
        4.2.1 General Structure
        4.2.2 ORDERING AND SORTING
        4.2.3 Descending and Ascending Ordering
        4.2.4 Multi-Column Order By
        4.2.5 AGGREGATION
        4.2.6 NESTED and SUB-QUERIES or COMPOSITION QUERIES
        4.2.7 Model for Composite Queries
        4.2.8 JOIN Implementation and Cross Join Optimization
        4.2.9 SET OPERATIONS and it’s Optimization
      4.3 Summary
    5 Experiments
      5.1 Setup A
      5.2 Benchmark Results of Setup A
        5.2.1 Effect of Data Size
      5.3 Setup B
        5.3.1 Benchmark Results of Setup B
      5.4 Functional Benchmarking
      5.5 Findings
    6 Conclusion and Future Work


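    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)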
