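Graduate student: 邱宣德
Chiu, Hsuan Te
Thesis title: 科學資料之內存運算查詢系統
In-memory query system for scientific datasets
Advisor: 周志遠
Chou, Jerry
Oral defense committee: 李哲榮
蕭宏璋
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: English
Number of pages: 52
Keywords (Chinese): 索引科學資料 (indexing scientific data)
Keywords (English): In-situ computing, query-driven analysis, indexing, scientifi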
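  • Computing power keeps growing, but limited I/O bandwidth cannot scale
    in proportion to ever-increasing data volumes. The widening
    performance gap between the two has made the traditional
    post-simulation data processing method a performance bottleneck.
    In-situ computing and query-driven data analysis are therefore
    important techniques for shortening the data movement path. We
    implemented an indexing system that combines bitmap indexing, spatial
    data layout reorganization, distributed shared memory, and
    location-aware parallel execution, and evaluated it on two real
    scientific simulation datasets using a NERSC supercomputer as a
    real-world environment. The results show that, compared with
    traditional query systems that rely on parallel file systems, our
    system achieves a speedup of more than 10x.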


    The growing gap between compute performance and I/O bandwidth, coupled with increasing data volumes, has turned the traditional post-simulation data processing method into a bottleneck. Hence in-situ computing and query-driven data analysis are important techniques for minimizing data movement. By taking advantage of the growing memory capacity on supercomputers, we developed an in-memory query system for scientific data analysis. Our approach is a combination of bitmap indexing, spatial data layout re-organization, distributed shared memory, and location-aware parallel execution. Our evaluations on a NERSC supercomputer using two real scientific datasets showed that we can aggregate the memory capacity of thousands of compute nodes to analyze a 750GB simulation dataset without transferring data to remote nodes or storage systems. Compared to traditional solutions based on out-of-core parallel file systems, we achieve more than a 10x speedup. Therefore, our system can support interactive queries and serve as a vehicle for steering simulations.
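
    The abstract names bitmap indexing as one component of the system but does not describe how it is implemented. As an illustration only, the following minimal Python sketch shows the general idea of a binned bitmap index answering a range query. The function names, the bin edges, and the use of Python integers as bitmaps are assumptions made for this example and are not taken from the thesis.

    ```python
    from bisect import bisect_right


    def build_bitmap_index(values, bin_edges):
        """Build one bitmap per bin (hypothetical helper, not the thesis API).

        Bin 0 holds values below bin_edges[0]; bin k (k >= 1) holds values in
        [bin_edges[k-1], bin_edges[k]); the last bin holds the remainder.
        Each bitmap is a Python int whose bit i is set when values[i] is in the bin.
        """
        bitmaps = [0] * (len(bin_edges) + 1)
        for i, v in enumerate(values):
            bitmaps[bisect_right(bin_edges, v)] |= 1 << i
        return bitmaps


    def range_query(values, bin_edges, bitmaps, lo, hi):
        """Return the row ids i with lo <= values[i] < hi, in ascending order."""
        first = bisect_right(bin_edges, lo)
        last = bisect_right(bin_edges, hi)
        # OR together the bitmaps of every bin that overlaps the query range.
        candidates = 0
        for b in range(first, last + 1):
            candidates |= bitmaps[b]
        # Bins are coarse, so re-check each candidate against the raw value.
        hits = []
        i = 0
        while candidates:
            if candidates & 1 and lo <= values[i] < hi:
                hits.append(i)
            candidates >>= 1
            i += 1
        return hits


    values = [0.5, 2.5, 1.2, 3.9, 2.0, 0.1]
    bin_edges = [1.0, 2.0, 3.0]
    bitmaps = build_bitmap_index(values, bin_edges)
    result = range_query(values, bin_edges, bitmaps, 1.0, 3.0)
    ```

    For simplicity the sketch re-checks every candidate row. Bitmap index implementations commonly re-check only the rows in bins that straddle the query boundaries, and they compress the bitmaps.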

    Contents
    1 Introduction
    2 Related Work
      2.1 Array-based database systems
      2.2 Query and indexing techniques
      2.3 In-memory & parallel processing
    3 System Overview
      3.1 Design Principal
      3.2 Data Abstraction
      3.3 API & Use Case Example
    4 DSM Storage Layer
      4.1 Data Partition
      4.2 Data Replication
      4.3 Swap Management
    5 Variable Creation
    6 Variable Transformation
    7 Query & Indexing
      7.1 Spatial Indexing
      7.2 Spatial Query
    8 Experimental Evaluation
      8.1 Experimental Setup
      8.2 Range Query Indexing & Query
      8.3 Spatial Indexing & Query
      8.4 Data Caching & Processing
      8.5 Compared with SciDB
    9 Conclusion

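    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)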
