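In-memory query system for scientific datasets (科學資料之內存運算查詢系統) | National Tsing Hua University Theses and Dissertations Repository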


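Author (graduate student):	邱宣德 Chiu, Hsuan Te
Thesis title:	科學資料之內存運算查詢系統 In-memory query system for scientific datasets
Advisor:	周志遠 Chou, Jerry
Committee members:	李哲榮, 蕭宏璋
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2015
Academic year of graduation:	103
Language:	English
Number of pages:	52
Chinese keywords:	索引、科學資料 (indexing, scientific data)
English keywords:	In-situ computing, query-driven analysis, indexing, scientifi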

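As the computing power of modern computers keeps growing and data volumes keep rising, the limited I/O bandwidth cannot scale proportionally. The widening performance gap between the two has turned the traditional post-simulation data processing method into a performance bottleneck. In-situ computing and query-driven data analysis are therefore important techniques for shortening data movement paths. We implemented an indexing system that combines bitmap indexing, spatial data reorganization, distributed shared memory, and location-aware parallel execution, and we evaluated it on a NERSC supercomputer, as a real environment, using two real scientific simulation datasets. The results show that, compared with traditional query systems that rely on parallel storage file systems, our system achieves a performance improvement of more than 10x.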

The growing gap between compute performance and I/O bandwidth, coupled with increasing data volumes, has made the traditional post-simulation data processing method a bottleneck. Hence, in-situ computing and query-driven data analysis are important techniques for minimizing data movement. By taking advantage of the growing memory capacity of supercomputers, we developed an in-memory query system for scientific data analysis. Our approach combines bitmap indexing, spatial data layout reorganization, distributed shared memory, and location-aware parallel execution. Our evaluations on a NERSC supercomputer using two real scientific datasets showed that we can aggregate the memory capacity of thousands of compute nodes to analyze a 750 GB simulation dataset without transferring data to remote nodes or storage systems. Compared with traditional solutions based on out-of-core parallel file systems, we achieve a speedup of more than 10x. Our system can therefore support interactive queries and serve as a vehicle for steering simulations.
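To make the bitmap indexing idea in the abstract concrete, the following is a minimal sketch of a binned bitmap index answering a range query entirely in memory. It is an illustration under stated assumptions, not the thesis's implementation: the function names `build_bitmap_index` and `range_query`, the equal-width bin edges, and the use of Python integers as bit vectors are all hypothetical choices made for this example.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin (hypothetical helper, not from the thesis).

    Bin i covers [bin_edges[i], bin_edges[i + 1]). Bit j of a bitmap is set
    when values[j] falls into that bin. Python ints serve as bit vectors.
    """
    bitmaps = [0] * (len(bin_edges) - 1)
    for row, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1
        if 0 <= b < len(bitmaps):  # values outside all bins are not indexed
            bitmaps[b] |= 1 << row
    return bitmaps


def range_query(values, bin_edges, bitmaps, lo, hi):
    """Return the sorted row ids whose value satisfies lo <= value < hi."""
    hits = 0        # rows in bins fully covered by the query range
    candidates = 0  # rows in boundary bins; these need a raw-data check
    for b, bm in enumerate(bitmaps):
        left, right = bin_edges[b], bin_edges[b + 1]
        if right <= lo or left >= hi:
            continue  # bin does not overlap the query range
        if lo <= left and right <= hi:
            hits |= bm
        else:
            candidates |= bm

    rows = []
    combined = hits | candidates
    row = 0
    while combined:
        if combined & 1:
            # Sure hits are accepted directly; candidates are verified.
            if (hits >> row) & 1 or lo <= values[row] < hi:
                rows.append(row)
        combined >>= 1
        row += 1
    return rows


values = [0.5, 2.5, 7.1, 3.9, 9.0, 4.2]
edges = [0, 2, 4, 6, 8, 10]
index = build_bitmap_index(values, edges)
result = range_query(values, edges, index, 2, 5)
```

Only rows in the boundary bins require access to the raw values; bins fully inside the query range are answered from the bitmaps alone, which is why such an index reduces data movement when both index and data stay in memory.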

Contents
Introduction
Related Work
1 Array-based database systems
2 Query and indexing techniques
3 In-memory & parallel processing
System Overview
1 Design Principle
2 Data Abstraction
3 API & Use Case Example
DSM Storage Layer
1 Data Partition
2 Data Replication
3 Swap Management
Variable Creation
Variable Transformation
Query & Indexing
1 Spatial Indexing
2 Spatial Query
Experimental Evaluation
1 Experimental Setup
2 Range Query Indexing & Query
3 Spatial Indexing & Query
4 Data Caching & Processing
5 Comparison with SciDB
Conclusion

                                


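Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)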
