Graduate student: | 李祁鴻 Lee, Chi Hung |
---|---|
Thesis title: | 透過實作資料局部性排程演算法優化Hadoop-MapReduce之效能 Optimizing the Performance of Hadoop-MapReduce Through Implemented Data Locality Scheduling Algorithm |
Advisor: | 石維寬 Shih, Wei-Kuan |
Oral defense committee: | 呂政修 Leu, Jenq-Shiou; 衛信文 Wei, Hsin-Wen; 徐讚昇 Hsu, Tsan-sheng; 石維寬 Shih, Wei-Kuan |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
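Year of publication: | 2013 |
Academic year of graduation: | 101 (ROC calendar) |
Language: | Chinese |
Number of pages: | 32 |
Keywords (Chinese): | 雲端運算 (cloud computing), 資料局部性 (data locality) |
Keywords (English): | Hadoop, MapReduce |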
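Cloud computing is increasingly popular and continues to develop in architecture, networking, and software. Hadoop-MapReduce, a widely used software framework, processes large volumes of data in parallel on a distributed cluster. Because its processing nodes can be scaled out to a very large number, Hadoop-MapReduce offers a processing platform with substantial computing power. Network traffic has long been the biggest bottleneck in data-intensive computing and has a significant impact on the performance of data-parallel systems. This bottleneck stems from limited network bandwidth, which makes network transfer much slower than disk access. Good data locality, however, reduces network traffic and improves the performance of data-intensive HPC (high-performance computing) systems. Hadoop's scheduler has the drawback of not considering data locality when assigning resources, so this thesis proposes a locality-aware scheduling algorithm for Hadoop-MapReduce. First, we propose a mathematical model of the weight of data interference for Hadoop scheduling. Second, we combine a data locality scheduling algorithm with this weight to provide locality-aware resource assignment. Finally, we set up three physical machines running Xen Cloud Platform, each hosting two virtual machines with Hadoop installed, and use simulation to verify the performance of the algorithm.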
Cloud computing has become increasingly popular and continues to develop in architecture, software, and networking. Hadoop-MapReduce is a common software framework for processing parallelizable problems over big datasets on a distributed cluster. Hadoop-MapReduce in the cloud can scale incrementally in the number of processing nodes, so it is designed to provide a processing platform with powerful computation. Network traffic has always been one of the most important bottlenecks in data-intensive computing, and network latency significantly degrades performance in data-parallel systems. The bottleneck is caused by limited network bandwidth: network transfer is much slower than disk access. Good data locality can therefore reduce network traffic and increase performance in data-intensive HPC systems. However, Hadoop's scheduler does not adequately account for data locality in resource assignment. This thesis presents a locality-aware scheduling algorithm for the Hadoop-MapReduce scheduler. First, we propose a mathematical model of the weight of data interference in the Hadoop scheduler. Second, we present an algorithm that uses the weight of data interference to provide data-locality-aware resource assignment in the Hadoop scheduler. Finally, we build an experimental environment of three physical machines running Xen Cloud Platform, each hosting two virtual machines with Hadoop installed, and run simulations to verify the performance of the locality-aware scheduling algorithm.
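The abstract does not define the weight of data interference, so the thesis's algorithm cannot be reproduced here. As a rough illustration of why locality-aware assignment reduces network traffic, the following is a minimal sketch of a generic greedy, locality-first placement. The function name `assign_tasks`, the task and node labels, and the "most constrained task first" ordering are all assumptions made for illustration; they are not taken from the thesis or from Hadoop's scheduler API.

```python
def assign_tasks(block_locations, free_slots):
    """Greedy locality-first placement (illustrative, not the thesis's algorithm).

    block_locations: dict mapping task -> set of nodes that hold a replica
                     of the task's input block.
    free_slots:      dict mapping node -> number of free map slots.
    Returns (assignment, remote): assignment maps task -> node, and remote
    lists the tasks that had to read their input over the network.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    remote = []
    # Assumption: place tasks with the fewest replica holders first, since
    # they have the fewest chances of getting a local slot.
    ordered = sorted(block_locations, key=lambda t: (len(block_locations[t]), t))
    for task in ordered:
        # Nodes that hold the input block and still have a free slot.
        local = [n for n in sorted(block_locations[task]) if slots.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: slots[n])
        else:
            # No local slot left: fall back to any node with capacity.
            candidates = [n for n in sorted(slots) if slots[n] > 0]
            if not candidates:
                raise ValueError("no free slots left for task %r" % (task,))
            node = max(candidates, key=lambda n: slots[n])
            remote.append(task)
        slots[node] -= 1
        assignment[task] = node
    return assignment, remote


if __name__ == "__main__":
    # Hypothetical three-node cluster with one free map slot per node.
    blocks = {"t1": {"vm1", "vm2"}, "t2": {"vm1"}, "t3": {"vm3"}}
    assignment, remote = assign_tasks(blocks, {"vm1": 1, "vm2": 1, "vm3": 1})
    print(assignment, remote)
```

In this hypothetical input, a scheduler that ignores locality and hands `t1` the slot on `vm1` forces `t2` to fetch its block over the network, whereas the locality-first ordering keeps all three tasks on nodes that already hold their data.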