Dynamic Data Migration to Eliminate Bank-level Interference for Stencil Applications in Multicore Systems


Graduate student:	陳衍昊 Chen, Yen-Hao
Thesis title:	考量多核系統下執行的模板應用程式利用動態資料搬移來消除記憶庫干擾 Dynamic Data Migration to Eliminate Bank-level Interference for Stencil Applications in Multicore Systems
Advisor:	黃婷婷 Hwang, TingTing
Oral defense committee:	金仲達 King, Chung-Ta; 黃俊達 Huang, Juinn-Dar
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2014
Academic year of graduation:	102 (ROC calendar)
Language:	English
Number of pages:	40
Keywords (Chinese):	動態排程、模板應用程式、記憶庫衝突、記憶體干擾
Keywords (English):	dynamic scheduling, stencils, bank conflicts, memory interference


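Stencil applications repeatedly apply the same computation to each point, using the point itself and its neighboring points. Recent automatic transformation compiler techniques can efficiently generate tiled parallel stencil applications. Dynamically scheduling parallel stencil applications substantially improves system performance; however, because fewer cores sit idle and more memory requests are sent to memory at the same time, the memory interference problem worsens. The traditional OS page coloring method partitions memory pages, but it cannot effectively eliminate memory interference in dynamically scheduled parallel stencil applications. Experimental results show that, compared with the original dynamically scheduled parallel stencil applications, our method improves system performance by 7% on a system with 8 cores and 4 memory banks, and by 9.3% on a system with 16 cores and 4 memory banks.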

A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Modern automatic transformation compiler frameworks can generate efficient tiled parallel stencil code. Dynamically scheduling parallel stencils significantly improves system performance. However, the memory contention problem is exacerbated because fewer cores are idle and more memory requests are sent to DRAM in the same period of time. The traditional OS page coloring method, which partitions memory pages in advance, cannot alleviate memory contention in dynamically scheduled parallel stencils. To address this issue, we propose a new dynamic data migration method based on software/hardware cooperation. Experimental evaluation on an 8-core x86 system shows that our method improves system performance by 7% compared with dynamically scheduled stencils on a system with 8 cores and 4 memory banks, and by 9.3% on a system with 16 cores and 4 memory banks.
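The stencil pattern defined at the start of the abstract can be illustrated with a minimal sketch. It assumes a one-dimensional grid, a 3-point averaging kernel, and fixed boundary values; none of these choices come from the thesis, and the function names `stencil_step` and `run_stencil` are hypothetical.

```python
def stencil_step(grid):
    """Apply one sweep of a 3-point averaging stencil (assumed kernel).

    Each interior point becomes the mean of itself and its two
    neighbors; the two boundary points are left unchanged.
    """
    new_grid = list(grid)  # read from the old grid, write to a copy
    for i in range(1, len(grid) - 1):
        new_grid[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new_grid


def run_stencil(grid, steps):
    """Repeat the same update for a number of time steps."""
    for _ in range(steps):
        grid = stencil_step(grid)
    return grid


if __name__ == "__main__":
    initial = [0.0, 0.0, 9.0, 0.0, 0.0]
    print(run_stencil(initial, 1))  # [0.0, 3.0, 3.0, 3.0, 0.0]
    print(run_stencil(initial, 2))  # [0.0, 2.0, 3.0, 2.0, 0.0]
```

Because every point reads its neighbors from the previous time step, each sweep touches the whole grid again. That repeated access is what makes memory traffic, and therefore bank-level contention, relevant once many cores run tiles of such a loop in parallel.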

Introduction
Previous Work
1 OS-Level Thread Scheduling
2 Memory Page Mapping Scheme
3 Compiler-Based Optimization
4 Scheduling Algorithms for Memory Controller
Motivation
Methodology
1 Overview of System Flow
2 Updated-and-Reused Aware Page Allocation Policy in OS
3 Migrate-On-Write
4 Memory Controller for Bank-level Interference Elimination
Experimental Results
1 Simulation Environment
2 Comparison Results
3 Effect of the Number of Entries in Mapping Table
4 Effect of the Number of Banks in a Group
5 Scalability with Cores
Conclusion


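Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)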
