
Graduate Student: Lu, Wei-I (呂威儀)
Thesis Title: Improving Multi-core Cache Utilization with Data Blocking and Thread Grouping
(利用資料區塊化及執行緒分組以提升多核心平台快取資源之使用效率)
Advisor: Shyu, Jyuo-Min (徐爵民)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Publication Year: 2010
Graduation Academic Year: 98 (2009-2010)
Language: Chinese
Pages: 83
Keywords: multi-core cache
Abstract: With the rapid development of multi-core platforms in recent years, increasingly complex usage environments have prompted us to re-examine the effectiveness of existing methods and to build a set of optimization techniques that let systems and programs realize their full performance. This study takes the processor cache as its target and applies software techniques to optimize cache resources on multi-core platforms. We combine the programmer's domain expertise, the compiler's automation, and information from the operating system, integrating key information across these three levels during program execution to help solve the cache allocation and utilization problems of parallel programs. This thesis proposes data blocking and thread grouping techniques and validates their effect on cache resources experimentally.


1 Introduction
  1.1 The impact of cache utilization on multi-core performance
  1.2 Motivation
2 Related Work
  2.1 OpenMP
  2.2 Cetus
  2.3 PAPI
3 Data Blocking and Thread Grouping Techniques
  3.1 Data Blocking
  3.2 Thread Grouping
  3.3 Combining Data Blocking and Thread Grouping
4 Experimental Results
  4.1 Benchmark programs
    4.1.1 Results compiled with GCC
    4.1.2 Results compiled with Open64
  4.2 Analysis
    4.2.1 Relationship between data size and cache size
    4.2.2 Effects of the compiler, data blocking, and thread grouping on the cache
5 Conclusions and Future Work
A Source code and selected experimental data


Full text: not authorized for public release (campus network, off-campus network, or National Digital Library of Theses and Dissertations in Taiwan).