
Graduate Student: Lu, Wei-I (呂威儀)
Thesis Title: Improving Multi-core Cache Utilization with Data Blocking and Thread Grouping
(利用資料區塊化及執行緒分組以提升多核心平台快取資源之使用效率)
Advisor: Shyu, Jyuo-Min (徐爵民)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Publication Year: 2010
Graduation Academic Year: 98 (2009-2010)
Language: Chinese
Pages: 83
Keywords: multi-core cache
Abstract: With the rapid development of multi-core platforms in recent years, increasingly complex usage environments have prompted us to re-examine the effectiveness of existing methods and to build a set of optimization techniques that let systems and programs realize their full performance. This study takes the processor cache as its target and applies software techniques to optimize cache resources on multi-core platforms. We combine the programmer's domain expertise, the compiler's automation, and information from the operating system, integrating key information across these three levels during program execution to help solve the cache allocation and utilization problems of parallel programs. This thesis proposes data blocking and thread grouping techniques and validates their effect on cache resources experimentally.


1 Introduction
  1.1 The impact of cache utilization on multi-core performance
  1.2 Motivation
2 Related Work
  2.1 OpenMP
  2.2 Cetus
  2.3 PAPI
3 Data Blocking and Thread Grouping Techniques
  3.1 Data Blocking
  3.2 Thread Grouping
  3.3 Combining Data Blocking and Thread Grouping
4 Experimental Results
  4.1 Benchmark programs
    4.1.1 Results compiled with GCC
    4.1.2 Results compiled with Open64
  4.2 Analysis
    4.2.1 Relationship between data size and cache size
    4.2.2 Effects of the compiler, data blocking, and thread grouping on the cache
5 Conclusions and Future Work
A Source code and selected experimental data


Full text: not authorized for public release (campus network, off-campus network, or National Digital Library of Theses and Dissertations in Taiwan).