
Graduate student: Ping-Hsiung Chiu (邱柄雄)
Thesis title: Virtual Caching with Shared-Only TLB (虛擬快取及全共享轉譯後備緩衝區)
Advisor: Yarsun Hsu (許雅三)
Committee members: Tai-Lang Jong (鐘太郎), Jenq-Kuen Lee (李政崑), Ren-Song Tsay (蔡仁松)
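Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: English
Number of pages: 66
Chinese keywords: 虛擬快取 (virtual cache), 同義詞 (synonym), 共享轉譯後備緩衝區 (shared TLB), 反向轉譯 (reverse translation)
English keywords: virtual cache, virtual caching, synonym, shared TLB, reverse translation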
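  • Traditional virtual cache designs lack a holistic view that considers the
    cache hierarchy, the translation lookaside buffer (TLB), and the cache
    coherence protocol together. As a result, most virtual cache designs solve
    the synonym problem by forbidding synonyms from appearing, an approach that
    does not make effective use of the TLB. The main goal of this thesis is to
    realize a virtual cache design under a shared-only TLB architecture.
    In our proposed scheme, every TLB can be accessed by every processing unit,
    and the TLBs are extended with reverse translation and hardware support for
    TLB consistency. In addition, our design allows synonyms to exist in the
    virtual caches and keeps them synchronized through the cache coherence
    protocol. Consequently, the step in traditional virtual cache designs that
    checks whether the physical address causing a cache miss is already present
    in the virtual cache can be eliminated. We also propose two optimizations
    that improve the performance of virtual caches under the shared-only TLB
    architecture: a page walk cache, and alignment of TLBs and last-level caches
    (LLCs). The page walk cache is implemented purely in hardware, whereas the
    alignment of TLBs and LLCs requires operating system support.
    Finally, our experimental results show that, without affecting system
    performance and at the cost of 2.5% additional cache area, we save 18% of
    dynamic power consumption on average, and sharing the TLBs reduces TLB
    misses by 45%.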


    Traditional virtual cache designs lack a global view of the cache hierarchy, the TLB, and
    the coherence protocol. As a result, most designs forbid synonyms from appearing in
    virtual caches, and their reverse translation mechanisms do not take advantage of the
    TLB. This work introduces a new scheme for implementing virtual caches with
    shared-only TLBs.
    In our proposed scheme, TLBs are shared by all cores and enhanced to support reverse
    translation and hardware TLB consistency. In addition, synonyms are allowed to be
    present in virtual caches, so it is unnecessary to detect synonyms on cache misses;
    synonym consistency is ensured by the coherence protocol with TLB support. Moreover,
    we develop two optimizations: one is a pure hardware solution, and the other requires
    OS support.
    Finally, we show that, at the cost of a 2.5% area overhead, our proposed scheme saves
    18% of dynamic power and reduces TLB misses by 45% without harming performance.
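
To make the terms in the abstract concrete, the following is a minimal sketch and not the design described in the thesis: a single TLB shared by all cores that supports forward translation (virtual to physical) and reverse translation (physical back to every virtual alias), so that synonyms, meaning different virtual pages mapped to the same physical page, can be located from a physical address. The class name `SharedTLB`, its method names, the use of an address-space identifier (`asid`), and the 4 KiB page size are all assumptions made for illustration.

```python
from collections import defaultdict

PAGE_SHIFT = 12                       # assumption: 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class SharedTLB:
    """Toy shared TLB with forward and reverse lookup (hypothetical model)."""

    def __init__(self):
        self.forward = {}                # (asid, vpn) -> ppn
        self.reverse = defaultdict(set)  # ppn -> {(asid, vpn), ...}

    def insert(self, asid, vpn, ppn):
        """Install or replace a mapping, keeping both directions in sync."""
        key = (asid, vpn)
        old_ppn = self.forward.get(key)
        if old_ppn is not None:
            self.reverse[old_ppn].discard(key)
        self.forward[key] = ppn
        self.reverse[ppn].add(key)

    def translate(self, asid, vaddr):
        """Forward translation; returns None on a TLB miss."""
        ppn = self.forward.get((asid, vaddr >> PAGE_SHIFT))
        if ppn is None:
            return None
        return (ppn << PAGE_SHIFT) | (vaddr & OFFSET_MASK)

    def reverse_translate(self, paddr):
        """Reverse translation; returns every (asid, vaddr) alias of paddr."""
        offset = paddr & OFFSET_MASK
        keys = self.reverse.get(paddr >> PAGE_SHIFT, set())
        return sorted((asid, (vpn << PAGE_SHIFT) | offset) for asid, vpn in keys)


# Two address spaces map different virtual pages to the same physical page,
# which makes the two virtual addresses synonyms of each other.
tlb = SharedTLB()
tlb.insert(asid=1, vpn=0x10, ppn=0x80)
tlb.insert(asid=2, vpn=0x20, ppn=0x80)
aliases = tlb.reverse_translate(0x80123)
```

In this toy model, a coherence request that carries a physical address could use `reverse_translate` to find the virtual addresses under which the line may be cached. How the thesis actually organizes its TLB entries and coherence messages is not stated in the abstract.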

    Abstract
    Contents
    1 Introduction
      1.1 Motivation
      1.2 Objective
      1.3 Organization
    2 Related Work
    3 Background
      3.1 Cache Coherence Protocols
      3.2 Virtual Memory
        3.2.1 Handling TLB Miss
        3.2.2 TLB Consistency
      3.3 Virtual Cache
        3.3.1 Cache types
        3.3.2 Reverse Translation
        3.3.3 Homonym and Synonym Problems
      3.4 TLB Placement
    4 Design and Implementation
      4.1 Modifications
        4.1.1 Modifications of the private caches
        4.1.2 The directory and the LLC
        4.1.3 Requests directed toward/issued by the LLC
        4.1.4 Synonym-aware MESI coherence protocol
        4.1.5 Modifications of the TLB
      4.2 Ensuring correctness
      4.3 Other Optimizations
        4.3.1 Page Walk Cache
        4.3.2 Alignment of TLBs and LLCs
    5 Evaluation
      5.1 Hardware Configurations
      5.2 Methodology
      5.3 Results
        5.3.1 TLB statistics
        5.3.2 Cache performance
        5.3.3 System performance
        5.3.4 Power and Area
        5.3.5 Analysis of L3 cache accesses
    6 Conclusion and Future Work
      6.1 Conclusion
      6.2 Future Work
    Bibliography


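    Full text release date: full text not authorized for public release (on-campus network)
    Full text release date: full text not authorized for public release (off-campus network)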
