
Author: 黃世傑
Huang, Shih-Chieh
Thesis title: 以應用導向之晶片網路設計及其模擬機制
Application-Driven Design and Evaluation of Network-on-Chip for Many-Core Systems
Advisor: 金仲達
Oral defense committee: 楊佳玲
陳添福
郭峻因
蔡仁松
鍾葉青
金仲達
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
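Year of publication: 2013
Academic year of graduation: 101 (ROC calendar)
Language: English
Number of pages: 84
Chinese keywords: 晶片網路 (Network-on-Chip), 多核心 (many-core), 模擬 (simulation)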
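  • Network-on-Chip (NoC) has become the primary interconnect architecture for many-core systems, so understanding and exploring its characteristics in greater depth has become an important topic in recent research. In this thesis, we take the application's perspective: we study the characteristics of many-core applications and how they relate to the NoC. By exploiting the regular behavior of applications (such as loops and spatially and temporally repeated accesses), we design NoCs suited to them. The main contributions of this thesis are: 1. using application characteristics to help the NoC save power more effectively; 2. using application characteristics to design a simulation mechanism that can quickly evaluate NoC performance. Compared with previous approaches, our experiments show that taking the application's perspective yields better results than a purely hardware-centric perspective.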


    As the number of cores on a chip continues to grow, Network-on-Chip (NoC) has become the primary choice for interconnecting the nodes, so exploring NoC design is important for fully utilizing their computing power. Unfortunately, when the interconnect fabric is designed, the information available from the applications running on top of the NoC is usually ignored. Instead, designers mostly rely on fixed parameters decided at a very early design stage. Even when dynamic runtime adaptation is provided, it relies only on simple counters available in the hardware. As a result, the NoC design is usually conservative and may not even fit the target applications in the end.

    In fact, information from the application can dramatically improve NoC design. In this thesis, we call this perspective application-driven NoC design. Based on this idea, we address two important topics in NoC design. First, we exploit the traffic patterns of the applications running on top of the on-chip network to reduce its power consumption. In previous work, one promising solution is to dynamically adjust the working frequencies and voltages (DVFS) of the switches, as well as of the links between switches, to match the traffic flows. The question is when to adjust and by how much. We propose a hardware mechanism that proactively adjusts the frequencies/voltages of switches and/or links in the NoC by predicting the application's runtime traffic. Unlike previous approaches, our approach predicts the traffic from the implicit program behaviors visible in the network, e.g., loops and repetitive transmissions, which makes the DVFS strategy more proactive and accurate. Second, we propose a causality-aware simulation methodology for more accurate evaluation. By observing application behavior, the causalities between each pair of nodes in the system can be extracted and embedded into the trace logs, so simulation accuracy improves substantially compared with conventional trace-driven simulation. Moreover, we take advantage of the causalities to compress the huge trace logs and design a corresponding traffic generator, named Attackboard. Attackboard records the interactions between each pair of processor and router as several small causality-aware tables and dynamically generates traffic injections into the NoC on the fly to mimic the real application behavior at runtime. The generated traffic is similar to that of the original trace but requires much less storage space.
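    The proactive idea described above, predicting upcoming traffic from repeating patterns and then choosing a frequency level to match, can be illustrated with a small sketch. This is not the thesis's ATPT design: the class name PatternTrafficPredictor, the history length of 3, the quantized load levels 0 to 3, and the frequency table are all assumptions made for illustration only.

```python
from collections import deque

# Hypothetical frequency table (MHz), one entry per quantized load level.
FREQ_TABLE_MHZ = [500, 1000, 1500, 2000]


class PatternTrafficPredictor:
    """Illustrative history-based predictor for one source-destination pair."""

    def __init__(self, history_len=3):
        # Recent quantized load levels, oldest first.
        self.history = deque(maxlen=history_len)
        # Maps a full history window to the load level that followed it last time.
        self.pattern_table = {}

    def predict(self):
        """Predict the load level of the next interval."""
        key = tuple(self.history)
        if key in self.pattern_table:
            return self.pattern_table[key]
        # Unknown pattern: fall back to last-value prediction (0 if no history).
        return self.history[-1] if self.history else 0

    def update(self, observed_level):
        """Record the load level actually observed in the interval that just ended."""
        key = tuple(self.history)
        if len(key) == self.history.maxlen:
            self.pattern_table[key] = observed_level
        self.history.append(observed_level)


def frequency_for_level(level):
    """Map a predicted load level to a frequency, clamping to the highest entry."""
    return FREQ_TABLE_MHZ[min(level, len(FREQ_TABLE_MHZ) - 1)]


def run_predictor(trace, history_len=3):
    """Return the prediction made before each interval of the trace."""
    predictor = PatternTrafficPredictor(history_len)
    predictions = []
    for observed in trace:
        predictions.append(predictor.predict())
        predictor.update(observed)
    return predictions


if __name__ == "__main__":
    # A loop-like traffic pattern: the same three load levels repeat.
    trace = [0, 3, 1] * 5
    for observed, predicted in zip(trace, run_predictor(trace)):
        print(observed, predicted, frequency_for_level(predicted))
```

    On the repeating trace in the example, the sketch mispredicts while it is still filling its history and pattern table, and predicts every interval correctly from the seventh interval onward. A last-value predictor would be wrong in every interval after the first on the same trace, because consecutive load levels always differ.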

    1 Introduction
    2 Application-Driven End-to-End Traffic Predictions for Low Power NoC Design
    2.1 A Motivating Example
    2.2 Problem Formulation
    2.3 System Design of ATPT
    2.3.1 Basic Ideas
    2.3.1.1 Design of prediction tables
    2.3.2 Design of Selector for Predictors
    2.4 Link-DVFS by ATPT-Based Predictor
    2.4.1 Superposing Link Utilization
    2.4.2 Strategies for Link-DVFS
    2.4.3 Discussions
    2.5 Implementation Considerations
    2.5.1 Table Implementations
    2.5.1.1 Reducing the size of the 1st-level table
    2.5.1.2 Sharing the 2nd-level table
    2.5.1.3 Quantizing the values in the 2nd-level table
    2.5.1.4 Entry replacement in the 2nd-level table
    2.5.2 Aggregation for Traffic Predictions
    2.5.2.1 Naïve broadcasting
    2.5.2.2 Control network
    2.5.2.3 Packet piggybacking
    2.5.3 Area Occupancy
    2.5.4 Energy Consumption of ATPT-Based Predictors
    2.5.4.1 Energy consumption of ATPTs
    2.5.4.2 Energy consumption of DVFS switching
    2.6 Evaluation
    2.6.1 Evaluation for ATPT-Based Predictors
    2.6.1.1 Pattern learning and misprediction recovery
    2.6.2 A Case Study for DVFS on Links
    2.6.2.1 Simulation setup
    2.6.2.2 Methodology
    2.6.2.3 The accuracy of link voltage and frequency level adjustment
    2.6.2.4 Communication links power reduction results
    2.7 Related Works
    2.8 Summary
    3 Attackboard: A Novel Dependency-Aware Traffic Generator for Exploring NoC Design Space
    3.1 Introduction
    3.2 Related Works
    3.3 Methodology
    3.3.1 System Architecture
    3.3.2 Structure of Attackboard’s attackboard
    3.3.3 Packets Dependencies Extraction
    3.3.4 Simulation Methodology
    3.4 Experimental Evaluation
    3.4.1 Simulation Setup
    3.4.2 Methodology
    3.4.3 Evaluation Results
    3.4.3.1 Storage space overhead
    3.4.3.2 Traffic behavior
    3.5 Case Study
    3.5.1 NoC Design Space Exploration
    3.5.2 Parallel Object Detection
    3.6 Summary
    4 Conclusion


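    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)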
