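Author: | 張雍昌 Chang, Yung-Chang |
---|---|
Thesis title: | 基於冗餘路由器之晶片網路容錯架構分析與設計 Design and Analysis of Fault-tolerant Networks-on-Chip Using Spare Routers |
Advisor: | 邱瀞德 Chiu, Ching-Te |
Oral defense committee: | 林輝堂, 蔡仁松, 李政崑, 李哲榮, 范倫達 |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2020 |
Academic year of graduation: | 108 (ROC calendar) |
Language: | English |
Number of pages: | 63 |
Keywords (Chinese): | 晶片網路 (network-on-chip), 容錯設計 (fault-tolerant design) |
Keywords (English): | NoC, Fault-tolerant |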
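As CMOS process technology advances, the reliability of large-scale network-on-chip (NoC) architectures is becoming increasingly important. In a mesh-based NoC, failed routers and failed links can leave fully functional processing elements (PEs) unusable because they can no longer be reached. Likewise, a group of faulty routers can form isolated regions in the NoC that cannot communicate with one another, which may render the entire chip system inoperable. In this thesis, we propose a new failure model for analyzing large NoCs and derive a fault-tolerant architecture based on router-level redundancy (RLR). Unlike the traditional microarchitecture-level redundancy (MLR) approach, this architecture effectively mitigates the isolated-PE and isolated-region problems in large NoCs at lower cost. RLR is created by adding one spare router to each router set in the mesh, which also creates more connection paths between adjacent routers. To exploit these additional link resources, we propose two algorithms that reconfigure routers and links to bypass failed routers and links. After reconfiguration, the architecture retains the original mesh topology, so the proposed fault-tolerant structure needs no support from network-layer routing algorithms. Applied to a 2-D mesh, the proposed RLR design tolerates at most one faulty router per router set. We extend RLR to 3-D designs by adding a layer of spare routers. When the 2-D and 3-D designs are combined, faults in the 2-D plane can be recovered using the 3-D spare resources, and multiple faults in the third dimension can be recovered using the 2-D spare resources, so reconfiguration and recovery remain possible when multiple faults occur at the same time. We evaluate three fault-tolerance metrics: reliability, mean time to failure (MTTF), and yield. The experimental results show that as the NoC size increases, the fault-tolerance performance of RLR improves while the relative connection cost decreases. This property makes the RLR architecture well suited to fault-tolerant design of large-scale NoCs.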
Aggressive scaling of CMOS technology increasingly threatens the dependability of network-on-chip (NoC) architectures. In a mesh-based NoC, a faulty router or broken link may isolate a fully functional processing element (PE). In addition, a set of faulty routers may form isolated regions, which can degrade or disable the design. In this thesis, we propose a router-level redundancy (RLR) fault-tolerant scheme that differs from the traditional microarchitecture-level redundancy (MLR) approach and alleviates the problems of isolated PEs and isolated regions. RLR is created by adding one spare router to each router set in a mesh, which also diversifies the connection paths between adjacent routers. In a 2-D mesh, the proposed RLR scheme can tolerate at most one faulty router within a router set. We also extend RLR from 2-D to 3-D NoCs using an extra spare layer. RLR in 3-D NoCs can tolerate multiple non-overlapping failures in a vertical router set. Moreover, combining RLR in 2-D and 3-D NoCs provides mutual recovery: multiple failures in 2-D RLR router sets can be recovered by the 3-D RLR, and vice versa. To exploit these extra resources, we present two reconfiguration algorithms that detour around detected faulty routers and links. After reconfiguration, the original mesh topology is maintained, so the proposed architecture does not need any support from network-layer routing algorithms. The scheme is evaluated on three fault-tolerance metrics: reliability, mean time to failure (MTTF), and yield. The experimental results show that the fault-tolerance performance of RLR increases as the NoC grows, while the relative connection cost decreases. This characteristic makes our architecture suitable for large-scale NoC designs.
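To make the "at most one faulty router within a router set" property concrete, the following is a minimal sketch of how set-level reliability and MTTF could be estimated under simple textbook assumptions. It is not the failure model or the evaluation method of the thesis. The assumptions are ours: a router set of 2x2 routers plus one spare, independent and identical router failures, fault-free links and reconfiguration switches, a spare that ages like any other router, exponentially distributed lifetimes for the MTTF estimate, and a pessimistic baseline in which an unprotected mesh counts as failed once any router fails. All function names and parameters are hypothetical.

```python
from math import comb


def set_reliability(r: float, set_size: int = 4, spares: int = 1) -> float:
    """Probability that one router set still works.

    The set holds `set_size` routers plus `spares` spare routers and is
    assumed to work while at most `spares` of them have failed.
    `r` is the (assumed identical, independent) per-router reliability.
    """
    n = set_size + spares
    return sum(comb(n, k) * (1 - r) ** k * r ** (n - k) for k in range(spares + 1))


def mesh_reliability(r: float, rows: int, cols: int,
                     set_rows: int = 2, set_cols: int = 2, spares: int = 1) -> float:
    """Reliability of a rows x cols mesh tiled with protected router sets.

    Assumes the mesh works only if every router set works.
    """
    if rows % set_rows or cols % set_cols:
        raise ValueError("mesh dimensions must be multiples of the set dimensions")
    num_sets = (rows // set_rows) * (cols // set_cols)
    return set_reliability(r, set_rows * set_cols, spares) ** num_sets


def baseline_reliability(r: float, rows: int, cols: int) -> float:
    """Pessimistic baseline: an unprotected mesh fails if any router fails."""
    return r ** (rows * cols)


def set_mttf(failure_rate: float, set_size: int = 4, spares: int = 1) -> float:
    """MTTF of one router set with exponential lifetimes.

    Uses the standard k-out-of-n result for identical components:
    MTTF = (1 / failure_rate) * sum(1 / i for i in k..n).
    """
    n = set_size + spares
    return sum(1.0 / i for i in range(set_size, n + 1)) / failure_rate


if __name__ == "__main__":
    r = 0.99  # hypothetical per-router reliability
    for size in (4, 8, 16):
        print(size,
              round(baseline_reliability(r, size, size), 4),
              round(mesh_reliability(r, size, size), 4))
```

Under these assumptions, one spare raises the reliability of a four-router set from `r**4` to `r**5 + 5 * r**4 * (1 - r)`, and the set-level MTTF from `1 / (4 * failure_rate)` to `0.45 / failure_rate`. A real evaluation would also have to account for link faults and the reliability of the added switching logic.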