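Author: 涂詠甯 (Tu, Yung-Ning)
Thesis title: A Local Computing Cell Based 6T SRAM Computing-in-Memory Macro for Deep Neural Network Data Processing
Thesis title (Chinese): 應用於深度神經網絡資料處理以基於區域計算單元6T靜態隨機存取記憶體之記憶體內運算巨集
Advisor: 張孟凡 (Chang, Meng-Fan)
Oral defense committee: 呂仁碩 (Liu, Ren-Shuo), 邱瀝毅 (Chiou, Lih-Yih)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2019
Academic year of graduation: 108 (ROC calendar)
Language: English
Number of pages: 55
Keywords: SRAM (static random-access memory), CIM (computing-in-memory), AI (artificial intelligence), CNN (convolutional neural network)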
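    In recent years, the rapid development of deep neural networks and the widespread use of edge devices have greatly increased the volume of data transfer. In the traditional von Neumann architecture, however, moving data between the central processing unit and memory consumes most of the energy; this is known as the von Neumann bottleneck. Computing-in-memory has therefore become a development goal for addressing this bottleneck.
    Computing-in-memory provides both computation and storage. The computation is completed before the data are read out, and the result can be transferred directly, which reduces the movement of large amounts of data. To this end, the design concept of this study is to input neural network feature maps and activate them in parallel so as to realize multiply-and-accumulate (MAC) operations.
    This study proposes an SRAM-CIM macro that uses 6T SRAM for multibit MAC operations, providing up to 8-bit input, 8-bit weight, and 20-bit output precision for CNN applications. It uses (1) weight bit-wise MAC (WbwMAC) operations to enlarge the sensing margin and improve input, weight, and output precision; (2) a 6T local computing cell (LCC) that offers compact area and reliable readout against process variations; and (3) a weight bit-wise low-MAC-aware readout (Wbw-LMAR) circuit, based on software-hardware co-design, to improve energy efficiency (EFMAC). The fabricated 28nm 64Kb 6T SRAM-CIM macro achieves MAC operations with up to 8-bit input and 8-bit weight (8bIN-8bW), with the highest output precision among current SRAM-CIM works, a computation time of 4.1-8.4ns, and an energy efficiency of 11.5-68.4 TOPS/W.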


    In recent years, the rapid development of deep neural networks and the widespread use of edge devices have greatly increased the volume of data transfer. However, moving data between the CPU and memory consumes most of the energy in the traditional von Neumann architecture, a problem known as the von Neumann bottleneck. Computing-in-memory (CIM) has therefore become a development goal for addressing this bottleneck.
    Computing-in-memory provides both computation and storage. Because the computation is performed before the data are read out, results can be transferred directly, which reduces the movement of large amounts of data. To achieve this, the design idea of this work is to input the neural network feature map and activate it in parallel, so that the memory performs multiply-and-accumulate (MAC) operations.
    This work presents an SRAM-CIM macro using 6T SRAM for multibit MAC operations with up to 8b input (IN), 8b weight (W), and 20b output (OUT) precision for CNN applications. It uses (1) weight bit-wise MAC (WbwMAC) operations to enlarge the sensing margin and enhance IN/W/OUT precision, (2) a 6T local-computing-cell (LCC) for compact area and robust read against process variations, and (3) a weight bit-wise low-MAC-aware readout (Wbw-LMAR) scheme based on software-hardware co-design to improve energy efficiency (EFMAC). A fabricated 28nm 64Kb 6T SRAM-CIM macro demonstrated 8bIN-8bW MAC operations with the highest output precision among current SRAM-CIM works, reaching an access time of 4.1-8.4ns and an energy efficiency of 11.5-68.4 TOPS/W.
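    The weight bit-wise MAC idea in the abstract can be illustrated arithmetically. The sketch below is only a software model under stated assumptions, not the thesis circuit: the function name `wbw_mac`, the unsigned operands, and the choice of 16 accumulations are all hypothetical (the record does not state the accumulation count or number format; 16 is used here only because the worst case then fits in 20 bits). Each weight is split into its bits, a binary-weight partial sum is formed per bit position, and the partial sums are combined by shift-and-add.

```python
def wbw_mac(inputs, weights, weight_bits=8):
    """Software model of a weight bit-wise MAC (hypothetical helper).

    Assumes unsigned integer inputs and weights. One partial sum is
    formed per weight bit position, then the partial sums are
    combined by shift-and-add.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    total = 0
    for bit in range(weight_bits):
        # Partial MAC using only bit `bit` of every weight (0 or 1).
        partial = sum(x * ((w >> bit) & 1) for x, w in zip(inputs, weights))
        # Weight the partial sum by the significance of this bit.
        total += partial << bit
    return total


def direct_mac(inputs, weights):
    """Reference multiply-and-accumulate for comparison."""
    return sum(x * w for x, w in zip(inputs, weights))


if __name__ == "__main__":
    # Assumed worst case: 16 accumulations of 8-bit x 8-bit products.
    xs = [255] * 16
    ws = [255] * 16
    result = wbw_mac(xs, ws)
    print(result == direct_mac(xs, ws), result < 2 ** 20)
```

    Splitting by weight bit means each partial sum only has to distinguish the outcomes of a binary-weight MAC, which is consistent with the abstract's claim that the scheme enlarges the sensing margin; the analog readout itself is not modeled here.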

    Acknowledgements
    Abstract (Chinese)
    Abstract
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Memory Landscape
        1.1.1 RAM
        1.1.2 CAM
        1.1.3 ROM
        1.1.4 Programmable NVMs
      1.2 Von Neumann Bottleneck
      1.3 Computing-in-Memory (CIM)
    Chapter 2 Introduction for SRAM
      2.1 Introduction for Conventional 6T SRAM
        2.1.1 Structure of SRAM
        2.1.2 Write Operation and Write Margin
        2.1.3 Read Operation
      2.2 Introduction for Hierarchical Bitline 6T SRAM
        2.2.1 Structure of Hierarchical Bitline 6T SRAM Array
        2.2.2 Write Operation
        2.2.3 Read Operation
    Chapter 3 Previous Work
      3.1 10T SRAM-CIM with Binary Weight Analog Computing
      3.2 6T SRAM-CIM with Multi-bit Weight Word-wise Analog Computing
      3.3 Twin-8T SRAM-CIM with Multi-bit Weight Word-wise Analog Computing
      3.4 8T SRAM-CIM with Binary Weight Analog Computing
    Chapter 4 Proposed Circuit Schemes and Analysis
      4.1 Proposed Computing-In-Memory Circuit Scheme
        4.1.1 Proposed CIM Structure
        4.1.2 Proposed Computing-In-Memory Operation
        4.1.3 Proposed Weight Bit-wise Low-MAC Aware Readout (Wbw-LMAR) Scheme
      4.2 Analysis and Comparison
        4.2.1 Proposed Sensing Scheme Analysis
    Chapter 5 Macro Implementation
      5.1 Floor Plan of SRAM-CIM Macro
      5.2 Design for Test Chip
    Chapter 6 Experimental Results and Conclusion
      6.1 Measured Performance
      6.2 Comparison to Previous Work
      6.3 Conclusions and Future Work
    References

