Author: | 陳韋豪 Chen, Wei-Hao |
---|---|
Thesis Title: | 應用於AI邊緣裝置之高能源效率非揮發性記憶體內運算巨集電路技術 Circuit Techniques for Energy-Efficient ReRAM-Based Non-volatile Computing-in-Memory Macros in AI Edge Devices |
Advisor: | 張孟凡 Chang, Meng-Fan |
Committee Members: | 邱瀝毅 洪浩喬 謝志成 鄭桂忠 |
Degree: | 博士 Doctor |
Department: | College of Electrical Engineering and Computer Science - Institute of Electronics Engineering |
Year of Publication: | 2019 |
Academic Year of Graduation: | 107 |
Language: | English |
Number of Pages: | 74 |
Keywords (Chinese): | 非揮發性記憶體 、非揮發性記憶體內運算巨集 、電阻式記憶體 |
Keywords (English): | Non-volatile memory, Non-volatile computing-in-memory macro, ReRAM |
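In conventional processor architectures, data travel between the processor and memory over a bus; this arrangement is known as the von Neumann architecture. With the rise of big-data technology and AI chips, the volume of data a system must process has grown dramatically. In the conventional architecture, the number of input/output (IO) pins on the interface between memory and processor becomes a speed bottleneck, and the large extra energy spent moving data also limits performance. In recent years, computing-in-memory has become one of the most promising research topics. Unlike the von Neumann architecture, computing-in-memory can perform parallel operations within a single chip, reducing the amount of data that must be transferred and buffered and thereby achieving fast, low-power computation.
This work uses resistive memory (ReRAM), which offers high density and a large ratio between its high- and low-resistance states (R-ratio), to propose two novel nonvolatile computing-in-memory (nvCIM) macros, and verifies them at system level in deep learning (DL) neural networks. The memory in these macros serves not only as storage but also performs computation on the stored data, which effectively reduces data traffic and wasted energy. The target application is next-generation AI edge devices with limited energy and hardware resources. The circuit features of this work are as follows:
1. A 16 Mb ReRAM-based dual-mode computing (DMc) computing-in-memory macro in a 150 nm process.
• The largest-capacity and fastest nonvolatile computing-in-memory macro among internationally published work.
• The read delay of 2-input logic (AND/OR/XOR) CIM is less than 14 ns, 86 times faster than a conventional nvCIM architecture.
2. A 1 Mb contact-ReRAM-based multi-mode computing (MMc) computing-in-memory macro in a 65 nm process.
• This work proposes 2-input and 3-input logic (AND/OR/XOR) and multiply-and-accumulate (MAC) functions within a nonvolatile computing-in-memory macro.
• The first internationally published nonvolatile computing-in-memory macro that can switch between logic and MAC functions.
• System-level verification of character recognition (MNIST database) built on the nvCIM chip achieves a recognition accuracy of 98.8%.
• Computing-in-memory circuit development:
a) A mode-and-input-aware reference current generator (MIA-RCG) produces the reference currents for both logic and MAC functions from the same reference cell array, with no additional area cost.
b) An input-aware dynamic IREF (IA-REF) reference current generation scheme improves the signal margin from -27.9 uA to 7.8 uA.
c) Compared with a conventional circuit architecture, the nvCIM macro that combines the IA-REF reference current generation scheme with a small-offset sense amplifier reduces the error rate by a factor of 50 in DNN character recognition (MNIST database).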
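The 2-input logic modes listed above can be illustrated with a small behavioral model. This is only a sketch of one common way current-sensing logic-in-memory is described, not the circuit of this thesis: two cells are assumed to be activated on the same bitline, and the summed read current is compared against two reference levels. The current values `I_LRS` and `I_HRS`, the reference names `REF_OR` and `REF_AND`, and the function names are all hypothetical.

```python
# Behavioral sketch of 2-input logic computing-in-memory by current sensing.
# All values and names are illustrative assumptions, not taken from the thesis.

I_LRS = 10.0  # uA, hypothetical read current of a low-resistance cell (stores 1)
I_HRS = 0.5   # uA, hypothetical read current of a high-resistance cell (stores 0)


def bitline_current(a: int, b: int) -> float:
    """Summed read current when the two cells holding bits a and b are read together."""
    return sum(I_LRS if bit else I_HRS for bit in (a, b))


# Each reference sits midway between two adjacent bitline current levels.
REF_OR = (2 * I_HRS + (I_LRS + I_HRS)) / 2   # separates "00" from "01"/"10"
REF_AND = ((I_LRS + I_HRS) + 2 * I_LRS) / 2  # separates "01"/"10" from "11"


def cim_logic(a: int, b: int) -> dict:
    """Derive AND/OR/XOR of two stored bits from one combined read."""
    i_bl = bitline_current(a, b)
    or_out = int(i_bl > REF_OR)
    and_out = int(i_bl > REF_AND)
    xor_out = or_out & (1 - and_out)  # exactly one of the two cells is in LRS
    return {"AND": and_out, "OR": or_out, "XOR": xor_out}


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, cim_logic(a, b))
```

In this model a large R-ratio matters because it widens the gap between adjacent current levels, which leaves more room for placing the references.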
The von Neumann architecture is challenged by the large volume of data that must travel over a bus through the memory hierarchy to the processing elements (PEs). Because IO bandwidth is limited, this traffic not only consumes a large amount of energy but also causes significant delay. Recently, nonvolatile computing-in-memory (nvCIM) has emerged as a promising solution that enables highly energy-efficient computing for AI edge devices. In particular, nvCIM can achieve high speed, high throughput, and low power consumption through parallel processing. In this work, we propose two fully integrated nvCIM macros based on ReRAM technology to meet the stringent performance, energy, and area constraints of macro-level implementation. The main contributions of this work are as follows:
1. A 16 Mb dual-mode computing (DMc) nvCIM macro was fabricated in a 150 nm CMOS process with 1T1R HfOx ReRAM devices. This macro supports both memory and logic (AND/OR/XOR) CIM operations.
• It is the largest-capacity and fastest nonvolatile computing-in-memory macro among published work.
• The measured delay of the 2-input logic in-memory operations is less than 14 ns.
2. A 1 Mb multi-mode computing (MMc) nvCIM macro was fabricated in a 65 nm CMOS process with 1T1R contact ReRAM (CRRAM) devices. This nvCIM macro supports memory, 2-input and 3-input logic (AND/OR/XOR) CIM, and multiply-and-accumulate (MAC) CIM functions.
• It is the largest-capacity and fastest nonvolatile computing-in-memory macro with both MAC and logic operations.
• A mode-and-input-aware reference current generator (MIA-RCG) generates the logic and MAC reference currents in the same reference (REF) array without additional area overhead.
• The proposed input-aware dynamic IREF (IA-REF) reference generation scheme improves the signal margin from -27.9 uA to 7.8 uA.
• On the MNIST database, the proposed IA-REF reference generation scheme combined with a small-offset current sense amplifier reduces the inference error rate by 50x.
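To make the signal-margin figures above more concrete, the following sketch shows why a reference current that ignores the number of active inputs can yield a negative margin in a binary MAC read, and how a reference that tracks the input count restores it. It is a simplified behavioral model under stated assumptions, not the IA-REF circuit itself: the cell currents, the midpoint rule for the reference, and all function names are hypothetical, and the numbers are not meant to reproduce the measured -27.9 uA and 7.8 uA.

```python
# Behavioral sketch: fixed versus input-aware reference for a binary MAC read.
# All values and names are illustrative assumptions, not taken from the thesis.

I_LRS = 10.0  # uA, hypothetical current of an activated cell storing weight 1
I_HRS = 2.0   # uA, hypothetical leakage of an activated cell storing weight 0


def bitline_current(n_active: int, mac: int) -> float:
    """Bitline current when n_active inputs are 1 and mac of them hit weight-1 cells."""
    return mac * I_LRS + (n_active - mac) * I_HRS


def fixed_reference(n_active: int, k: int) -> float:
    """Reference between MAC values k and k+1 that ignores HRS leakage and input count."""
    return (k + 0.5) * I_LRS


def input_aware_reference(n_active: int, k: int) -> float:
    """Reference placed midway between the two levels for the actual input count."""
    return (bitline_current(n_active, k) + bitline_current(n_active, k + 1)) / 2


def signal_margin(reference, n_active: int, k: int) -> float:
    """Worst-case distance from the reference to the k and k+1 current levels."""
    ref = reference(n_active, k)
    return min(bitline_current(n_active, k + 1) - ref,
               ref - bitline_current(n_active, k))


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(n,
              signal_margin(fixed_reference, n, 0),
              signal_margin(input_aware_reference, n, 0))
```

With these hypothetical values, the fixed reference goes negative once the accumulated HRS leakage of the active weight-0 cells exceeds it, while the input-aware reference keeps a constant margin of half the LRS/HRS current difference.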