
Author: Chung, Yen-Lin (鐘彥麟)
Thesis title: A Time Domain Pulse Edge Based 6T SRAM Computing-in-Memory Scheme for Deep Neural Network Data Processing (應用於深度神經網絡資料處理以基於時域脈衝邊緣6T靜態隨機存取記憶體之記憶體內運算結構)
Advisor: Chang, Meng-Fan (張孟凡)
Committee members: Liu, Ren-Shuo (呂仁碩); Chiou, Lih-Yih (邱瀝毅)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: English
Number of pages: 53
Keywords: memory, computing-in-memory, deep neural network, SRAM, accelerator
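  • The spread of artificial intelligence is now a global trend, with deep learning drawing the most public anticipation, and this has sharply increased the computational load on computers. A large amount of electrical energy is also needed to move data from memory to the computing unit and to store the results back into memory; this energy cost is called the "Von Neumann bottleneck." "Computing-in-memory" has therefore become a development goal for resolving this bottleneck. Computing-in-memory supports both computation and storage. The simple but numerous multiply-and-accumulate operations are completed before the computing unit reads the data, and the results are passed to the computing unit for the subsequent complex but less numerous operations, which avoids moving large amounts of data. This study proposes multi-bit multiply-and-accumulate operations based on static random access memory (SRAM) that can support convolutional neural network applications, with a highest specification of 8-bit input, 8-bit weight, and 22-bit output precision. The architecture has three features: (1) it exploits the fact that time-domain accumulation has no upper limit, overcoming the inherently limited signal margin of earlier voltage-mode or current-mode computing-in-memory; (2) the Edged-based Delay Cell (EDC), paired with multiple SRAM cells, gives a compact area and reads that are reliable against process variation; (3) the double reference column scheme (DRCS) saves a large amount of the energy required by the readout circuits. The study is verified in the TSMC 22nm Logic Ultra-Low-Power process with 1Mb SRAM capacity, achieving a computing time of 5.6 ns and an average energy efficiency of 8.7 TOPS/W.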


    Artificial intelligence is now a global trend, and deep learning is its most widely anticipated branch; this has substantially increased computational workloads. A large amount of energy is also needed to transfer data from memory to the arithmetic unit and to store the results back into memory. This energy cost is called the “Von Neumann bottleneck.” “In-memory computing” has therefore become a development goal for solving this bottleneck.

    In-memory computing supports computation and storage at the same time. Before the arithmetic unit reads the data, the simple but numerous multiply-and-accumulate operations are completed inside the memory, and the results are passed to the arithmetic unit for the remaining complex but less numerous operations, which avoids moving large amounts of data.

    This research proposes multi-bit multiply-and-accumulate (MAC) operations based on static random access memory (SRAM), which can support convolutional neural network applications. The highest specification provides 8-bit input, 8-bit weight, and 22-bit output precision.

    The architecture has three features: (1) time-domain accumulation has no upper limit, which overcomes the inherently insufficient signal margin of earlier voltage-mode or current-mode CIM operations; (2) the proposed Edged-based Delay Cell (EDC) is combined with multiple sets of SRAM cells to achieve a compact area and reads that are reliable against process variation; (3) the double reference column scheme (DRCS) substantially reduces the energy required by the readout circuits.

    The design is verified in the TSMC 22nm Logic Ultra-Low-Power process with 1Mb SRAM capacity, achieving a cycle time of 5.6 ns and an average energy efficiency of 8.7 TOPS/W.
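    To make the quoted precision figures concrete, the following minimal sketch computes a plain integer MAC and checks how many worst-case products fit in a 22-bit output. The unsigned number format and the accumulation length of 64 are assumptions made here for illustration only; the abstract does not state either, and the function names are hypothetical.

```python
def mac(inputs, weights):
    """Reference multiply-accumulate: sum of element-wise products."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


def max_accumulations(in_bits, w_bits, out_bits):
    """Largest count of worst-case unsigned products that fits in out_bits."""
    max_product = (2**in_bits - 1) * (2**w_bits - 1)
    return (2**out_bits - 1) // max_product


# Assumption: 64 accumulations of unsigned 8-bit inputs and weights.
N_ACC = 64
worst_case = mac([255] * N_ACC, [255] * N_ACC)

print(worst_case, worst_case.bit_length())  # worst-case sum and its bit width
print(max_accumulations(8, 8, 22))          # accumulations that fit in 22 bits
```

    Under these assumptions, each product needs at most 16 bits, so a 22-bit output leaves 6 bits of headroom for accumulation.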

    Acknowledgements
    Abstract (Chinese)
    Abstract
    1 Introduction
      1.1 Memory Landscape
        1.1.1 Random Access Memory (RAM) & SRAM & DRAM
        1.1.2 Content-Addressable Memory (CAM)
        1.1.3 Read-Only Memory (ROM)
        1.1.4 Programmable Non-Volatile Memory (Programmable NVM)
      1.2 Von Neumann bottleneck
      1.3 Computing-In-Memory (CIM)
    2 In-depth introduction to SRAM
      2.1 Introduction to Conventional 6T SRAM
        2.1.1 Structure of 6T SRAM cell
        2.1.2 Write Operation and Write Margin
        2.1.3 Read Operation and Static Noise Margin
      2.2 Introduction to Hierarchical Bit Line (HBL) 6T SRAM
        2.2.1 Structure of HBL 6T SRAM Array
        2.2.2 Write Operation of HBL 6T SRAM Array
        2.2.3 Read Operation of HBL 6T SRAM Array
    3 Previous CIM Work
      3.1 Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN-Based Machine Learning Applications [1]
      3.2 A 42pJ/Decision 3.12TOPS/W Robust In-Memory Machine Learning Classifier with On-Chip Training [2]
      3.3 Sandwich-RAM: An Energy-Efficient In-Memory BWN Architecture with Pulse-Width Modulation [3]
    4 Proposed Circuit Schemes and Analysis
      4.1 Proposed Computing-In-Memory Circuit Scheme
        4.1.1 Proposed CIM Structure
        4.1.2 Proposed CIM Operation
      4.2 Analysis and Comparison
    5 Macro Implementation
      5.1 Floor Plan of SRAM CIM
      5.2 Design for Access Time Measurement
    6 Experimental Results and Conclusion
      6.1 Measured Performance
      6.2 Comparison to Previous Work
      6.3 Conclusions and Future Work
    References

    [1] A. Biswas and A. P. Chandrakasan, “Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications,” in 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 488–490, 2018.
    [2] S. K. Gonugondla, M. Kang, and N. Shanbhag, “A 42pJ/decision 3.12TOPS/W robust in-memory machine learning classifier with on-chip training,” in 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 490–492, 2018.
    [3] J. Yang, Y. Kong, Z. Wang, Y. Liu, B. Wang, S. Yin, and L. Shi, “24.4 Sandwich-RAM: An energy-efficient in-memory BWN architecture with pulse-width modulation,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 394–396, 2019.
    [4] X. Si, J.-J. Chen, Y.-N. Tu, W.-H. Huang, J.-H. Wang, Y.-C. Chiu, W.-C. Wei, S.-Y. Wu, X. Sun, R. Liu, S. Yu, R.-S. Liu, C.-C. Hsieh, K.-T. Tang, Q. Li, and M.-F. Chang, “24.5 A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 396–398, 2019.
    [5] X. Si, Y.-N. Tu, W.-H. Huang, J.-W. Su, P.-J. Lu, J.-H. Wang, T.-W. Liu, S.-Y. Wu, R. Liu, Y.-C. Chou, Z. Zhang, S.-H. Sie, W.-C. Wei, Y.-C. Lo, T.-H. Wen, T.-H. Hsu, Y.-K. Chen, W. Shih, C.-C. Lo, R.-S. Liu, C.-C. Hsieh, K.-T. Tang, N.-C. Lien, W.-C. Shih, Y. He, Q. Li, and M.-F. Chang, “15.5 A 28nm 64kb 6T SRAM computing-in-memory macro with 8b MAC operation for AI edge chips,” in 2020 IEEE International Solid-State Circuits Conference (ISSCC), pp. 246–248, 2020.
    [6] Hulfang Qin, Yu Cao, D. Markovic, A. Vladimirescu, and J. Rabaey, “SRAM leakage suppression by minimizing standby supply voltage,” in International Symposium on Signals, Circuits and Systems. Proceedings, SCS 2003. (Cat. No.03EX720), pp. 55–60, 2004.
    [7] C. Morishima, K. Nii, Y. Tsujihashi, Y. Hayakawa, and H. Makino, “A 1-V 20-ns 512-kbit MT-CMOS SRAM with auto-power-cut scheme using dummy memory cells,” in Proceedings of the 24th European Solid-State Circuits Conference, pp. 452–455, 1998.
    [8] A. G. Hanlon, “Content-addressable and associative memory systems a survey,” IEEE Transactions on Electronic Computers, vol. EC-15, no. 4, pp. 509–521, 1966.
    [9] C. Wang, C. Cheng, T. Chen, and J. Wang, “An adaptively dividable dual-port BiTCAM for virus-detection processors in mobile devices,” in 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, pp. 390–622, 2008.
    [10] J. Li, R. K. Montoye, M. Ishii, and L. Chang, “1 Mb 0.41 μm² 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing,” IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 896–907, 2014.
    [11] M. F. Chang, C. C. Lin, A. Lee, C. C. Kuo, G. H. Yang, H. J. Tsai, T. F. Chen, S. S. Sheu, P. L. Tseng, H. Y. Lee, and T. K. Ku, “17.5 A 3T1R nonvolatile TCAM using MLC ReRAM with sub-1ns search time,” in 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, pp. 1–3, 2015.
    [12] D. Smith, J. Zeiter, T. Bowman, J. Rahm, B. Kertis, A. Hall, S. Natan, L. Sanderson, R. Tromp, and J. Tsang, “A 3.6 ns 1 kb ECL I/O BiCMOS UV EPROM,” in 1990 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1987–1990 vol.3, 1990.
    [13] C. Kuo, M. Weidner, T. Toms, H. Choe, K. . Chang, A. Harwood, J. Jelemensky, and P. Smith, “A 512-kb flash EEPROM embedded in a 32-b microcontroller,” IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 574–582, 1992.
    [14] S. H. Kulkarni, Z. Chen, J. He, L. Jiang, M. B. Pedersen, and K. Zhang, “A 4 kb metal-fuse OTP-ROM macro featuring a 2 V programmable 1.37 μm² 1T1R bit cell in 32 nm high-k metal-gate CMOS,” IEEE Journal of Solid-State Circuits, vol. 45, no. 4, pp. 863–868, 2010.
    [15] Y. Tsai, H. Chen, H. Chiu, H. Shih, H. Lai, Y. King, and C. J. Lin, “45nm gateless anti-fuse cell with CMOS fully compatible process,” in 2007 IEEE International Electron Devices Meeting, pp. 95–98, 2007.
    [16] A. Niebel, “Business outlook for the non-volatile memory market,” in 2006 21st IEEE Non-Volatile Semiconductor Memory Workshop, pp. 6–7, 2006.
    [17] Sang Lyul Min and Eyee Hyun Nam, “Current trends in flash memory technology,” in Asia and South Pacific Conference on Design Automation, 2006., pp. 2 pp.–, 2006.
    [18] F. Masuoka, M. Momodomi, Y. Iwata, and R. Shirota, “New ultra high density EPROM and flash EEPROM with NAND structure cell,” in 1987 International Electron Devices Meeting, pp. 552–555, 1987.
    [19] W. H. Lee, C. Hur, H. Lee, H. Yoo, S. Lee, B. Lee, C. Park, and K. Kim, “Post-cycling data retention failure in multilevel NOR flash memory with nitrided tunnel-oxide,” in 2009 IEEE International Reliability Physics Symposium, pp. 907–908, 2009.
    [20] G. Strawn, “Masterminds of the electronic digital computer,” IT Professional, vol. 16, no. 2, pp. 10–12, 2014.
    [21] B. Chen, F. Cai, J. Zhou, W. Ma, P. Sheridan, and W. D. Lu, “Efficient in-memory computing architecture based on crossbar arrays,” in 2015 IEEE International Electron Devices Meeting (IEDM), pp. 17.5.1–17.5.4, 2015.
    [22] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories,” in 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6, 2016.
    [23] Q. Dong, S. Jeloka, M. Saligane, Y. Kim, M. Kawaminami, A. Harada, S. Miyoshi, D. Blaauw, and D. Sylvester, “A 0.3V VDDmin 4+2T SRAM for searching and in-memory computing using 55nm DDC technology,” in 2017 Symposium on VLSI Circuits, pp. C160–C161, 2017.
    [24] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a machine-learning classifier in a standard 6T SRAM array,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, 2017.
    [25] J.-W. Su, X. Si, Y.-C. Chou, T.-W. Chang, W.-H. Huang, Y.-N. Tu, R. Liu, P.-J. Lu, T.-W. Liu, J.-H. Wang, Z. Zhang, H. Jiang, S. Huang, C.-C. Lo, R.-S. Liu, C.-C. Hsieh, K.-T. Tang, S.-S. Sheu, S.-H. Li, H.-Y. Lee, S.-C. Chang, S. Yu, and M.-F. Chang, “15.2 A 28nm 64kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips,” in 2020 IEEE International Solid-State Circuits Conference (ISSCC), pp. 240–242, 2020.
    [26] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W.-S. Khwa, H.-J. Liao, Y. Wang, and J. Chang, “15.3 A 351TOPS/W and 372.4GOPS compute-in-memory SRAM macro in 7nm FinFET CMOS for machine-learning applications,” in 2020 IEEE International Solid-State Circuits Conference (ISSCC), pp. 242–244, 2020.
    [27] J. Yue, R. Liu, W. Sun, Z. Yuan, Z. Wang, Y.-N. Tu, Y.-J. Chen, A. Ren, Y. Wang, M.-F. Chang, X. Li, H. Yang, and Y. Liu, “7.5 A 65nm 0.39-to-140.3TOPS/W 1-to-12b unified neural network processor using block-circulant-enabled transpose-domain acceleration with 8.1× higher TOPS/mm² and 6T HBST-TRAM-based 2D data-reuse architecture,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 138–140, 2019.
