Graduate Student: Chung, Yen-Lin (鐘彥麟)
Thesis Title: A Time Domain Pulse Edge Based 6T SRAM Computing-in-Memory Scheme for Deep Neural Network Data Processing
Advisor: Chang, Meng-Fan (張孟凡)
Committee Members: Liu, Ren-Shuo (呂仁碩); Chiou, Lih-Yih (邱瀝毅)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Publication Year: 2021
Academic Year: 109
Language: English
Pages: 53
Keywords: memory, computing-in-memory, deep neural network, accelerator, SRAM
Abstract:

The popularization of artificial intelligence is now a global trend, among which deep learning is the most anticipated by the public; this has also led to a substantial increase in computational load. However, a large amount of power is needed to transfer data from memory to the arithmetic unit and to store the computed results back into memory. This energy loss is known as the "von Neumann bottleneck," and "computing-in-memory" (CIM) has become a development goal for overcoming it. Computing-in-memory supports computation and storage at the same time: before the arithmetic unit reads the data, the simple but numerous multiply-and-accumulate operations are completed inside the memory, and only the results are passed on for the subsequent complex but less numerous operations, avoiding the movement of large amounts of data.

This thesis proposes multi-bit multiply-and-accumulate (MAC) operations based on static random access memory (SRAM), supporting convolutional neural network applications with a maximum configuration of 8-bit inputs, 8-bit weights, and 22-bit output precision. The architecture has three features: (1) time-domain accumulation has no upper limit, overcoming the inherently insufficient signal margin of previous voltage- or current-based CIM schemes; (2) the proposed Edge-based Delay Cell (EDC), combined with multiple SRAM cells, yields a compact area and readout that is reliable against process variation; (3) a Double Reference Column Scheme (DRCS) saves much of the energy required by the readout circuits.

This design is verified in a TSMC 22 nm Logic Ultra-Low-Power process with 1 Mb SRAM capacity, achieving a cycle time of 5.6 ns and an average energy efficiency of 8.7 TOPS/W.
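The first feature above, unbounded time-domain accumulation, can be sketched in a few lines. The unit delay, supply voltage, and partial-product values below are hypothetical values chosen for illustration, not figures from the thesis; the point is only that pulse-edge delays add along a chain without a hard ceiling, whereas a voltage-mode accumulator clips at the supply rail and loses signal margin.

```python
# Hedged sketch of time-domain vs. voltage-domain accumulation.
# UNIT_DELAY_PS, VDD_MV, and MV_PER_UNIT are hypothetical constants.
UNIT_DELAY_PS = 10   # delay added to the pulse edge per unit of partial product
VDD_MV = 800         # supply rail that caps a voltage-mode accumulator
MV_PER_UNIT = 50     # hypothetical line swing per unit of partial product

def time_domain_accumulate(partial_products):
    """Each cell delays the propagating pulse edge; delays simply add,
    so the accumulated value has no intrinsic upper limit."""
    return sum(UNIT_DELAY_PS * p for p in partial_products)

def voltage_domain_accumulate(partial_products):
    """A voltage on a shared line saturates at the supply rail,
    squeezing the signal margin as more terms are summed."""
    return min(VDD_MV, sum(MV_PER_UNIT * p for p in partial_products))

pp = [3, 1, 4, 1, 5, 9, 2, 6]         # 31 units of partial products
print(time_domain_accumulate(pp))      # 310 ps -- keeps growing linearly
print(voltage_domain_accumulate(pp))   # 800 mV -- clipped at the rail
```

The contrast mirrors the abstract's claim: the time-domain result scales linearly with the number of accumulated terms, while the voltage-domain result saturates once the sum exceeds the rail.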
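One plausible reading of the 8-bit-input, 8-bit-weight, 22-bit-output specification is simple bit growth: an 8b × 8b product needs up to 16 bits, and each doubling of the accumulation depth adds one bit. The depth N = 64 below is an assumption for illustration (16 + log2(64) = 22 bits), not a parameter stated in the abstract.

```python
import random

# Hedged sketch of MAC bit growth: 8-bit x 8-bit products accumulated
# N = 64 times (a hypothetical depth) stay within 22 magnitude bits.
N = 64
random.seed(0)
x = [random.randint(-128, 127) for _ in range(N)]  # 8-bit signed inputs
w = [random.randint(-128, 127) for _ in range(N)]  # 8-bit signed weights
acc = sum(xi * wi for xi, wi in zip(x, w))         # multiply-accumulate

# Worst-case magnitude: 64 * 128 * 128 = 2**22, so 22 magnitude bits
# (plus a sign bit) always suffice for the accumulated result.
assert abs(acc) <= 2**22
print(acc, acc.bit_length())
```

The assertion encodes the worst case, so it holds for any choice of 8-bit inputs and weights at this depth, not just the random sample drawn here.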
[24]J. Zhang, Z. Wang, and N. Verma, “Inmemory computation of a machinelearning classifier in a standard 6t sram array,” IEEE Journal of SolidState Circuits, vol. 52, no. 4,pp. 915–924, 2017.[25]J.W. Su, X. Si, Y.C. Chou, T.W. Chang, W.H. Huang, Y.N. Tu, R. Liu, P.J. Lu, T.W.Liu, J.H. Wang, Z. Zhang, H. Jiang, S. Huang, C.C. Lo, R.S. Liu, C.C. Hsieh, K.T. Tang, S.S. Sheu, S.H. Li, H.Y. Lee, S.C. Chang, S. Yu, and M.F. Chang, “15.2 a28nm 64kb inferencetraining twoway transpose multibit 6t sram computeinmemorymacro for ai edge chips,” in 2020 IEEE International Solid State Circuits Conference (ISSCC), pp. 240–242, 2020.[26]Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W.S. Khwa, H.J. Liao, Y. Wang, andJ. Chang, “15.3 a 351tops/w and 372.4gops computeinmemory sram macro in 7nm finfetcmos for machinelearning applications,” in 2020 IEEE International Solid State CircuitsConference (ISSCC), pp. 242–244, 2020.[27]J. Yue, R. Liu, W. Sun, Z. Yuan, Z. Wang, Y.N. Tu, Y.J. Chen, A. Ren, Y. Wang, M.F.Chang, X. Li, H. Yang, and Y. Liu, “7.5 a 65nm 0.39to140.3tops/w 1to12b unified neural network processor using blockcirculantenabled transposedomain acceleration with8.1 × higher tops/mm2and 6t hbsttrambased 2d datareuse architecture,” in 2019 IEEEInternational Solid State Circuits Conference (ISSCC), pp. 138–140, 2019.