| Field | Value |
|---|---|
| Author | 羅允辰 Lo, Yun-Chen |
| Thesis title | 以硬體及演算法協同設計實現具位元寬度彈性之卷積神經網路及可調式非揮發記憶體內運算加速器 (Hardware-Algorithm Co-Design to Enable Bitwidth-Flexible Convolutional Neural Networks and Tunable Non-Volatile In-Memory-Computing Accelerators) |
| Advisor | 呂仁碩 Liu, Ren-Shuo |
| Committee members | 張孟凡 Chang, Meng-Fan; 鄭桂忠 Tang, Kea-Tiong; 謝志成 Hsieh, Chih-Cheng |
| Degree | Master |
| Department | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of publication | 2020 |
| Academic year of graduation | 108 |
| Language | English |
| Number of pages | 37 |
| Keywords | Bitwidth Flexibility, Convolutional Neural Network, Non-volatile Memory, In-memory Computing, Accelerators |
Convolutional neural networks (CNNs) have emerged as a killer application on both cloud servers and end devices. One major roadblock to their wide deployment is that CNNs are compute-intensive and power-hungry: processing even a small image typically invokes billions of multiplications, and these multiplications come with an enormous amount of data movement, which has been identified as the primary source of latency and energy overhead.

To tackle this roadblock, prior work has pursued low-bitwidth CNN (LB-CNN) accelerators and non-volatile computing-in-memory (nvCIM) accelerators. Building on these two directions, this thesis proposes FlexNet, the first hardware-algorithm co-design that provides nvCIM accelerators with run-time bitwidth flexibility and the corresponding energy-latency-accuracy tunability. More specifically, we present five enabling techniques: a novel binary number system (hardware-related), a bit-progressive training procedure (algorithm-related), a bit-sample training procedure (algorithm-related), a bit-sample-ratio adjustment strategy (algorithm-related), and the training and use of multiple batch-normalization settings (both hardware- and algorithm-related).
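To make the co-design concrete, below is a minimal PyTorch-style sketch, under assumptions of our own, of two of the ideas named above: sampling a training bitwidth per step (bit-sampling) and keeping one set of batch-normalization parameters per supported bitwidth. It is an illustration only, not the thesis implementation; the bitwidth list, sampling ratios, quantizer, and the FlexConvBlock module are hypothetical.

```python
# Illustrative sketch only (not the thesis implementation).
import random
import torch
import torch.nn as nn

BITWIDTHS = [1, 2, 4, 8]             # hypothetical candidate run-time bitwidths
SAMPLE_PROBS = [0.1, 0.2, 0.3, 0.4]  # hypothetical bit-sample ratios (sum to 1)

def quantize_weights(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric quantization of weights to `bits` bits,
    with a straight-through estimator so gradients pass unchanged."""
    qmax = max(2 ** (bits - 1) - 1, 1)
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (q - w).detach()

class FlexConvBlock(nn.Module):
    """Conv block that keeps one BatchNorm parameter set per supported bitwidth."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, kernel_size=3, padding=1, bias=False)
        self.bns = nn.ModuleDict({str(b): nn.BatchNorm2d(cout) for b in BITWIDTHS})

    def forward(self, x, bits):
        w = quantize_weights(self.conv.weight, bits)   # quantize on the fly
        y = nn.functional.conv2d(x, w, padding=1)
        return torch.relu(self.bns[str(bits)](y))      # BN set matching the bitwidth

if __name__ == "__main__":
    block = FlexConvBlock(3, 16)
    opt = torch.optim.Adam(block.parameters(), lr=1e-3)
    for step in range(4):
        bits = random.choices(BITWIDTHS, weights=SAMPLE_PROBS)[0]  # bit-sampling
        x = torch.randn(8, 3, 32, 32)
        loss = block(x, bits).mean()   # placeholder loss for the sketch
        opt.zero_grad()
        loss.backward()
        opt.step()
        print(f"step {step}: trained with {bits}-bit weights")
```

The per-bitwidth BatchNorm dictionary reflects the intuition that activation statistics shift as the weight bitwidth changes, so a single shared parameter set would be reused across mismatched distributions.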
The achieved bitwidth flexibility brings three advantages. First, nvCIM accelerators gain one more dimension for power and thermal management. Second, a CNN application can choose the execution bitwidth that best matches the nvCIM accelerator it runs on. Third, because an nvCIM accelerator can switch between bitwidths quickly, its bitwidth can be modulated to trace a nearly continuous energy-latency-accuracy tradeoff curve. We perform experiments on ImageNet recognition tasks using AlexNet, VGG, ResNet-18, MobileNet-v2, and ShuffleNet-v2. The results show up to an 83.34% top-5 accuracy gain, relative to a conventional low-bitwidth network, for an nvCIM accelerator executing a 4-bit ResNet CNN.
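As an illustration of how such tunability might be exploited at run time, the sketch below selects a bitwidth against an energy budget. The policy and the per-bitwidth energy/latency/accuracy profile are placeholders invented for the example, not results reported in this thesis.

```python
# Hypothetical run-time policy (not from the thesis): pick the most accurate
# bitwidth whose estimated energy fits the current budget. The numbers in
# PROFILE are illustrative placeholders, not measured results.
PROFILE = {
    # bitwidth: (energy per inference [mJ], latency [ms], top-5 accuracy)
    1: (0.8, 1.0, 0.62),
    2: (1.5, 1.9, 0.78),
    4: (2.9, 3.6, 0.87),
    8: (5.6, 7.1, 0.89),
}

def select_bitwidth(energy_budget_mj: float) -> int:
    """Return the bitwidth with the best accuracy that fits the energy budget."""
    feasible = [(acc, b) for b, (energy, _lat, acc) in PROFILE.items()
                if energy <= energy_budget_mj]
    if not feasible:
        return min(PROFILE)  # nothing fits: fall back to the cheapest bitwidth
    return max(feasible)[1]

if __name__ == "__main__":
    for budget in (0.5, 1.0, 3.0, 10.0):
        print(f"budget {budget} mJ -> run the nvCIM array at {select_bitwidth(budget)}-bit")
```

Because all bitwidths share one set of weights and only the batch-normalization parameters are swapped, such a policy could react to budget changes without reloading the non-volatile arrays.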