
Graduate Student: 謝軒顥 (Sie, Syuan-Hao)
Thesis Title: 基於記憶體內運算架構之多巨集稀疏化感知深度學習硬體加速器
Multi-core Architecture Computing-In-Memory-based Sparsity Aware Deep Learning Accelerator
Advisor: 鄭桂忠 (Tang, Kea-Tiong)
Committee Members: 張孟凡 (Chang, Meng-Fan), 黃朝宗 (Huang, Chao-Tsung), 盧峙丞 (Lu, Chih-Cheng)
Degree: Master (碩士)
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2021
Graduation Academic Year: 109 (2020–2021)
Language: Chinese
Number of Pages: 52
Chinese Keywords: 記憶體內運算、稀疏化、加速器、深度學習
Foreign Keywords: Computing-In-Memory, Sparsity, Accelerator, Deep Learning
With the rapid development of deep neural networks in recent years, artificial intelligence has gradually appeared in applications all around us, the largest share of which run on edge devices. However, the large amount of parameter storage and computation that deep neural networks require during inference places a heavy burden on a device's energy consumption and speed. To reduce the substantial energy that the traditional von Neumann architecture spends moving data between memory and the processor, Computing-In-Memory architectures have been proposed to eliminate the energy and time consumed by this data movement. In addition, since zero multiplied by any number is zero, and the proportion of zero-valued weights grows as network layers get deeper, hardware that can skip this large number of zero-valued computations can effectively improve inference speed and energy consumption without affecting the results.
    This work proposes a high-efficiency, low-power hardware accelerator architecture and tapes out a chip based on the proposed design. The architecture adopts a dataflow tailored to Computing-In-Memory macros and provides a mechanism that skips computations and accesses involving zero values, thereby achieving lower energy consumption, faster inference, fewer memory accesses, and support for larger network models under the same memory capacity.
    With 4-bit activation precision and 8-bit weight precision, the proposed architecture and method classify images from the CIFAR-10 dataset, reaching 99 fps on VGG16 and 114 fps on ResNet18. Compared with a design without sparsity awareness, the sparsity-aware design improves inference speed by up to 13x and reduces the number of memory accesses by up to 440x.


Deep neural networks have developed rapidly in recent years, and artificial intelligence now appears in a wide range of edge-device applications. However, the huge amount of parameter storage and computation required during inference places a heavy burden on power consumption and performance. In the traditional von Neumann architecture, data transfer between memory and the CPU consumes a large amount of energy, so Computing-In-Memory (CIM) has been proposed to save the energy and time spent on data transfer. In addition, zero multiplied by any number is zero, and the deeper the layers of a network model, the higher the proportion of zero-valued weights. Therefore, if hardware can omit this large number of zero-valued calculations, it can effectively improve performance and energy consumption without affecting the results.
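    As a rough behavioral sketch of this zero-skipping principle (illustrative Python only, not the chip's circuits or dataflow; the function names, the roughly 80% weight-sparsity ratio, and the 4-bit activation range are assumptions chosen for the demonstration), skipping zero-valued weights leaves the accumulated result unchanged while cutting the number of multiplies actually performed:

# Behavioral sketch only: skipping zero-valued weights gives the same
# multiply-accumulate result while reducing the number of multiplies.
# Names and the ~80% sparsity ratio are illustrative assumptions.
import random

def mac_dense(weights, activations):
    # Baseline: every weight/activation pair is multiplied.
    acc, ops = 0, 0
    for w, a in zip(weights, activations):
        acc += w * a
        ops += 1
    return acc, ops

def mac_sparsity_aware(weights, activations):
    # Sparsity-aware: zero weights contribute nothing, so skip them.
    acc, ops = 0, 0
    for w, a in zip(weights, activations):
        if w != 0:
            acc += w * a
            ops += 1
    return acc, ops

random.seed(0)
weights = [random.randint(-8, 7) if random.random() > 0.8 else 0
           for _ in range(1024)]                             # ~80% zero weights
activations = [random.randint(0, 15) for _ in range(1024)]   # 4-bit activations

dense_acc, dense_ops = mac_dense(weights, activations)
sparse_acc, sparse_ops = mac_sparsity_aware(weights, activations)
assert dense_acc == sparse_acc               # identical result
print(f"multiplies: {dense_ops} -> {sparse_ops}")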
    This research proposes a hardware accelerator architecture with high performance and low power consumption, and we taped out a chip based on the proposed architecture. The architecture adopts CIM and is equipped with a mechanism that skips the calculation and storage of zero values, achieving lower energy consumption, faster inference, less memory utilization, and support for larger network models.
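    The skipped storage can be pictured as a compressed weight layout plus a small index that records which weight groups were kept. The sketch below uses a generic bitmap index for illustration only; the group size of 4, the compress/decompress names, and the encoding are assumptions and do not reproduce the thesis's actual Density Kernel Mapping or Sparsity Index Code:

# Generic bitmap-index compression sketch (hypothetical encoding): only
# non-zero weight groups are stored, and a per-group bitmap records which
# groups were kept, so all-zero groups are neither stored nor fetched.
from typing import List, Tuple

def compress(weights: List[int], group: int = 4) -> Tuple[List[int], List[List[int]]]:
    # Split weights into fixed-size groups; keep only groups containing a non-zero value.
    bitmap, kept = [], []
    for i in range(0, len(weights), group):
        g = weights[i:i + group]
        if any(g):                 # dense group: store it and mark 1 in the index
            bitmap.append(1)
            kept.append(g)
        else:                      # all-zero group: store nothing, mark 0
            bitmap.append(0)
    return bitmap, kept

def decompress(bitmap: List[int], kept: List[List[int]], group: int = 4) -> List[int]:
    # Rebuild the original weight stream from the index and the stored groups.
    out, k = [], iter(kept)
    for bit in bitmap:
        out.extend(next(k) if bit else [0] * group)
    return out

w = [0, 0, 0, 0, 3, 0, -2, 0, 0, 0, 0, 0, 1, 4, 0, 0]
bitmap, kept = compress(w)
assert decompress(bitmap, kept) == w
print(bitmap, kept)   # [0, 1, 0, 1] [[3, 0, -2, 0], [1, 4, 0, 0]]

    In such a scheme, only groups flagged 1 in the index need to be stored and fetched at inference time, illustrating how zero-valued groups can avoid both storage and memory traffic.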
    The proposed architecture and method apply 4-bit activations and 8-bit weights on the CIFAR-10 dataset, achieving 99 fps and 114 fps on the convolutional layers of VGG16 and ResNet18, respectively. Compared with a non-sparsity-aware design, the sparsity-aware mechanism improves inference speed by up to 13x and reduces memory accesses by up to 440x.

    Abstract (Chinese)
    ABSTRACT
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1  Research Background
      1.2  Research Motivation and Objectives
      1.3  Thesis Organization
    Chapter 2  Literature Review
      2.1  Deep Learning Hardware Accelerators
        2.1.1  Low-Precision Computation
        2.1.2  Data Reuse
        2.1.3  Data Movement Energy
      2.2  Computing-In-Memory
      2.3  Sparsity-Aware Design
      2.4  Research Motivation
    Chapter 3  Computing-In-Memory-Based Sparsity-Aware Accelerator
      3.1  Accelerator Architecture and Dataflow
      3.2  CIM Behavioral Model and Convolution
      3.3  Sparsity-Aware Architecture and Parameter Mapping
        3.3.1  Sparsity Kernel Segmentation and Definition
        3.3.2  Density Kernel Mapping
        3.3.3  Dense Weight Group Index Coding (Sparsity Index Code)
        3.3.4  Sparsity Address Search
      3.4  Multi-Core Scalable Architecture
    Chapter 4  Experimental Results
      4.1  Experimental Setup
      4.2  Chip Specifications and Neural Network Applications
      4.3  Effectiveness of the Sparsity-Aware Architecture
      4.4  Comparison with Other CIM-Based Accelerators
    Chapter 5  Conclusion and Future Work
    References

