
Graduate Student: Chang, Chia-Hung (張嘉宏)
Thesis Title: Design of an Inference Accelerator for CNN with Sparse Row-wise Kernel (基於稀疏行式核心之卷積神經網路加速器設計)
Advisor: Huang, Chih-Tsun (黃稚存)
Committee Members: Liou, Jing-Jia (劉靖家); Liu, Ren-Shuo (呂仁碩)
Degree: Master
Department:
Year of Publication: 2019
Graduation Academic Year: 107
Language: English
Number of Pages: 41
Keywords: Convolutional Neural Network, Sparse Neural Network, Accelerator, Compression, Row-wise Kernel
Convolutional neural networks (CNNs) have become the mainstream machine learning algorithm in recent years and are applied in various domains such as image recognition, translation, and autonomous driving. However, CNNs are computationally and memory intensive models, so deploying them on some platforms, especially mobile platforms, incurs high energy consumption and latency. Sparse neural networks are one solution that reduces both the amount of computation and the memory requirement. By pruning the redundant connections in a neural network model, a large fraction of the weights become zero, which gives us a great opportunity to compress data and skip ineffectual computations.
Many previous studies also exploit sparsity to reduce time and energy, but they usually adopt element-wise pruning, which makes the weight distribution irregular and the accelerator difficult to design. Our goal is therefore to design an accelerator around a more regular form of sparsity.
In this thesis, we propose a CNN accelerator based on row-wise kernels, which eliminates the computation of empty kernel rows to improve performance. We compress the sparse kernels into a row-wise data format that stores the non-zero kernel rows together with their corresponding indices; this reduces the number of off-chip memory accesses and also preserves the regularity of the kernels. Irregularity still exists between rows, however. To solve this problem, we add an accumulation module to the accelerator, although it increases the number of SRAM accesses and thus the energy consumption. In the SRAM we fix the location of every output, so the actual memory address can be obtained from the kernel index. Because the memory addresses are fixed, performing the computation directly in the original data order would cause memory conflicts; to avoid this, we need filter scheduling, which effectively prevents memory conflicts but introduces a latency overhead.
We compare our design with Eyeriss on three models: Googlenet, Resnet-50, and MobilenetV1. Our design achieves speedups of 2.0x, 1.6x, and 1.8x in latency, respectively, with energy consumption of 1.35x, 1.34x, and 2.16x relative to Eyeriss. Although the energy consumption is always higher than Eyeriss's, the energy-delay products are 67%, 84%, and 118%, respectively, which shows that our design is beneficial on Googlenet and Resnet, whereas on Mobilenet the speedup is offset by the energy consumption because of the 1x1 layers. We synthesized our 168-processing-element design and obtained an area of 14 mm², compared with 12.25 mm² for Eyeriss.
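
To make the two sparsity notions concrete, here is a minimal sketch (our illustration, not the thesis's tooling) that measures both the element-wise sparsity and the row-wise sparsity reported later, assuming a pruned 4D weight tensor laid out as (filter, channel, row, column):

```python
import numpy as np

def sparsity(weights):
    """Element-wise sparsity: fraction of individual weights that are zero.
    Row-wise sparsity: fraction of kernel rows (the last axis) that are
    entirely zero, i.e. the rows a row-wise accelerator can skip outright."""
    elem = np.mean(weights == 0)
    rows_all_zero = np.all(weights == 0, axis=-1)  # collapse the column axis
    row = np.mean(rows_all_zero)
    return elem, row
```

Row-wise sparsity is necessarily no higher than element-wise sparsity, since a row is skippable only if every element in it is zero; this matches the reported pairs such as 61% element-wise versus 56% row-wise on Googlenet.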


Convolutional neural networks (CNNs) have become the dominant machine learning algorithm in various domains, such as image recognition, speech translation, and autonomous driving. CNN models are known to be computationally and memory intensive; therefore, they incur high energy consumption and latency when deployed in many situations, especially on mobile platforms. One solution is sparse neural networks, which reduce both the amount of computation and the memory requirement. By pruning the redundant connections in CNN models, a high fraction of the weights become zero. This leaves us with a great opportunity to compress data and skip ineffectual computations.
Many previous works exploit sparsity in CNNs to save time or energy. Most of them apply element-wise pruning to CNN models, which makes the kernels very irregular and hard for an accelerator to process efficiently. Our goal is therefore to keep some regularity in the sparse kernels and to construct an accelerator based on it.
In this thesis, we propose a CNN accelerator design that supports sparse row-wise kernels. The accelerator eliminates the computation of redundant kernel rows to improve performance. Our pruned models are obtained from SkimCaffe, which applies element-wise pruning. We encode the sparse kernel data in a row-wise data format, which stores the non-zero kernel rows and a filter id (FID), to reduce the off-chip memory accesses for weights while maintaining some regularity in the sparse kernels. The architecture can therefore perform 1D convolution and reuse data along the row dimension like Eyeriss, a row-stationary accelerator.
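
As a concrete reading of that format, here is a minimal encoding sketch, assuming the weights are a 4D tensor laid out as (filter, channel, row, column); the function name and the exact index fields are illustrative, not the thesis's RTL:

```python
import numpy as np

def encode_row_wise(weights):
    """Compress a pruned 4D weight tensor (filter, channel, row, col)
    into a row-wise format: keep only kernel rows that contain at least
    one non-zero weight, each tagged with its filter id (FID) and
    (channel, row) position so the corresponding psums can be located."""
    rows, index = [], []
    for fid in range(weights.shape[0]):
        for ch in range(weights.shape[1]):
            for r in range(weights.shape[2]):
                kernel_row = weights[fid, ch, r, :]
                if np.any(kernel_row != 0):   # skip empty rows entirely
                    rows.append(kernel_row)
                    index.append((fid, ch, r))
    return np.array(rows), index
```

Each surviving row remains dense inside, so a PE can stream it through a 1D convolution unchanged; only whole empty rows are skipped.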
Even though the row-wise data format maintains some level of regularity, irregularity still exists between rows. To deal with this issue, we add an accumulation module that accumulates partial sums (psums) outside the PE array, although this increases the number of SRAM accesses and affects the energy consumption. To overcome the irregularity of the compressed data flow, we allocate each output channel of the ofmap to a fixed block of the output SRAM, so that when psums are stored back to the accumulation module, the output SRAM addresses can be acquired directly from the FID.
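
With that fixed allocation, the output address is a pure function of the FID and the output position. A minimal sketch, assuming each output channel owns a contiguous block of BLOCK_SIZE words (the constant and helper name are ours):

```python
BLOCK_SIZE = 256  # assumed words of output SRAM reserved per output channel

def psum_address(fid, out_row, out_col, out_width):
    """Fixed mapping from a psum's coordinates to its output SRAM
    address: channel `fid` owns block [fid*BLOCK_SIZE, (fid+1)*BLOCK_SIZE),
    and the row-major offset inside the block comes from the output
    position.  No lookup table is needed; the FID alone picks the block."""
    offset = out_row * out_width + out_col
    assert offset < BLOCK_SIZE, "output tile must fit in its block"
    return fid * BLOCK_SIZE + offset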
Fixed memory addresses, however, cause memory contention if we process the kernel rows directly in the order of the row-wise data format. To prevent this, filter scheduling is needed: we reorder the kernel rows and add empty rows to maintain the data structure. The empty rows introduce a latency overhead, which can reach 32% on Googlenet with our default 14x12 PE array architecture.
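
The abstract does not spell out the scheduling algorithm, so the following is only a plausible greedy sketch of the idea: pack kernel rows into issue slots of `slot_width` entries such that no two rows in a slot share an FID (and would thus collide on the same output block), padding with explicit empty rows when no compatible row remains:

```python
def schedule_rows(index, slot_width=4):
    """Greedy filter scheduling over the row-wise index produced above.
    Each slot holds up to `slot_width` rows with pairwise-distinct FIDs,
    so their psums land in different output SRAM blocks without
    contention.  `None` entries are the padded empty rows that cost
    latency but keep the data structure regular."""
    pending = list(index)  # (fid, ch, r) tuples in original order
    slots = []
    while pending:
        slot, used_fids = [], set()
        for entry in list(pending):
            if entry[0] not in used_fids and len(slot) < slot_width:
                slot.append(entry)
                used_fids.add(entry[0])
                pending.remove(entry)
        slot += [None] * (slot_width - len(slot))  # pad with empty rows
        slots.append(slot)
    return slots
```

The fraction of `None` padding in the schedule is exactly the latency overhead quoted above (up to 32% on Googlenet for the 14x12 array).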
We compare our results with Eyeriss on three CNN models: Googlenet, Resnet-50, and Mobilenet v1. Their element-wise sparsities are 61%, 47%, and 50%, and their row-wise sparsities are 56%, 44%, and 50%. The speedups in latency reach 2.0x, 1.6x, and 1.8x, respectively. The energy consumption, at 1.35x, 1.34x, and 2.16x that of Eyeriss, increases because of the additional SRAM accesses in the accumulation module. The architecture still shows a benefit in terms of energy-delay product (EDP), which is 67% and 84% of Eyeriss on Googlenet and Resnet-50, but 118% on Mobilenet v1. This is because 1x1 layers are not well suited to the row-wise data format: even though the accelerator saves cycles in 1x1 layers, the speedup is offset by the energy consumption. We synthesize our 168-PE architecture, which does not include the controller, with Design Compiler and a TSMC 130 nm library. The area is 14 mm², compared with Eyeriss's 12.25 mm² in TSMC 65 nm.
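
As a quick consistency check (our arithmetic, not a table from the thesis; the 1.35x/1.34x/2.16x figures are read as energy ratios, and the small gap at Mobilenet v1 comes from rounding in the quoted numbers), the relative EDP is simply the energy ratio divided by the speedup:

\[
\frac{\mathrm{EDP}}{\mathrm{EDP}_{\mathrm{Eyeriss}}}
  = \frac{E}{E_{\mathrm{Eyeriss}}} \cdot \frac{T}{T_{\mathrm{Eyeriss}}}
  = \frac{\text{energy ratio}}{\text{speedup}}
\]
\[
\text{Googlenet: } \tfrac{1.35}{2.0} = 0.675 \approx 67\%, \qquad
\text{Resnet-50: } \tfrac{1.34}{1.6} \approx 84\%, \qquad
\text{Mobilenet v1: } \tfrac{2.16}{1.8} = 1.20 \approx 118\%
\]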

1 Introduction
  1.1 Motivation
  1.2 Contribution
2 Previous Works
3 Row-wise Compression
  3.1 Weight Sparsity Analysis
  3.2 Row-wise Data Format
4 Proposed Architecture
  4.1 Architecture
    4.1.1 PE Structure
    4.1.2 Accumulation Module
    4.1.3 SRAM
  4.2 Filter Scheduling
  4.3 Data Flow
5 Experiment Result
  5.1 Experiment Setup
  5.2 Experimental Result
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, 2015.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[3] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, Y. Gong, and A. Acero, “Recent advances in deep learning for speech research at Microsoft,” May 2013.
[4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “DeepDriving: Learning affordance for direct perception in autonomous driving,” Dec 2015, pp. 2722–2730.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12. USA: Curran Associates Inc., 2012, pp. 1097–1105.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
[8] M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression,” arXiv e-prints, p. arXiv:1710.01878, Oct. 2017.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
[10] J. Park, S. R. Li, W. Wen, H. Li, Y. Chen, and P. Dubey, “Holistic SparseCNN: Forging the trident of accuracy, speed, and size,” CoRR, vol. abs/1608.01409, 2016.
[11] S. R. Li, J. Park, and P. T. P. Tang, “Enabling sparse Winograd convolution by native pruning,” CoRR, vol. abs/1702.08597, 2017.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[13] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally, “Exploring the regularity of sparse structure in convolutional neural networks,” arXiv e-prints, May 2017.
[14] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2014, pp. 609–622.
[15] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan 2017.
[16] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” CoRR, vol. abs/1506.02626, 2015.
[17] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” CoRR, vol. abs/1510.00149, 2015.
[18] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” CoRR, vol. abs/1602.01528, 2016.
[19] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2016, pp. 1–12.
[20] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 1–13.
[21] D. Kim, J. Ahn, and S. Yoo, “ZeNA: Zero-aware neural network accelerator,” IEEE Design & Test, vol. 35, no. 1, pp. 39–46, Feb 2018.
[22] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), June 2017, pp. 27–40.
