Graduate Student: 張嘉宏 Chang, Chia-Hung
Thesis Title: 基於稀疏行式核心之卷積神經網路加速器設計 / Design of an Inference Accelerator for CNN with Sparse Row-wise Kernel
Advisor: 黃稚存 Huang, Chih-Tsun
Committee Members: 劉靖家 Liou, Jing-Jia; 呂仁碩 Liu, Ren-Shuo
Degree: Master (碩士)
Department:
Year of Publication: 2019
Graduation Academic Year: 107
Language: English
Number of Pages: 41
Keywords (Chinese): 捲積神經網路、稀疏神經網路、加速器、壓縮、行式核心
Keywords (English): Convolutional Neural Network, Sparse Neural Network, Accelerator, Compression, Row-wise Kernel
Abstract (Chinese):
Convolutional neural networks have become the mainstream of machine learning algorithms in recent years and are applied in many domains, such as image recognition, translation, and autonomous driving. However, CNNs are computation- and memory-intensive models, so deploying them on certain platforms, especially mobile platforms, incurs high energy consumption and latency. Sparse neural networks are one solution that can reduce the amount of computation and the memory requirement. By pruning the redundant connections in a CNN model, a large fraction of the weights becomes zero, which gives us a good opportunity to compress the data and skip unnecessary computations.
Many previous studies also exploit sparsity to reduce time and energy, but they usually adopt element-wise pruning, which makes the weight distribution irregular and the accelerator difficult to design. Our goal is therefore to design an accelerator around a more regular form of sparsity.
In this thesis, we propose a CNN accelerator based on row-wise kernels, which eliminates the computation of empty kernel rows to improve performance. We compress the sparse kernels into a row-wise data format that stores the non-zero kernel rows together with their corresponding indices; this reduces the number of off-chip memory accesses and preserves the regularity of the kernels. Irregularity still exists between rows, however. To address it, we add an accumulation module to the accelerator, at the cost of more SRAM accesses and thus higher energy consumption. In the SRAM, we fix the location of each output result, so the actual memory address can be obtained from the kernel index. Because the memory addresses are fixed, processing the data directly in its original order would cause memory contention; to avoid this, we need to schedule the kernels. The scheduling effectively avoids memory contention but introduces a latency overhead.
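To make the row-wise format concrete, the following is a minimal software sketch of packing a sparse kernel into its non-zero rows plus an index recording where each row came from. The array shape, the function name encode_row_wise, and the use of NumPy are illustrative assumptions, not the actual hardware data layout.

```python
import numpy as np

def encode_row_wise(kernels):
    """Pack sparse kernels into a row-wise format (illustrative sketch).

    kernels: array of shape (F, C, R, S): F filters, C input channels,
             R x S spatial kernel. A kernel row is the S-wide vector at
             a given (filter, channel, row) position.
    Returns (rows, index): the non-zero rows and, for each kept row, the
    (filter, channel, row) triple it came from, so the accelerator can
    recover the output location from the filter id.
    """
    F, C, R, S = kernels.shape
    rows, index = [], []
    for f in range(F):
        for c in range(C):
            for r in range(R):
                row = kernels[f, c, r, :]
                if np.any(row != 0):          # keep only non-empty rows
                    rows.append(row.copy())
                    index.append((f, c, r))   # filter id + position metadata
    return np.array(rows), index

# Toy example: 2 filters, 1 channel, 3x3 kernels with mostly empty rows.
k = np.zeros((2, 1, 3, 3))
k[0, 0, 0] = [1, 0, 2]
k[1, 0, 2] = [0, 3, 0]
rows, index = encode_row_wise(k)
print(rows.shape, index)   # (2, 3) [(0, 0, 0), (1, 0, 2)]
```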
We compare our design with Eyeriss on three models: Googlenet, Resnet-50, and MobilenetV1. Our design achieves speedups in latency of 2x, 1.6x, and 1.8x, respectively, while the energy consumption is 1.35x, 1.34x, and 2.16x that of Eyeriss. Although the energy consumption is higher than Eyeriss in every case, the energy-delay products are 67%, 84%, and 118%, respectively, which shows that our design is beneficial on Googlenet and Resnet, whereas on Mobilenet the speedup is offset by the energy consumption caused by the 1x1 layers. We synthesized our 168-PE design and obtained an area of 14 mm², compared with 12.25 mm² for Eyeriss.
Abstract (English):
Convolutional neural networks (CNNs) have become the dominant class of machine learning algorithms in various domains, such as image recognition, speech translation, and autonomous driving. CNN models are known to be computationally and memory intensive; therefore, they incur high energy consumption and latency when deployed in many situations, especially on mobile platforms. One solution is sparse neural networks, which reduce the amount of computation and the memory requirement. By pruning the redundant connections in a CNN model, a high fraction of the weights become zero. This leaves us with a great opportunity to compress data and skip ineffectual computations.
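As a concrete illustration of this point, the sketch below applies simple magnitude pruning to a hypothetical convolutional layer and counts how many multiply-accumulates become skippable. The threshold, layer shape, and function name are made up for illustration and are not the pruning procedure used in this thesis.

```python
import numpy as np

def magnitude_prune(weights, threshold):
    """Zero out weights whose magnitude falls below a threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 3, 3, 3))        # hypothetical conv layer weights
w_sparse = magnitude_prune(w, threshold=1.0)

sparsity = np.mean(w_sparse == 0)
print(f"element-wise sparsity: {sparsity:.0%}")

# Each zero weight is a multiply-accumulate the hardware can skip, so in the
# ideal case computation scales with the non-zero count rather than the total.
total_macs_per_output = w.size
effectual_macs_per_output = np.count_nonzero(w_sparse)
print(total_macs_per_output, effectual_macs_per_output)
```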
Many previous works exploit sparsity in CNNs to save time or energy. Most of them apply element-wise pruning to CNN models, which makes the kernels very irregular and the accelerator hard to design. Our goal is therefore to keep some regularity in the sparse kernels and to construct an accelerator based on it.
In this thesis, we propose a CNN accelerator design that supports sparse row-wise kernels. The accelerator eliminates the computation of redundant kernel rows to improve performance. Our pruned models are obtained from SkimCaffe, which applies element-wise pruning. We encode the sparse kernel data in a row-wise data format, which stores the non-zero kernel rows and their filter IDs (FIDs), to reduce off-chip memory accesses for weights while maintaining some regularity of the sparse kernels. The architecture can therefore perform 1D convolution and reuse data along the row dimension, like Eyeriss, a row-stationary accelerator.
Even though the row-wise data format maintains some level of regularity, irregularity still exists between rows. To deal with this issue, we add an accumulation module that accumulates partial sums (psums) outside the PE array; however, this increases the number of SRAM accesses and affects power consumption. We allocate each output channel of the ofmap to a fixed block of the output SRAM to overcome the irregularity of the compressed data flow, so that when psums are stored back to the accumulation module, the output SRAM addresses can be obtained from the FIDs. Fixed memory addresses, however, cause memory contention if kernel rows are processed directly in the order of the row-wise data format. To prevent this, filter scheduling is needed: we reschedule the order of kernel rows and add empty rows to maintain the data structure. The empty rows introduce a latency overhead, which can reach 32% for Googlenet on our default 14x12 PE array architecture.
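The following is a greatly simplified software model of the filter-scheduling idea described above: rows issued in the same step are assumed to need distinct FIDs (distinct output SRAM blocks), conflicting rows are deferred, and empty slots are inserted when no conflict-free row is available. The function name, the width parameter, and the toy FID sequence are illustrative assumptions, not the accelerator's actual scheduler.

```python
import math
from collections import deque

def schedule_rows(fids, width):
    """Greedy sketch of filter scheduling (illustrative, not the real scheduler).

    fids:  filter id (FID) of each non-zero kernel row, in row-wise order.
    width: rows issued per step (e.g. PE columns writing psums concurrently).
    Rows issued in the same step must target different output SRAM blocks,
    i.e. carry different FIDs; empty slots (None) are inserted when no
    conflict-free row is available, which is the latency overhead.
    """
    pending = deque(fids)
    steps = []
    while pending:
        step, used, deferred = [], set(), []
        while pending and len(step) < width:
            fid = pending.popleft()
            if fid in used:
                deferred.append(fid)          # conflict: postpone this row
            else:
                used.add(fid)
                step.append(fid)
        step += [None] * (width - len(step))  # pad with empty rows
        pending.extendleft(reversed(deferred))
        steps.append(step)
    return steps

fids = [0, 0, 0, 0, 1, 2]                    # toy FIDs, skewed toward filter 0
width = 2
sched = schedule_rows(fids, width)
ideal = math.ceil(len(fids) / width)         # steps if there were no conflicts
print(sched)                                 # [[0, 1], [0, 2], [0, None], [0, None]]
print(f"latency overhead: {len(sched) / ideal - 1:.0%}")   # 33%
```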
We compare our results with Eyeriss on three CNN models: Googlenet, Resnet-50, and Mobilenet v1. Their element-wise sparsities are 61%, 47%, and 50%, and their row-wise sparsities are 56%, 44%, and 50%. The speedups in latency reach 2.0x, 1.6x, and 1.8x, respectively. The power consumptions, which are 1.35x, 1.34x, and 2.16x those of Eyeriss, increase due to the additional SRAM accesses in the accumulation modules. Nevertheless, the architecture still shows a benefit in terms of energy-delay product (EDP), which is 67% and 84% of Eyeriss for Googlenet and Resnet-50, but 118% for Mobilenet v1. This is because 1x1 layers are not well suited to the row-wise data format: even though the accelerator can save cycles in 1x1 layers, the speedup is offset by the power consumption. We synthesize our 168-PE architecture, which does not include the controller, with Design Compiler and a TSMC 130 nm library. The resulting area is 14 mm², compared with 12.25 mm² for Eyeriss in TSMC 65 nm.
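For reference, the reported EDP ratios follow from the quoted energy and speedup figures if the relative EDP is taken as the relative energy divided by the speedup (i.e. delay ratio = 1 / speedup); the small discrepancy for Mobilenet v1 (120% vs. the reported 118%) presumably comes from rounding in the quoted factors.

```python
# Relative EDP vs. Eyeriss, taking EDP_ratio = energy_ratio * delay_ratio
# with delay_ratio = 1 / speedup (all factors as quoted above).
models  = ["Googlenet", "Resnet-50", "Mobilenet v1"]
speedup = [2.0, 1.6, 1.8]
energy  = [1.35, 1.34, 2.16]

for m, s, e in zip(models, speedup, energy):
    print(f"{m}: EDP ratio = {e / s:.0%}")
# -> 68%, 84%, 120% (the thesis reports 67%, 84%, and 118%)
```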