Graduate Student: 陳俊辰 (Chen, Chun-Chen)
Thesis Title: 深度卷積神經網路加速器之設計探索方法 (Design Exploration Methodology for Deep Convolutional Neural Network Accelerator)
Advisor: 黃稚存 (Huang, Chih-Tsun)
Committee Members: 劉靖家 (Liou, Jing-Jia); 謝明得 (Shieh, Ming-Der)
Degree: Master
Department:
Year of Publication: 2018
Graduation Academic Year: 107
Language: English
Number of Pages: 66
Chinese Keywords: 卷積神經網路 (convolutional neural network), 加速器 (accelerator), 架構 (architecture)
Keywords: CNN, Accelerator, Architecture
Deep convolutional neural networks have played a key role in recent artificial intelligence applications. In recent years, more and more deep-learning inference accelerators have been proposed to cope with their enormous computational complexity. By exploiting the high degree of parallelism in dedicated DCNN accelerators, real-time AI inference becomes possible. However, the massive amount of computation is accompanied by a massive data requirement. Because the data volume of a convolutional layer is too large, storing all of it on the accelerator is inefficient. To carry out the convolution efficiently on an accelerator, the computation of an entire convolutional layer must be partitioned. The way a convolutional layer is partitioned and scheduled is called the dataflow. A complicated dataflow with a huge bandwidth requirement places a heavy burden on improving the architectural design.
To design an efficient DCNN accelerator, the dataflow and the hardware architecture must be considered simultaneously. In this thesis, we propose a fast and accurate analytical model of computation cycles based on the regularity of convolution. Using our model, the performance of an energy-efficient inference accelerator can be estimated quickly with less than 0.63% error. To adapt the inference accelerator to different DCNN models, a parameter exploration flow is proposed to search for a better workload arrangement.
With our exploration flow, the performance bottleneck of an accelerator can be identified easily. Based on the evaluation results, we propose an improved accelerator that achieves performance improvements of 1.34x on ResNet-50 and 2.39x on MobileNet-V2 compared with the existing Eyeriss [1] architecture.
Finally, an accelerator with 2016 processing elements is used as an example to show that our methodology enables effective architecture exploration and specification. Given the number of processing elements and the memory capacity, the design architecture can be refined step by step, reaching 1849.89 MACOPS (multiply-and-accumulate operations per cycle) on ResNet-50. Such a high computational efficiency (91.8%) demonstrates that the proposed exploration methodology is effective.
Deep convolutional neural networks (DCNNs) have played a key role in modern artificial intelligence (AI) applications. Recently, more and more inference accelerators have been proposed to cope with the enormous computational complexity of DCNNs. Dedicated accelerators make real-time inference possible by exploiting a high degree of computational parallelism. However, the huge computational complexity comes with a huge data requirement. Storing the data of an entire convolutional layer in on-chip storage would incur a prohibitive memory cost. As a result, the whole-layer computation has to be partitioned into small pieces so that the convolution can be processed efficiently. The way the computation is partitioned and scheduled is called the dataflow. A complicated dataflow with a massive bandwidth requirement places a heavy burden on architectural design and optimization.
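To make the notion of partitioning concrete, the following is a minimal Python sketch of a tiled direct convolution: only the input patch and weights needed for the current output tile would have to be kept on chip, and the order of the tile loops is one deliberately simplistic example of a dataflow schedule. The tile sizes and loop order are illustrative assumptions, not the dataflow studied in this thesis.

```python
import numpy as np

def conv_layer_tiled(ifmap, weights, tile_h=16, tile_w=16):
    """Direct convolution (stride 1, no padding), one output tile at a time.

    ifmap:   (C, H, W)    input feature map
    weights: (M, C, R, S) M filters of size C x R x S
    """
    C, H, W = ifmap.shape
    M, _, R, S = weights.shape
    OH, OW = H - R + 1, W - S + 1
    ofmap = np.zeros((M, OH, OW), dtype=ifmap.dtype)

    for oh0 in range(0, OH, tile_h):          # tile over output rows
        for ow0 in range(0, OW, tile_w):      # tile over output columns
            oh1, ow1 = min(oh0 + tile_h, OH), min(ow0 + tile_w, OW)
            # Input patch covering this output tile: this is the working
            # set that would need to be resident in on-chip storage.
            patch = ifmap[:, oh0:oh1 + R - 1, ow0:ow1 + S - 1]
            for m in range(M):
                for r in range(R):
                    for s in range(S):
                        ofmap[m, oh0:oh1, ow0:ow1] += np.sum(
                            patch[:, r:r + oh1 - oh0, s:s + ow1 - ow0]
                            * weights[m, :, r, s][:, None, None],
                            axis=0)
    return ofmap
```

How the tile loops are ordered and which operands stay stationary on chip is exactly what distinguishes one dataflow from another.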
To design an efficient DCNN accelerator, the dataflow and the hardware architecture should be considered simultaneously. In this thesis, we propose an analytical model for fast and accurate latency estimation in the early design phase, based on the regularity of convolution. Using our model, the performance of an energy-efficient inference architecture can be estimated rapidly with less than 0.63% error. To adapt the inference architecture to different DCNN models, a parameter exploration flow is proposed to search for an optimized workload arrangement.
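The thesis' actual analytical model is not reproduced in this abstract; purely as a sketch of the general idea (with hypothetical function names and a deliberately simplified, roofline-style cost model in the spirit of [13]), a per-layer cycle estimate can be taken as the larger of the compute-bound and bandwidth-bound cycle counts, and a workload exploration then scores candidate arrangements against it:

```python
def estimate_layer_cycles(macs, dram_bytes, num_pes, bytes_per_cycle,
                          pe_utilization=1.0):
    """Simplified estimate: a layer is limited either by MAC throughput or
    by off-chip bandwidth (assumes compute and data transfer fully overlap)."""
    compute_cycles = macs / (num_pes * pe_utilization)
    memory_cycles = dram_bytes / bytes_per_cycle
    return max(compute_cycles, memory_cycles)


def explore_workloads(layer_macs, candidates, num_pes, bytes_per_cycle):
    """Return the candidate workload arrangement with the lowest estimated
    latency; each candidate carries the DRAM traffic and PE utilization
    implied by its tiling (values that a dataflow analysis would supply)."""
    return min(
        ((estimate_layer_cycles(layer_macs, c["dram_bytes"], num_pes,
                                bytes_per_cycle, c["utilization"]), c)
         for c in candidates),
        key=lambda x: x[0])
```

A real model would additionally have to account for the PE-array mapping, buffer capacities, and the per-tile fill and drain overheads that this sketch ignores.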
With our exploration flow, the performance bottleneck of the target DCNN accelerator
can be easily identified. Based on the evaluation results, we propose an improved accelerator architecture, which achieves speedups of 1.34x and 2.39x on ResNet-50 [2] and MobileNet-V2 [3], respectively, as compared with the existing Eyeriss [1] architecture.
Finally, an accelerator design with 2016 processing elements is studied to demonstrate in detail the effective architecture exploration and specification enabled by our approach. Given the initial constraints on the number of processing elements and the size of the memories, the design architecture can be optimized iteratively, achieving 1,849.89 MACOPS (multiply-and-accumulate operations) per cycle on ResNet-50. The resulting high utilization (91.8%) demonstrates the effectiveness of the proposed exploration flow.
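As a quick consistency check on the quoted figures (using nothing beyond the numbers stated above), the reported throughput and utilization are related by throughput ≈ number of PEs × utilization:

```python
# 2016 processing elements sustaining 1,849.89 MAC operations per cycle
# corresponds to 1849.89 / 2016 ≈ 0.918, i.e. the 91.8% utilization above.
num_pes = 2016
macs_per_cycle = 1849.89
print(f"utilization = {macs_per_cycle / num_pes:.1%}")  # -> utilization = 91.8%
```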
[1] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable
Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan 2017.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June
2016, pp. 770–778.
[3] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto,
and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision
applications,” CoRR, vol. abs/1704.04861, 2017.
[4] H. Kwon, M. Pellauer, and T. Krishna, “Maestro: An open-source infrastructure for
modeling dataflows within deep learning accelerators,” 2018.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional
neural networks,” in Advances in Neural Information Processing Systems 25,
F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates,
Inc., 2012, pp. 1097–1105.
[6] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and
the impact of residual connections on learning,” 2016.
[7] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks
with low rank expansions,” CoRR, vol. abs/1405.3866, 2014.
[8] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14539
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with
deep convolutional neural networks,” in Proceedings of the 25th International
Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12.
USA: Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257
[10] Y. Lin and T. S. Chang, “Data and hardware efficient design for convolutional neural
network,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 5,
pp. 1642–1651, May 2018.
[11] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted
residuals and linear bottlenecks,” 2018.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
[13] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual
performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp.
65–76, Apr. 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498785
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich, “Going deeper with convolutions,” 2014.
[15] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek,
and K. Asanović, “Chisel: Constructing hardware in a scala embedded language,”
in Proceedings of the 49th Annual Design Automation Conference, ser. DAC
’12. New York, NY, USA: ACM, 2012, pp. 1216–1225. [Online]. Available:
http://doi.acm.org/10.1145/2228360.2228584
[16] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss v2: A flexible and high-performance accelerator
for emerging deep neural networks,” 2018.
[17] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer,
S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse
convolutional neural networks,” in Proceedings of the 44th Annual International
Symposium on Computer Architecture, ser. ISCA ’17. New York, NY, USA: ACM,
2017, pp. 27–40. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080254
[18] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia,
N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley,
M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann,
C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski,
A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law,
D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony,
K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn,
G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian,
H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon,
“In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE
44th Annual International Symposium on Computer Architecture (ISCA), June 2017,
pp. 1–12.
[19] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun,
and O. Temam, “Dadiannao: A machine-learning supercomputer,” in 2014 47th Annual
IEEE/ACM International Symposium on Microarchitecture, Dec 2014, pp. 609–622.
[20] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shidiannao: Shifting vision processing closer to the sensor,” in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015, pp. 92–104.
[21] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie:
Efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE
43rd Annual International Symposium on Computer Architecture (ISCA), June 2016,
pp. 243–254.
[22] D. Kim, J. Ahn, and S. Yoo, “A novel zero weight/activation-aware hardware architecture
of convolutional neural network,” in Design, Automation Test in Europe Conference
Exhibition (DATE), 2017, March 2017, pp. 1462–1467.
[23] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen,
“Cambricon-x: An accelerator for sparse neural networks,” in 2016 49th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2016, pp.
1–12.
[24] H. Kwon, A. Samajdar, and T. Krishna, “Maeri: Enabling flexible dataflow
mapping over dnn accelerators via reconfigurable interconnects,” in Proceedings of
the Twenty-Third International Conference on Architectural Support for Programming
Languages and Operating Systems, ser. ASPLOS ’18. New York, NY, USA: ACM,
2018, pp. 461–475. [Online]. Available: http://doi.acm.org/10.1145/3173162.3173176
[25] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5 Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI,” in 2017 IEEE International Solid-State Circuits Conference (ISSCC), Feb 2017, pp. 246–247.
[26] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song,
Y. Wang, and H. Yang, “Going deeper with embedded fpga platform for convolutional
neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA ’16. New York, NY, USA: ACM, 2016,
pp. 26–35. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847265
[27] S. Wang, D. Zhou, X. Han, and T. Yoshimura, “Chain-nn: An energy-efficient 1d chain
architecture for accelerating deep convolutional neural networks,” in Design, Automation
Test in Europe Conference Exhibition (DATE), 2017, March 2017, pp. 1032–1037.