| Student: | Wu, Yueh-Chi (吳岳騏) |
| --- | --- |
| Thesis Title: | Soft Error Analysis for CNN Inference Accelerator with Efficient Dynamic Fixed Point Quantization (卷積神經網路定點數量化技術與加速器之軟性錯誤分析) |
| Advisor: | Huang, Chih-Tsun (黃稚存) |
| Committee Members: | Liou, Jing-Jia (劉靖家); Shieh, Ming-Der (謝明得) |
| Degree: | Master |
| Department: | |
| Year of Publication: | 2018 |
| Graduation Academic Year: | 107 |
| Language: | English |
| Number of Pages: | 80 |
| Keywords (Chinese): | Convolutional Neural Network, Accelerator, Architecture, Quantization, Soft Error |
| Keywords (English): | CNN, Accelerator, Architecture, Quantization, Soft Error |
This work analyzes the impact of soft errors on deep convolutional neural network (CNN) accelerators and evaluates the effectiveness of different memory protection techniques against soft errors. In addition, the proposed dynamic fixed point quantization format reduces the memory requirement of the accelerator and greatly improves its soft error tolerance.
To model soft errors in the accelerator and to test different memory protection techniques on it, we implement a C++-based, dataflow-accurate simulated accelerator architecture in Caffe, which provides fast soft error analysis prototyping for recent deep CNNs.
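As a rough illustration only (not the thesis's actual Caffe-based simulator), the sketch below shows the kind of convolution loop nest a dataflow-accurate model walks through, where every MAC and every buffer access is visible to the software; all names, shapes, and tiling choices here are hypothetical.

```cpp
// Minimal sketch (hypothetical, not the thesis's simulator): a dataflow view of
// one convolution layer; exposing each MAC lets fault injection and
// quantization be applied per buffer access.
#include <cstdint>
#include <vector>

// Shapes are illustrative; stride 1, no padding, square kernel.
void conv_layer(const std::vector<int8_t>& ifmap,   // [C][H][W]
                const std::vector<int8_t>& weight,  // [M][C][K][K]
                std::vector<int32_t>& psum,         // [M][H-K+1][W-K+1]
                int C, int H, int W, int M, int K) {
  const int OH = H - K + 1, OW = W - K + 1;
  for (int m = 0; m < M; ++m)                 // output channels
    for (int oy = 0; oy < OH; ++oy)           // output rows
      for (int ox = 0; ox < OW; ++ox) {       // output columns
        int32_t acc = 0;                      // partial-sum register
        for (int c = 0; c < C; ++c)           // input channels
          for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
              acc += int32_t(ifmap[(c * H + oy + ky) * W + ox + kx]) *
                     int32_t(weight[((m * C + c) * K + ky) * K + kx]);
        psum[(m * OH + oy) * OW + ox] = acc;
      }
}
```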
The proposed dynamic fixed point quantization format effectively reduces the memory space occupied by partial sums in the accelerator, shrinking the required data bitwidth with only a slight loss in classification accuracy. The complete CNN quantization flow implemented in Caffe also provides a simple interface to run the inference and training flows, so that the performance of the quantized network can be evaluated.
To trade off hardware design complexity against classification accuracy, the proposed dynamic fixed point quantization format supports multiple configurations that serve as practical design references. The fine-tuning details of the quantized computation and the accuracy comparisons are fully discussed in this thesis, providing implementation choices for different target accuracies.
According to our experimental results, the proposed dynamic fixed point quantization flow is simpler and more efficient than TensorFlow's, and is superior in both memory requirement and quantization overhead. Finally, compared with conventional fixed point and floating point formats, the proposed dynamic fixed point format reduces soft error sensitivity by more than 99%.
As applications of convolutional neural networks (CNNs) grow rapidly, the demand for CNN accelerators keeps increasing. While CNNs are fault tolerant by nature, critical applications that require a high degree of reliability may not tolerate the accuracy drop caused by soft errors. Such applications, for example self-driving cars and medical evaluation services, are easily exposed to soft error sources such as cosmic rays and radiation, and false classification results produced by a soft-error-vulnerable CNN accelerator may lead to catastrophic consequences.
In order to model soft errors on the CNN accelerator and to further develop soft error mitigation techniques, we implemented a C++ dataflow-accurate simulated convolution accelerator in Caffe and built a soft error model on the internal buffers of the simulated accelerator. By running modern CNNs on the simulated accelerator with soft error injection, we are able to analyze the fault-sensitive regions in the accelerator and apply mitigation techniques to them.
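A minimal sketch of how soft errors could be injected into an on-chip buffer modeled as a word array; the function name, 16-bit word width, and bit-error-rate parameter are assumptions for illustration, not the thesis's actual fault model.

```cpp
// Minimal sketch (hypothetical names): single-event-upset injection into a
// buffer modeled as a word array. Each bit is flipped independently with
// probability `ber` before the data is consumed by the MAC array.
#include <cstdint>
#include <random>
#include <vector>

void inject_soft_errors(std::vector<uint16_t>& buffer, double ber,
                        std::mt19937& rng) {
  std::bernoulli_distribution flip(ber);
  for (auto& word : buffer)
    for (int bit = 0; bit < 16; ++bit)
      if (flip(rng))
        word ^= (uint16_t(1) << bit);   // single bit flip
}
```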
Due to limited hardware resources, a CNN accelerator must trade off classification accuracy against design complexity. One common approach to reducing the hardware overhead is data quantization. By quantizing the accelerator's input data to a lower bitwidth, the accelerator can use smaller multipliers and greatly reduce power consumption. Because the quantized data is smaller, the data density in the accelerator's buffers becomes higher, and thus higher efficiency can be achieved.
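For illustration, a minimal sketch of uniform fixed point quantization under the usual scale-round-saturate scheme; `quantize_fixed`, its parameters, and the round-to-nearest choice are hypothetical, not the thesis's exact flow.

```cpp
// Minimal sketch (illustrative): quantize a floating point tensor to `bw`-bit
// signed fixed point with `fl` fractional bits, i.e. value ≈ q * 2^(-fl).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int32_t> quantize_fixed(const std::vector<float>& x,
                                    int bw, int fl) {
  const int32_t qmax = (int32_t(1) << (bw - 1)) - 1;   // e.g. 127 for bw = 8
  const int32_t qmin = -(int32_t(1) << (bw - 1));      // e.g. -128
  std::vector<int32_t> q(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    int32_t v = int32_t(std::lround(x[i] * std::ldexp(1.0f, fl))); // scale by 2^fl, round
    q[i] = std::min(qmax, std::max(qmin, v));                      // saturate to bw bits
  }
  return q;
}
```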
To improve the soft error tolerance of the accelerator and increase its computation efficiency, the proposed quantized dynamic fixed point format is presented and implemented in the simulated convolution accelerator. Because the quantized format is optimized to fit the dynamic range of the data, the accuracy drop caused by an excessive soft error rate is greatly reduced. Furthermore, the proposed quantization format reduces the partial-sum bitwidth with only minor accuracy loss, which further lowers the memory overhead compared with state-of-the-art approaches. Moreover, the full quantization flow is implemented in Caffe, providing a simple interface to run quantized training and inference and to evaluate the performance of the quantized CNN.
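A minimal sketch, under common dynamic fixed point assumptions, of how a per-tensor fractional length could be chosen from the observed dynamic range; the helper name and the exact range rule are illustrative, not the thesis's algorithm. The returned `fl` would then drive a quantizer like the one sketched above.

```cpp
// Minimal sketch (assumptions, not the thesis's exact method): dynamic fixed
// point picks the fractional length per layer/tensor so that the largest
// magnitude just fits in the integer part of a `bw`-bit word; partial sums can
// then use a narrower accumulator than plain fixed point would require.
#include <algorithm>
#include <cmath>
#include <vector>

int choose_fractional_length(const std::vector<float>& x, int bw) {
  float max_abs = 0.0f;
  for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
  if (max_abs == 0.0f) return bw - 1;                     // all zeros: any fl works
  int int_bits = int(std::floor(std::log2(max_abs))) + 1; // bits left of the binary point
  return (bw - 1) - int_bits;  // may exceed bw-1 for small-magnitude tensors,
                               // or go negative for very large ranges
}
```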
To trade off design complexity against accuracy, the proposed quantization format is also highly configurable. A detailed analysis of the hardware design is discussed in this thesis, including the quantization flow for the accelerator, a comparison of saturation modes, rounding modes for the MAC operation, and the bitwidth settings of the quantized format. Designers can choose design options based on the accuracy demand and hardware constraints, which makes the proposed quantization method flexible for various applications.
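As a hedged example of the two configuration knobs mentioned above, the sketch below requantizes a wide MAC accumulator into a narrow partial-sum word under selectable rounding and overflow handling; the enum names and function signature are invented for illustration and do not mirror the thesis's configuration interface.

```cpp
// Minimal sketch (hypothetical option names): rounding and saturation options
// applied when a wide MAC result is written back to a narrow `bw`-bit word.
#include <cstdint>

enum class Rounding { Truncate, Nearest };
enum class Overflow { Wrap, Saturate };

int32_t requantize(int64_t acc, int shift, int bw,
                   Rounding rmode, Overflow omode) {
  if (rmode == Rounding::Nearest && shift > 0)
    acc += (int64_t(1) << (shift - 1));        // add half an LSB before shifting
  acc >>= shift;                               // drop discarded fraction bits (arithmetic shift assumed)

  const int64_t qmax = (int64_t(1) << (bw - 1)) - 1;
  const int64_t qmin = -(int64_t(1) << (bw - 1));
  if (omode == Overflow::Saturate) {
    if (acc > qmax) acc = qmax;
    if (acc < qmin) acc = qmin;
  } else {                                     // Wrap: keep only the low bw bits
    uint64_t mask = (uint64_t(1) << bw) - 1;
    uint64_t low  = uint64_t(acc) & mask;
    if (low & (uint64_t(1) << (bw - 1))) low |= ~mask;  // sign-extend
    acc = int64_t(low);
  }
  return int32_t(acc);
}
```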
To further reduce the impact of soft errors on classification accuracy, several memory protection techniques are tested on the simulated convolution accelerator. By combining the proposed quantization techniques with a simple ECC buffer implementation, the accuracy drop caused by soft errors can be kept within an acceptable range by tuning the quantization and ECC configurations.
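Purely to illustrate the idea behind an ECC-protected buffer (practical buffers typically use wider SEC-DED codes such as Hamming(72,64) rather than this toy code), a minimal Hamming(7,4) encoder/decoder that corrects any single bit flip in a 4-bit nibble; the function names are hypothetical.

```cpp
// Minimal sketch (illustrative only, far simpler than a real SEC-DED buffer):
// Hamming(7,4) single-error correction over a 4-bit nibble.
#include <cstdint>

// Encode 4 data bits (bits 0..3 of `data`) into a 7-bit codeword,
// where bit i of the result is Hamming position i+1.
uint8_t hamming74_encode(uint8_t data) {
  uint8_t d0 = data & 1, d1 = (data >> 1) & 1, d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
  uint8_t p1 = d0 ^ d1 ^ d3;       // covers positions 1,3,5,7
  uint8_t p2 = d0 ^ d2 ^ d3;       // covers positions 2,3,6,7
  uint8_t p3 = d1 ^ d2 ^ d3;       // covers positions 4,5,6,7
  // positions:      1     2          3          4          5          6          7
  return uint8_t(p1 | (p2 << 1) | (d0 << 2) | (p3 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6));
}

// Decode a 7-bit codeword, correcting at most one flipped bit.
uint8_t hamming74_decode(uint8_t cw) {
  auto bit = [&](int pos) { return (cw >> (pos - 1)) & 1; };   // 1-based position
  int s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
  int s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
  int s3 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
  int err = s1 | (s2 << 1) | (s3 << 2);        // 0 = no error, else 1-based error position
  if (err != 0) cw ^= uint8_t(1 << (err - 1)); // correct the single bit flip
  return uint8_t(bit(3) | (bit(5) << 1) | (bit(6) << 2) | (bit(7) << 3));
}
```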
From the test results, the proposed quantization flow proves to be more efficient than TensorFlow's quantization implementation in both memory requirement and quantization overhead. By fine-tuning the quantized CNN with the bitwidths of the input feature maps and weights set to 8 and the partial sums set to 16, the Top-1 accuracy reaches 68.65% for MobileNetV1 (pre-trained model: 69.85%) and 70.35% for MobileNetV2 (pre-trained model: 71.39%) under their best quantization configurations. Finally, the proposed quantization technique dramatically reduces soft error sensitivity, by over 99% compared with the conventional floating point and fixed point data types.