| Field | Value |
|---|---|
| Graduate Student | Guo, Da-Yi (郭達毅) |
| Thesis Title | A Data-Locality Aware Parallelization Approach for Convolution Neural Network Inference |
| Advisor | Tsay, Ren-Song (蔡仁松) |
| Oral Examination Committee | Wu, Cheng-Wen (吳誠文); Liu, Ren-Shuo (呂仁碩) |
| Degree | Master |
| Department | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2019 |
| Academic Year of Graduation | 107 |
| Language | English |
| Number of Pages | 32 |
| Keywords (Chinese) | 嵌入式系統 (embedded system), 異質排程 (heterogeneous scheduling), 平行運算 (parallel computing) |
| Keywords (English) | embedded system, heterogeneous scheduler, parallel computing |
Abstract (translated from Chinese):

Parallelization is a widely used technique for improving the performance of multi-core systems. However, today's convolutional neural network (CNN) inference frameworks mostly adopt a multi-threaded approach that distributes the computation of each convolution layer across different cores. We observe that this approach incurs substantial inter-core communication overhead and degrades system performance. In this thesis, we apply the concept of pipeline execution to parallelize CNN inference and reduce communication cost. Our experimental results demonstrate that our method achieves 73% higher throughput than the conventional multi-threaded approach.
Abstract (English):

Parallelization is a common design practice for improving throughput on multicore systems. However, existing schedulers for convolutional neural network (CNN) inference essentially divide the computational tasks of each convolution layer across different CPU cores. This scheduling approach induces heavy inter-core data movement and degrades overall performance. In this thesis, we propose a pipeline-based scheduler that parallelizes CNN inference while reducing the overall latency. Optimizing the proposed pipeline-based scheduler requires carefully balancing the workload of each stage so that the total latency is minimized. Experimental results show that our approach achieves a 73% throughput improvement over the existing multi-threaded scheduler.
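The pipeline scheduling idea summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the six-layer cost model, `balance_stages`, and `stage_worker` are all hypothetical names, and the real convolution kernels are replaced by integer additions. Each pipeline stage (a contiguous group of layers) runs on its own worker thread, and input frames stream through bounded queues, so each layer's weights stay with one worker instead of migrating between cores.

```python
import queue
import threading
from itertools import combinations

# Hypothetical per-layer cost model: time units for each of six conv layers.
LAYER_COST = [4, 3, 5, 2, 6, 4]

def balance_stages(costs, num_stages):
    """Split layers into contiguous stages, minimizing the maximum stage
    cost (the pipeline's throughput bottleneck), by brute force over cuts."""
    n = len(costs)
    best, best_bounds = float("inf"), None
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = [0, *cuts, n]
        stage_costs = [sum(costs[a:b]) for a, b in zip(bounds, bounds[1:])]
        if max(stage_costs) < best:
            best, best_bounds = max(stage_costs), bounds
    return best_bounds, best

def stage_worker(layers, q_in, q_out):
    """Run one pipeline stage: apply this stage's layers to each frame."""
    while True:
        frame = q_in.get()
        if frame is None:          # sentinel: shut down and propagate
            q_out.put(None)
            return
        for cost in layers:        # stand-in for the real conv kernels
            frame += cost
        q_out.put(frame)

bounds, bottleneck = balance_stages(LAYER_COST, num_stages=3)
stages = [LAYER_COST[a:b] for a, b in zip(bounds, bounds[1:])]

# Bounded queues between stages keep producers from racing far ahead.
queues = [queue.Queue(maxsize=2) for _ in range(len(stages) + 1)]
threads = [threading.Thread(target=stage_worker, args=(s, queues[i], queues[i + 1]))
           for i, s in enumerate(stages)]
for t in threads:
    t.start()

for frame in range(4):             # stream four input "frames" through
    queues[0].put(frame)
queues[0].put(None)

results = []
while (out := queues[-1].get()) is not None:
    results.append(out)
for t in threads:
    t.join()

print(bounds, bottleneck, results)
```

The brute-force `balance_stages` mirrors the balancing requirement stated in the abstract: steady-state throughput is limited by the slowest stage, so the cut points are chosen to minimize the maximum per-stage cost rather than to equalize stage counts.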