| Graduate Student | 阮郁善 (Ruan, Yu-Shan) |
|---|---|
| Thesis Title | A High-throughput Low-latency Inner Product Engine for Small-batch Inference of Deep Learning Recommendation Models |
| Advisor | 林永隆 (Lin, Youn-Long) |
| Committee Members | 黃俊達, 陳建文, 郭皇志 |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Publication Year | 2021 |
| Academic Year of Graduation | 109 |
| Language | English |
| Pages | 30 |
| Keywords | Deep learning, Recommendation system, Accelerator, FPGA, Small-batch |
Recommendation systems are widely deployed across industries; built on a user's past behavior, they can precisely surface appealing products or advertisements. With deep learning achieving outstanding results in many fields in recent years, a growing research community has turned to applying deep neural networks to recommendation. The state-of-the-art, open-source deep learning recommendation model (DLRM) lets researchers contribute their expertise to issues such as system efficiency, performance, and workload capacity.
Memory-resource constraints are the greatest practical obstacle to deploying deep learning, and most studies adopt large-batch-oriented architectures to relieve memory-bandwidth demand. Under low-latency, small-batch constraints, however, such architectures suffer from low system utilization. This thesis therefore proposes a high-throughput, low-latency inner product engine (RecIP) for DLRM accelerators; it strikes a suitable balance between hardware-resource limits and high memory-bandwidth demand, and meets low-latency specifications for small-batch DLRM inference.
On an Intel Stratix 10 FPGA at 100 MHz, the RecIP engine reaches up to 819.2 GOP/s with nearly 90% system utilization, using 50% of the logic and 19% of the on-chip memory resources. Compared with a large-batch-oriented accelerator, it spends only 3% more of each of the logic and memory resources to achieve high utilization at small batch sizes.
Recommendation systems are used in a wide range of business applications. Based on a user's records, they predict his or her rating of, or preference for, an item. Deep learning has achieved superior performance in diverse applications, and an open-source deep learning recommendation model (DLRM) is gaining popularity in both academia and industry. A growing research community contributes to optimizing the efficiency, performance, and workload capacity of deep-learning-based recommendation systems. However, most researchers adopt traditional DNN accelerators or batch-oriented architectures to reduce memory traffic.
To address DLRM's unique compute and memory characteristics, we propose a latency-aware, high-throughput Inner Product Engine (RecIP). It balances hardware-resource limitations against high memory-bandwidth requirements, and supports a low-latency DLRM accelerator for small-query inference.
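The inner products that such an engine accelerates arise in DLRM's feature-interaction stage, where dense and embedding feature vectors are dotted pairwise. A minimal NumPy sketch of that operation (illustrative only; the function name and shapes are our assumptions, not taken from the thesis):

```python
import numpy as np

def feature_interaction(vectors):
    """Pairwise inner products among feature vectors, as in DLRM's
    interaction stage. `vectors` has shape (num_features, dim)."""
    z = vectors @ vectors.T                  # all pairwise dot products
    iu = np.triu_indices(len(vectors), k=1)  # keep each unordered pair once
    return z[iu]                             # shape: (n*(n-1)/2,)

# tiny example: 3 feature vectors of dimension 4
v = np.arange(12, dtype=np.float32).reshape(3, 4)
out = feature_interaction(v)
print(out)  # [ 38.  62. 214.]
```

For a batch, this matrix product is repeated per query, which is why small batches leave a batch-oriented matrix engine underutilized.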
Implemented on an Intel Stratix 10 FPGA, the RecIP engine achieves 819.2 GOP/s running at 100 MHz. It attains 90% utilization of its computing resources while using 50% of the logic resources, only 3% more than a batch-oriented architecture.
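As a back-of-envelope consistency check of the reported peak (the MAC count below is not stated in the abstract; 4096 is simply the value implied by the numbers, under the common convention that one multiply-accumulate counts as two operations):

```python
# 819.2 GOP/s at 100 MHz implies 8192 ops/cycle,
# i.e. 4096 parallel multiply-accumulate units.
clock_hz = 100e6           # 100 MHz
macs = 4096                # implied MAC count (our assumption)
ops_per_cycle = macs * 2   # multiply + add per MAC per cycle
peak_gops = clock_hz * ops_per_cycle / 1e9
print(peak_gops)           # 819.2
```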