
Graduate Student: Ruan, Yu-Shan
Thesis Title: A High-throughput Low-latency Inner Product Engine for Small-batch Inference of Deep Learning Recommendation Models
Advisor: Lin, Youn-Long
Oral Defense Committee: 黃俊達, 陳建文, 郭皇志
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of Publication: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 30
Keywords: Deep learning, Recommendation system, Accelerator, FPGA, Small-batch
    Recommendation systems are widely deployed across industries. Built on a user's past behavior, they can accurately surface appealing products or advertisements. With deep learning achieving outstanding results in many fields in recent years, a growing research community has turned its attention to applying deep neural networks to recommendation. The state-of-the-art open-source deep learning recommendation model (DLRM) enables researchers to contribute their expertise toward improving system efficiency, performance, and workload capacity.

    Memory resource constraints are the greatest obstacle to deploying deep learning in practice, and most prior work adopts large-batch-oriented architectures to relieve memory bandwidth pressure. Under low-latency, small-batch constraints, however, such designs suffer from low system utilization. This thesis therefore proposes a high-throughput, low-latency inner product engine (RecIP) for a DLRM accelerator. It strikes a suitable balance between hardware resource limits and high memory bandwidth demand, and meets low-latency requirements for small-batch DLRM inference.

    On an Intel Stratix 10 FPGA at 100 MHz, the RecIP engine reaches up to 819.2 GOP/s with nearly 90% system utilization, using 50% of the logic resources and 19% of the on-chip memory. Compared with a large-batch-oriented accelerator, it spends only 3% more of each of the logic and memory resources to achieve high utilization at small batch sizes.


    Recommendation systems are utilized in various business applications. Based on a user's records, they predict his or her rating or preference. Deep learning technology has achieved superior performance in diverse applications, and an open-sourced deep learning recommendation model (DLRM) is gaining popularity in both academia and industry. More and more research communities are contributing to optimizing the efficiency, performance, and workload capacity of deep-learning-based recommendation systems. However, most researchers adopt traditional DNN accelerators or batch-oriented architectures to reduce memory traffic.
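A rough model of why batch-oriented designs cut memory traffic: dense-layer weights are fetched from off-chip memory once per batch, so the weight traffic attributed to each query shrinks as 1/batch_size. The 64 MiB weight size below is an illustrative assumption, not a figure from the thesis.

```python
MLP_WEIGHT_BYTES = 64 * 1024 * 1024  # assumed 64 MiB of dense-layer weights

def weight_traffic_per_query(batch_size: int) -> float:
    """Bytes of off-chip weight traffic attributed to each query in a batch,
    assuming the weights are streamed in once and reused across the batch."""
    return MLP_WEIGHT_BYTES / batch_size

for b in (1, 16, 256):
    print(f"batch={b:3d}: {weight_traffic_per_query(b) / 1024:.1f} KiB/query")
```

The flip side, which motivates this thesis, is that a large batch is not always available: latency-bound serving may only have a handful of queries in flight, leaving a batch-oriented datapath underutilized.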

    To address DLRM's unique compute and memory characteristics, we propose a latency-aware, high-throughput Inner Product Engine (RecIP). It balances hardware resource limitations against high memory bandwidth requirements and supports a low-latency DLRM accelerator for small-batch inference.
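The inner-product workload RecIP targets can be sketched as DLRM's feature-interaction step: pairwise dot products between the bottom-MLP output vector and the looked-up embedding vectors. This is a generic sketch of that operation, not the thesis's exact implementation; the dimension and feature counts are illustrative.

```python
def dot_interaction(dense_out, embeddings):
    """dense_out: list of d floats (bottom-MLP output);
    embeddings: list of n embedding rows, each d floats.
    Returns the inner products of all distinct feature pairs."""
    feats = [dense_out] + embeddings
    pairs = []
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            pairs.append(sum(a * b for a, b in zip(feats[i], feats[j])))
    return pairs

d = 64                                      # embedding dimension (illustrative)
dense = [0.1] * d                           # stand-in bottom-MLP output
embs = [[float(k)] * d for k in range(8)]   # 8 sparse features (illustrative)
z = dot_interaction(dense, embs)
print(len(z))  # C(9, 2) = 36 pairwise inner products
```

Because each query contributes only these short dot products rather than a large matrix multiply, keeping a wide compute array busy at batch size 1 is the central scheduling problem.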

    Implemented on an Intel Stratix 10 FPGA, the RecIP engine achieves 819.2 GOP/s running at 100 MHz with nearly 90% compute utilization, using 50% of the logic resources and 19% of the on-chip memory, only 3% more of each than a batch-oriented architecture.
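A back-of-the-envelope check of the reported peak. The table of contents names a length-64 vector unit (VU64); assuming an array of 64 such units (the unit count is our assumption, not stated here), each performing 64 multiply-accumulates per cycle:

```python
NUM_VU = 64        # assumed number of vector units in the compute array
VU_LEN = 64        # length-64 vector unit (VU64, named in the TOC)
OPS_PER_MAC = 2    # each multiply-accumulate counts as 2 ops
FREQ_HZ = 100e6    # 100 MHz clock

gops = NUM_VU * VU_LEN * OPS_PER_MAC * FREQ_HZ / 1e9
print(f"{gops:.1f} GOP/s")  # matches the reported 819.2 GOP/s peak
```

Under these assumptions the 819.2 GOP/s figure corresponds to a fully occupied 64 x 64 MAC array, consistent with the reported near-90% utilization being the hard part at small batch sizes.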

    Acknowledgements
    摘要 (Abstract in Chinese)
    Abstract
    1 Introduction
    2 Related Work
      2.1 DNN Accelerators
      2.2 DLRM Accelerators
      2.3 Benchmarks
    3 Inner Product Engine Architecture
      3.1 Proposed Architecture
        3.1.1 System Overview
        3.1.2 RecIP Engine Topview
        3.1.3 Compute Array
        3.1.4 Length 64 Vector Unit (VU64)
        3.1.5 Data Mapping and Buffer Allocation
        3.1.6 Scheduling
    4 Experiment Results
      4.1 Experiment Setup
      4.2 Implementation Result
      4.3 Memory Traffic Analysis
      4.4 Hardware Resource Usage Analysis
      4.5 Hardware Utilization and Throughput Analysis
      4.6 System Timing Analysis
    5 Conclusion and Future Work
      5.1 Conclusion
      5.2 Future Work
    Bibliography

