
Graduate Student: Lin, Ting-Chun (林亭君)
Thesis Title: High Utilization Vision Transformer Accelerator with Flexible Dataflow (具有彈性資料流的高利用率 Vision Transformer 加速器)
Advisor: Chiu, Ching-Te (邱瀞德)
Committee Members: Lee, Jenq-Kuen (李政崑); Van, Lan-Da (范倫達)
Degree: Master
Department: Institute of Communications Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduation Academic Year: 113
Language: English
Number of Pages: 59
Keywords: Vision Transformer, Hardware Accelerator, High Utilization, Flexible Dataflow


Abstract:
    Recently, Vision Transformers, which leverage self-attention mechanisms, have evolved rapidly and achieve superior accuracy across a wide range of computer vision tasks. However, their high memory and computational costs make them challenging to deploy on resource-constrained edge devices, limiting their practical benefits. Two key observations in current research motivate our proposed method. First, while most transformer accelerators identify the self-attention module as the primary bottleneck for inference acceleration, the feed-forward network, the other major module of the transformer encoder, also demands significant computation, especially at short token lengths, making it another bottleneck. Second, although mapping strategies are essential for improving hardware performance, hardware utilization and data reuse across different dataflows in transformers remain underexplored.
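
    As a quick sanity check of the observation about the feed-forward network, the following back-of-the-envelope estimate compares the two modules for one DeiT-Tiny encoder layer using the standard ViT per-layer MAC formulas; it is an illustrative calculation, not a figure reported in the thesis.

        # Per-layer MAC counts for DeiT-Tiny (N = 197 tokens, d = 192), using the
        # standard ViT encoder formulas; illustrative only, not taken from the thesis.
        N, d = 197, 192
        attention = 4 * N * d**2 + 2 * N**2 * d   # Q/K/V/output projections + QK^T and AV
        ffn = 2 * N * d * (4 * d)                 # two linear layers with hidden size 4d
        print(f"attention ~ {attention / 1e6:.1f}M MACs, FFN ~ {ffn / 1e6:.1f}M MACs")
        # -> attention ~ 44.0M MACs, FFN ~ 58.1M MACs: at this short sequence length
        #    the feed-forward network is at least as heavy as self-attention.
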
    To address these issues, we propose a high-utilization general matrix multiplication (GEMM) engine with flexible dataflow, fully optimized for the linear layers in vision transformers. To maximize hardware utilization, we first introduce a tile size search strategy that considers the memory requirements of the different linear operations and selects the most suitable tile size. Second, we implement intra-layer parallelization at the tile level, allowing various parallel dimensions of dataflow. To enhance data reuse, we analyze the DRAM access amount of different dataflows and, for each operation, choose the dataflow that minimizes data movement.
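
    As a rough illustration of this kind of tile size and dataflow search, the Python sketch below enumerates tile sizes that divide the GEMM dimensions, discards those that overflow an assumed SRAM budget, and keeps the tile/dataflow pair with the lowest estimated DRAM traffic. The buffer size, PE width, candidate dataflows, and first-order cost model are all illustrative assumptions and do not come from the thesis.

        # Illustrative sketch (not the thesis's exact algorithm): choose a tile size and
        # loop order for one GEMM (M x K) @ (K x N) so that the tiles fit in SRAM and
        # the estimated DRAM traffic is minimized.
        from itertools import product

        SRAM_BYTES = 64 * 1024   # assumed on-chip buffer budget
        PE_DIM = 32              # assumed PE array width; tiles are multiples of it

        def dram_traffic(M, K, N, Tm, Tk, Tn, stationary):
            """First-order DRAM access estimate (in elements) for a tiled GEMM."""
            if stationary == "output":              # output tile stays on chip
                return (M * K * (N // Tn)           # A re-read once per output-column tile
                        + K * N * (M // Tm)         # B re-read once per output-row tile
                        + M * N)                    # C written once
            else:                                   # "weight": B tile stays on chip
                return (M * K * (N // Tn)
                        + K * N
                        + 2 * M * N * (K // Tk))    # partial sums spilled and re-read

        def fits_in_sram(Tm, Tk, Tn, bytes_per_elem=1):
            a = Tm * Tk * bytes_per_elem
            b = Tk * Tn * bytes_per_elem
            c = Tm * Tn * 4                         # assume 32-bit accumulators
            return a + b + c <= SRAM_BYTES

        def search(M, K, N):
            best = None
            candidates = [PE_DIM * i for i in (1, 2, 4, 8)]
            for Tm, Tk, Tn in product(candidates, repeat=3):
                if M % Tm or K % Tk or N % Tn:      # keep PE utilization at 100%
                    continue
                if not fits_in_sram(Tm, Tk, Tn):
                    continue
                for flow in ("output", "weight"):
                    cost = dram_traffic(M, K, N, Tm, Tk, Tn, flow)
                    if best is None or cost < best[0]:
                        best = (cost, (Tm, Tk, Tn), flow)
            return best

        # Example: a d x d projection in DeiT-Tiny (197 tokens padded to 224, d = 192)
        print(search(M=224, K=192, N=192))
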
    In the hardware design, to support flexible dataflows, we propose three flexible transformer GEMM engines for head-level parallel computation. Each engine is organized as a 4-stage pipeline consisting of a distribution network, a multiplication network, a reduction network, and a quantization network. For the non-linear operations, we adopt the approximate softmax algorithm proposed in [1] and design a lightweight Shiftmax module. Instead of instantiating one shiftexp unit per token in the sequence, we use only 32 shiftexp units running in parallel; this adds just 32 multipliers and adders, and by reusing these units across the whole sequence the computation is completed with minimal hardware overhead.
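
    To make the reuse scheme concrete, here is a small behavioural sketch of a chunked Shiftmax: a fixed pool of 32 shift-based exponent units is swept across the row instead of instantiating one unit per token. The 32-unit pool follows the description above, but the exponent approximation, the scale_bits parameter, and the floating-point normalisation at the end are simplifications for illustration, not the exact integer pipeline of the thesis or of I-ViT [1].

        # Behavioural sketch of a chunked Shiftmax; the shift-based exponent below is a
        # simplification in the spirit of shiftexp, not the exact hardware formulation.
        import numpy as np

        NUM_UNITS = 32  # number of parallel shiftexp units, as in the proposed module

        def shiftexp(x_int, scale_bits=4):
            """Approximate 2**(x / 2**scale_bits) for non-positive integer x with shifts."""
            q, r = divmod(-int(x_int), 1 << scale_bits)   # integer / fractional split
            frac = (1 << scale_bits) - (r >> 1)           # linear approx of 2**(-r / 2**k)
            return frac >> min(q, 31)                     # shift down by the integer part

        def shiftmax_row(scores, scale_bits=4):
            """Row-wise softmax over integer attention scores, 32 elements at a time."""
            scores = np.asarray(scores, dtype=np.int64)
            shifted = scores - scores.max()               # <= 0, keeps shiftexp stable
            exps = np.empty_like(shifted)
            for start in range(0, len(shifted), NUM_UNITS):   # reuse the 32-unit pool
                chunk = shifted[start:start + NUM_UNITS]
                exps[start:start + NUM_UNITS] = [shiftexp(v, scale_bits) for v in chunk]
            return exps / exps.sum()                      # normalisation (float here for clarity)

        # Example: one row of attention scores for a 197-token sequence
        row = np.random.randint(-64, 64, size=197)
        print(shiftmax_row(row).sum())                    # ~1.0
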
    The proposed vision transformer accelerator supports a flexible dataflow for each layer, achieving a 2.77× speedup over [2] while maintaining 100% hardware utilization on DeiT-Tiny. Implemented in TSMC 40 nm technology, it operates at 1000 MHz and delivers an area efficiency of 116.79 GOPS/mm² and a power efficiency of 0.47 TOPS/W.

Table of Contents
    Abstract (Chinese)
    Abstract (English)
    1 Introduction
      1.1 Motivation
      1.2 Goal
      1.3 Contribution
    2 Background and Related Works
      2.1 Transformer Structure
      2.2 Transformer Accelerators
    3 Proposed High Utilization and Flexible Dataflow Accelerator and Methods
      3.1 High Utilization and Flexible Dataflow Accelerator
        3.1.1 Accelerator Architecture
        3.1.2 Flexible Transformer GEMM Engine
        3.1.3 SRAM Data Placement
        3.1.4 SRAM Controller
        3.1.5 Shiftmax Module
      3.2 Methods
        3.2.1 Dataflow Description
        3.2.2 Analytical Approach
        3.2.3 Mapping Strategy
    4 Experimental Results
      4.1 DRAM Access Amount
      4.2 SRAM Size Utilization
      4.3 PE Utilization
      4.4 Computing Performance
      4.5 Synthesis Result
      4.6 Comparison with other works
    5 Conclusion
    References

References
    [1] Z. Li and Q. Gu, “I-vit: Integer-only quantization for efficient vision transformer inference,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17065–17075, 2023.
    [2] M. Huang, J. Luo, C. Ding, Z. Wei, S. Huang, and H. Yu, “An integer-only and group-vector systolic accelerator for efficiently mapping vision transformer on edge,” IEEE Transactions on Circuits and Systems I: Regular Papers, 2023.
    [3] H. You, Z. Sun, H. Shi, Z. Yu, Y. Zhao, Y. Zhang, C. Li, B. Li, and Y. Lin, “Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 273–286, IEEE, 2023.
    [4] P. Dong, M. Sun, A. Lu, Y. Xie, K. Liu, Z. Kong, X. Meng, Z. Li, X. Lin, Z. Fang, et al., “Heatvit: Hardware-efficient adaptive token pruning for vision transformers,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 442–455, IEEE, 2023.
    [5] S. Nag, G. Datta, S. Kundu, N. Chandrachoodan, and P. A. Beerel, “Vita: A vision transformer inference accelerator for edge applications,” in 2023 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5, IEEE, 2023.
    [6] H.-Y. Wang and T.-S. Chang, “Row-wise accelerator for vision transformer,” in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), pp. 399–402, IEEE, 2022.
    [7] C.-C. Lin, W. Lu, P.-T. Huang, and H.-M. Chen, “A 28nm 343.5 fps/w vision transformer accelerator with integer-only quantized attention block,” in 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), pp. 80–84, IEEE, 2024.
    [8] G. Islamoglu, M. Scherer, G. Paulin, T. Fischer, V. J. Jung, A. Garofalo, and L. Benini, “Ita: An energy-efficient attention and softmax accelerator for quantized transformers,” in 2023 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1–6, IEEE, 2023.
    [9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
    [10] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training dataefficient image transformers & distillation through attention,” in International conference on machine learning, pp. 10347–10357, PMLR, 2021.
    [11] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo, “Learning texture transformer network for image super-resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5791–5800, 2020.
    [12] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision, pp. 213–229, Springer, 2020.
    [13] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
    [14] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, “Pre-trained image processing transformer,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12299–12310, 2021.
    [15] M. D. M. Reddy, M. S. M. Basha, M. M. C. Hari, and M. N. Penchalaiah, “Dall-e: Creating images from text,” UGC Care Group I Journal, vol. 8, no. 14, pp. 71–75, 2021.
    [16] L. Ye, M. Rochan, Z. Liu, and Y. Wang, “Cross-modal self-attention network for referring image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10502–10511, 2019.
    [17] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “Videobert: A joint model for video and language representation learning,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 7464–7473, 2019.
    [18] C. Sun, F. Baradel, K. Murphy, and C. Schmid, “Learning video representations using contrastive bidirectional transformer,” arXiv preprint arXiv:1906.05743, 2019.
    [19] C. Doersch, A. Gupta, and A. Zisserman, “Crosstransformers: spatially-aware few-shot transfer,” Advances in Neural Information Processing Systems, vol. 33, pp. 21981–21993, 2020.
    [20] M. Kumar, D. Weissenborn, and N. Kalchbrenner, “Colorization transformer,” arXiv preprint arXiv:2102.04432, 2021.
    [21] X. Wang, C. Yeshwanth, and M. Nießner, “Sceneformer: Indoor scene generation with transformers,” in 2021 International Conference on 3D Vision (3DV), pp. 106–115, IEEE, 2021.
    [22] H.-J. Ye, H. Hu, D.-C. Zhan, and F. Sha, “Few-shot learning via embedding adaptation with set-to-set functions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8808–8817, 2020.
    [23] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “Vl-bert: Pre-training of generic visual-linguistic representations,” arXiv preprint arXiv:1908.08530, 2019.
    [24] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021.
    [25] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 1833–1844, 2021.
    [26] S.-C. Kao, H. Kwon, M. Pellauer, A. Parashar, and T. Krishna, “A formalism of dnn accelerator flexibility,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 6, no. 2, pp. 1–23, 2022.
    [27] G. E. Moon, H. Kwon, G. Jeong, P. Chatarasi, S. Rajamanickam, and T. Krishna, “Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 4, pp. 1002–1014, 2021.
    [28] L. Lu, Y. Jin, H. Bi, Z. Luo, P. Li, T. Wang, and Y. Liang, “Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 977–991, 2021.
    [29] S. Tuli and N. K. Jha, “Acceltran: A sparsity-aware accelerator for dynamic inference with transformers,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 11, pp. 4038–4051, 2023.
    [30] Z. Qu, L. Liu, F. Tu, Z. Chen, Y. Ding, and Y. Xie, “Dota: detect and omit weak attentions for scalable transformer acceleration,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 14–26, 2022.
    [31] Y. Qin, Y. Wang, D. Deng, Z. Zhao, X. Yang, L. Liu, S. Wei, Y. Hu, and S. Yin, “Fact: Ffn-attention co-optimized transformer architecture with eager correlation prediction,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1–14, 2023.
    [32] G. Shen, J. Zhao, Q. Chen, J. Leng, C. Li, and M. Guo, “Salo: an efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences,” in Proceedings of the 59th ACM/IEEE Design Automation Conference, pp. 571–576, 2022.
    [33] J. Park, H. Yoon, D. Ahn, J. Choi, and J.-J. Kim, “Optimus: Optimized matrix multiplication structure for transformer neural network accelerator,” Proceedings of Machine Learning and Systems, vol. 2, pp. 363–378, 2020.
    [34] T. J. Ham, S. J. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, J.-H. Park, S. Lee, K. Park, J. W. Lee, et al., “A^3: Accelerating attention mechanisms in neural networks with approximation,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 328–341, IEEE, 2020.
    [35] C. Chen, L. Li, and M. M. S. Aly, “Vita: A highly efficient dataflow and architecture for vision transformers,” in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1–6, IEEE, 2024.
    [36] Y. Wang, Y. Qin, D. Deng, J. Wei, Y. Zhou, Y. Fan, T. Chen, H. Sun, L. Liu, S. Wei, et al., “A 28nm 27.5 tops/w approximate-computing-based transformer processor with asymptotic sparsity speculating and out-of-order computing,” in 2022 IEEE international solid-state circuits conference (ISSCC), vol. 65, pp. 1–3, IEEE, 2022.
    [37] K. Marino, P. Zhang, and V. K. Prasanna, “Me-vit: A single-load memory-efficient fpga accelerator for vision transformers,” in 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 213–223, IEEE, 2023.
    [38] Z. Zhao, R. Cao, K.-F. Un, W.-H. Yu, P.-I. Mak, and R. P. Martins, “An fpga-based transformer accelerator using output block stationary dataflow for object recognition applications,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 1, pp. 281–285, 2022.
    [39] T. Wang, L. Gong, C. Wang, Y. Yang, Y. Gao, X. Zhou, and H. Chen, “Via: A novel vision transformer accelerator based on fpga,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 11, pp. 4088–4099, 2022.
    [40] E. Kwon, J. Yoon, and S. Kang, “Mobile transformer accelerator exploiting various line sparsity and tile-based dynamic quantization,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023.
    [41] J. Dass, S. Wu, H. Shi, C. Li, Z. Ye, Z. Wang, and Y. Lin, “Vitality: Unifying low-rank and sparse approximation for vision transformer acceleration with a linear taylor attention,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 415–428, IEEE, 2023.
    [42] B. Keller, R. Venkatesan, S. Dai, S. G. Tell, B. Zimmer, C. Sakr, W. J. Dally, C. T. Gray, and B. Khailany, “A 95.6-tops/w deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm,” IEEE Journal of Solid-State Circuits, vol. 58, no. 4, pp. 1129–1141, 2023.
    [43] A. Marchisio, D. Dura, M. Capra, M. Martina, G. Masera, and M. Shafique, “Swifttron: An efficient hardware accelerator for quantized transformers,” in 2023 International Joint Conference on Neural Networks (IJCNN), pp. 1–9, IEEE, 2023.
    [44] A. Amirshahi, J. A. H. Klein, G. Ansaloni, and D. Atienza, “Tic-sat: Tightly-coupled systolic accelerator for transformers,” in Proceedings of the 28th Asia and South Pacific Design Automation Conference, pp. 657–663, 2023.
    [45] B. Keller, R. Venkatesan, S. Dai, S. G. Tell, B. Zimmer, W. J. Dally, C. T. Gray, and B. Khailany, “A 17–95.6 tops/w deep learning inference accelerator with per-vector scaled 4-bit quantization for transformers in 5nm,” in 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), pp. 16–17, IEEE, 2022.
    [46] H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 97–110, IEEE, 2021.
    [47] C. Fang, S. Guo, W. Wu, J. Lin, Z. Wang, M. K. Hsu, and L. Liu, “An efficient hardware accelerator for sparse transformer neural networks,” in 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2670–2674, IEEE, 2022.
    [48] K. A. A. Fuad and L. Chen, “A survey on sparsity exploration in transformer-based accelerators,” Electronics, vol. 12, no. 10, p. 2299, 2023.
    [49] D. Du, G. Gong, and X. Chu, “Model quantization and hardware acceleration for vision transformers: A comprehensive survey,” arXiv preprint arXiv:2405.00314, 2024.
    [50] S. Chen and Z. Lu, “Hardware acceleration of multilayer perceptron based on inter-layer optimization,” in 2019 IEEE 37th International Conference on Computer Design (ICCD), pp. 164–172, IEEE, 2019.
    [51] J. Zhao, P. Zeng, G. Shen, Q. Chen, and M. Guo, “Hardware-software co-design enabling static and dynamic sparse attention mechanisms,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024.
