Graduate Student: 梁耕銘 Liang, Geng-Ming
Thesis Title: SSHO:類神經網路的結構式稀疏張量與高階綜合優化器 (SSHO: Structured Sparse HLS Optimizer for Neural Networks)
Advisor: 李政崑 Lee, Jenq-Kuen
Committee Members: 游逸平 You, Yi-Ping; 洪明郁 Hung, Ming-Yu
Degree: Master
Department: 電機資訊學院 資訊工程學系 (College of Electrical Engineering and Computer Science, Department of Computer Science)
Year of Publication: 2024
Academic Year of Graduation: 112
Language: Chinese
Number of Pages: 40
Keywords: MLIR, HLS (High-Level Synthesis), LLVM, Compiler, Sparse Tensor
Abstract:
In recent years, large language models (LLMs) have become one of the most prominent topics in computing and have profoundly changed our lives. Many devices now add AI features to enhance the user experience, but how can these devices afford such high storage and computation requirements? Sparse computing is a classic and efficient way to address this problem: by neither storing nor computing zero values, we can greatly reduce latency and memory usage. Well-known AI frameworks already implement sparse computing, and MLIR provides a Sparse Tensor dialect that supports sparse compression formats and a compilation flow down to LLVM. With support for specific accelerators, we can make the most of our designs, and HLS (High-Level Synthesis) has become a fast way to prototype them. However, MLIR only supports classic sparse compression formats, which are not hardware-friendly. Moreover, the LLVM IR generated from an MLIR AI model cannot be used by HLS directly because of version differences and coding-style mismatches. In this thesis, we propose a new sparse compression scheme that is easier to parallelize in hardware and preserves accuracy by pruning only small weights. We choose convolutional neural network (CNN) models to demonstrate the performance of our design. The results show that, at high sparsity, we maintain accuracy while reducing both computation time and storage.
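The abstract describes a structured sparse compression that keeps a fixed number of non-zero weights per block, so hardware can process every block with the same amount of work. The sketch below is only an illustration of that general idea, assuming an N:M-style layout with hypothetical names (BlockSparse, compress_blocks, sparse_dot) and parameters (M = 8 weights per block, N = 2 kept); it is not the actual SSHO format or its MLIR/HLS flow.

```cpp
// Illustrative sketch of a fixed N:M block-sparse format (not the SSHO scheme).
// Assumes the weight vector length is a multiple of M for brevity.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int M = 8;  // weights per block
constexpr int N = 2;  // nonzeros kept per block (fixed => easy to pipeline)

struct BlockSparse {
    std::vector<float>   values;   // N values per block
    std::vector<uint8_t> offsets;  // position of each value inside its block
};

// Keep the N largest-magnitude weights of every M-wide block; prune the rest.
BlockSparse compress_blocks(const std::vector<float>& w) {
    BlockSparse out;
    for (size_t base = 0; base < w.size(); base += M) {
        int idx[M];
        for (int i = 0; i < M; ++i) idx[i] = i;
        std::partial_sort(idx, idx + N, idx + M, [&](int a, int b) {
            return std::fabs(w[base + a]) > std::fabs(w[base + b]);
        });
        for (int k = 0; k < N; ++k) {
            out.values.push_back(w[base + idx[k]]);
            out.offsets.push_back(static_cast<uint8_t>(idx[k]));
        }
    }
    return out;
}

// Dot product of compressed weights with a dense activation vector.
float sparse_dot(const BlockSparse& w, const std::vector<float>& x) {
    float acc = 0.0f;
    size_t num_blocks = w.offsets.size() / N;
    for (size_t b = 0; b < num_blocks; ++b) {
        for (int k = 0; k < N; ++k) {            // constant trip count
            size_t j = b * N + k;
            acc += w.values[j] * x[b * M + w.offsets[j]];
        }
    }
    return acc;
}
```

Because every block contributes exactly N multiply-accumulates, the inner loop has a constant trip count that an HLS tool can fully unroll or pipeline, which is what makes such block-structured formats more hardware-friendly than classic CSR-style compression with variable row lengths.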