
Author: 周家興 (Zhou, Jia-Xing)
Thesis Title: Differentiable Lookup-Based Matrix Multiplication for Compressing Transformer Network (可微分查找矩陣乘法用於壓縮Transformer網路)
Advisor: 林永隆 (Lin, Young-Long)
Committee Members: 王廷基 (Wang, Ting-Chi); 吳凱強 (Wu, Kai-Chiang)
Degree: Master
Department: Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Academic Year of Graduation: 112 (2023-2024)
Language: English
Number of Pages: 34
Keywords (Chinese): 基於查找的矩陣乘法、壓縮、transformer 網路
Keywords (English): lookup-based matrix multiplication, compression, transformer network
    摘要 (Chinese Abstract):
    In recent years, researchers have pursued more efficient deep neural networks, focusing in particular on reducing the computational cost of multiply-accumulate operations. Traditional strategies such as knowledge distillation, pruning, and quantization have been explored in depth. Because multiplication is energy-intensive, new strategies such as AdderNet and ShiftCNN have emerged, which aim to replace the original operations and thereby save energy.
    More recently, MADDNESS proposed an entirely new strategy that directly replaces multiply-accumulate operations with a lookup-accumulate approach. Subsequent works such as PECAN and LUT-NN have followed this direction. Our work further improves LUT-NN and proposes an end-to-end training method. Results on the ImageNet dataset show that our method raises the accuracy of the LUT-NN baseline by up to 11%.


    Abstract:
    In recent years, the quest for efficient Deep Neural Networks (DNNs) has centered on reducing the computational burden of multiply-accumulate (MAC) operations. Traditional avenues such as Knowledge Distillation (KD), pruning, and quantization have been explored extensively. With the energy cost of multiplication operations being a significant concern, alternative methodologies like AdderNet and ShiftCNN have emerged, focusing on the direct substitution of operations to save energy.
    Recently, a novel approach called MADDNESS took this further by entirely replacing MAC operations with lookup-accumulate (LAC) operations. Several subsequent works, including PECAN and LUT-NN, have followed suit. Our research builds on and notably improves the latest of these methods, LUT-NN, introducing an end-to-end training procedure. Tested on the ImageNet dataset, our proposed method significantly enhances the efficiency of DNNs, improving upon the baseline LUT-NN model’s accuracy by up to 11%.
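
    The lookup-accumulate idea summarized above can be sketched briefly: input rows are split into subvectors, each subvector is encoded as the index of its nearest prototype, and the matrix product is then assembled by summing precomputed table entries instead of performing multiply-accumulates. The following is a minimal NumPy sketch of this general MADDNESS/LUT-NN-style scheme, not the thesis's actual implementation; the function names, shapes, and random prototypes are illustrative assumptions (in practice the prototypes would be learned, e.g. with k-means on activations).

import numpy as np

def build_tables(W, prototypes):
    """Precompute lookup tables: tables[c, k, n] = <prototype k of codebook c, rows of W in sub-block c, column n>.

    W:          (D, N) weight matrix, split along D into C sub-blocks of size d = D // C.
    prototypes: (C, K, d) prototypes for the input subvectors.
    """
    C, K, d = prototypes.shape
    W_split = W.reshape(C, d, -1)                              # (C, d, N)
    return np.einsum('ckd,cdn->ckn', prototypes, W_split)      # (C, K, N)

def encode(X, prototypes):
    """Replace each input subvector by the index of its nearest prototype."""
    C, K, d = prototypes.shape
    X_split = X.reshape(X.shape[0], C, 1, d)                   # (B, C, 1, d)
    dists = ((X_split - prototypes) ** 2).sum(-1)              # (B, C, K) squared distances
    return dists.argmin(-1)                                    # (B, C) integer codes

def lookup_accumulate(codes, tables):
    """Approximate X @ W by summing the table rows selected by the codes (only lookups and additions)."""
    C = codes.shape[1]
    return sum(tables[c, codes[:, c]] for c in range(C))       # (B, N)

# Usage: approximate a (B, D) x (D, N) product.
rng = np.random.default_rng(0)
B, D, N, C, K = 4, 32, 8, 8, 16
X, W = rng.normal(size=(B, D)), rng.normal(size=(D, N))
protos = rng.normal(size=(C, K, D // C))                       # placeholder; normally learned prototypes
tables = build_tables(W, protos)
Y_approx = lookup_accumulate(encode(X, protos), tables)
print(Y_approx.shape, np.abs(Y_approx - X @ W).mean())

    The accuracy of the approximation depends entirely on how well the prototypes cover the input subvectors, which is why prototype learning is central to MADDNESS, PECAN, and LUT-NN.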

    Table of Contents
    Acknowledgements
    摘要 (Chinese Abstract)
    Abstract
    1 Introduction
    2 Background
      2.1 Scalar Quantization
      2.2 Product Quantization
        2.2.1 Lookup-Based Matrix Multiplication
        2.2.2 Time and Space Complexity Analysis
      2.3 Vision Transformer
        2.3.1 Multi-head Self Attention
        2.3.2 Transformer Block for Images
        2.3.3 The Class Token
      2.4 ResMLP
    3 Related Work
      3.1 MADDNESS
        3.1.1 MADDNESSHASH
        3.1.2 Prototypes Optimization
      3.2 PECAN
        3.2.1 Angle-Based Similarity
        3.2.2 L1 Norm-Based Similarity
      3.3 LUT-NN
    4 Proposed Methods
      4.1 Differentiable Product Quantization
        4.1.1 Differentiable Encoding
        4.1.2 Learned Temperature
        4.1.3 Updating Prototypes via Gradient Descent
      4.2 Scalar Quantization-Aware Training at Table Level
      4.3 K-Means Clustering for Initialization on More Samples
      4.4 Self-Knowledge Distillation
        4.4.1 Soft Distillation
        4.4.2 Hard Distillation
    5 Experimental Results
      5.1 Layer Compression
      5.2 Two Types of Knowledge Distillation
      5.3 The Impact of Data Types of Table
      5.4 Comparison between Our Work and LUT-NN
      5.5 Ablation Study
      5.6 Evaluating Our Method's Impact on the ResMLP-S12
    6 Conclusion and Future Work
    References
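
    Sections 4.1.1 (Differentiable Encoding) and 4.1.2 (Learned Temperature) of the outline indicate that the hard nearest-prototype assignment is relaxed so that prototypes and temperature can be trained end to end by gradient descent. The sketch below shows one common form of such a relaxation, a temperature-scaled softmax over negative subvector-to-prototype distances; the class name, tensor shapes, and the single learned log-temperature per codebook are hypothetical assumptions for illustration, not the thesis's actual formulation.

import torch
import torch.nn.functional as F

class SoftEncoder(torch.nn.Module):
    """Soft nearest-prototype encoding: a differentiable stand-in for argmin.

    Prototypes (C, K, d) and a per-codebook log-temperature are learned jointly;
    as the temperature shrinks, the softmax approaches the hard one-hot code
    used by a lookup-based layer at inference time.
    """
    def __init__(self, num_codebooks, num_prototypes, subvec_dim):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_codebooks, num_prototypes, subvec_dim))
        self.log_temperature = torch.nn.Parameter(torch.zeros(num_codebooks))

    def forward(self, x):                                          # x: (B, C, d) input subvectors
        d2 = ((x.unsqueeze(2) - self.prototypes) ** 2).sum(-1)     # (B, C, K) squared distances
        t = self.log_temperature.exp().view(1, -1, 1)              # (1, C, 1) positive temperatures
        return F.softmax(-d2 / t, dim=-1)                          # (B, C, K) soft assignment weights

encoder = SoftEncoder(num_codebooks=8, num_prototypes=16, subvec_dim=4)
weights = encoder(torch.randn(2, 8, 4))
print(weights.shape, weights.sum(-1))                              # torch.Size([2, 8, 16]); each row sums to 1

    During training, the soft assignment weights can multiply the precomputed table rows, keeping the whole pipeline differentiable; inference can fall back to the hard argmax lookup.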

    [1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image
    recognition,” arXiv preprint arXiv:1409.1556, 2014.
    [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional
    neural networks,” Advances in neural information processing systems, vol. 25,
    2012.
    [3] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection
    with region proposal networks,” Advances in neural information processing systems,
    vol. 28, 2015.
    [4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,”
    in Proceedings of the IEEE conference on computer vision and pattern recognition,
    pp. 3431–3440, 2015.
    [5] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted
    residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer
    vision and pattern recognition, pp. 4510–4520, 2018.
    [6] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer,
    “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,”
    arXiv preprint arXiv:1602.07360, 2016.
    [7] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,”
    in International conference on machine learning, pp. 6105–6114, PMLR, 2019.
    [8] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for
    the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern
    recognition, pp. 11976–11986, 2022.
    [9] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2:
    Co-designing and scaling convnets with masked autoencoders,” in Proceedings of the
    IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16133–16142,
    2023.
    [10] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,”
    arXiv preprint arXiv:1503.02531, 2015.
    [11] D. Blalock, J. J. Gonzalez Ortiz, J. Frankle, and J. Guttag, “What is the state of neural
    network pruning?,” Proceedings of machine learning and systems, vol. 2, pp. 129–146,
    2020.
    [12] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, “Sparsity in deep learning:
    Pruning and growth for efficient inference and training in neural networks,” The Journal
    of Machine Learning Research, vol. 22, no. 1, pp. 10882–11005, 2021.
    [13] V. Natesh, A. Sabot, H. Kung, and M. Ting, “Rosko: Row skipping outer products for
    sparse matrix multiplication kernels,” arXiv preprint arXiv:2307.03930, 2023.
    [14] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and
    D. Kalenichenko, “Quantization and training of neural networks for efficient
    integer-arithmetic-only inference,” in Proceedings of the IEEE conference on computer vision
    and pattern recognition, pp. 2704–2713, 2018.
    [15] M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, “Up or down?
    adaptive rounding for post-training quantization,” in International Conference on Machine
    Learning, pp. 7197–7206, PMLR, 2020.
    [16] Y. Bhalgat, J. Lee, M. Nagel, T. Blankevoort, and N. Kwak, “Lsq+: Improving low-bit
    quantization through learnable offsets and better initialization,” in Proceedings of the
    IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 696–
    697, 2020.
    [17] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in 2014
    IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC),
    pp. 10–14, 2014.
    [18] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification
    using binary convolutional neural networks,” in European conference on computer vision,
    pp. 525–542, Springer, 2016.
    [19] H. Chen, Y. Wang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu, “Addernet: Do we really
    need multiplications in deep learning?,” in Proceedings of the IEEE/CVF conference on
    computer vision and pattern recognition, pp. 1468–1477, 2020.
    [20] Y. Xu, C. Xu, X. Chen, W. Zhang, C. Xu, and Y. Wang, “Kernel based progressive distillation
    for adder neural networks,” Advances in Neural Information Processing Systems,
    vol. 33, pp. 12322–12333, 2020.
    [21] D. A. Gudovskiy and L. Rigazio, “Shiftcnn: Generalized low-precision architecture for
    inference of convolutional neural networks,” arXiv preprint arXiv:1706.02393, 2017.
    [22] D. Blalock and J. Guttag, “Multiplying matrices without multiplying,” in International
    Conference on Machine Learning, pp. 992–1004, PMLR, 2021.
    [23] X. Tang, Y. Wang, T. Cao, L. L. Zhang, Q. Chen, D. Cai, Y. Liu, and M. Yang,
    “Lut-nn: Towards unified neural network inference by table lookup,” arXiv preprint
    arXiv:2302.03213, 2023.
    [24] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,”
    IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 1, pp. 117–
    128, 2010.
    [25] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,”
    in Proceedings of the fifth Berkeley symposium on mathematical statistics and
    probability, pp. 281–297, Oakland, CA, USA, 1967.
    [26] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
    M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers
    for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
    [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
    I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems,
    vol. 30, 2017.
    [28] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint
    arXiv:1606.08415, 2016.
    [29] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint
    arXiv:1607.06450, 2016.
    [30] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence
    to sequence learning,” in International conference on machine learning, pp. 1243–1252,
    PMLR, 2017.
    [31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional
    transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
    [32] H. Touvron, P. Bojanowski, M. Caron, M. Cord, A. El-Nouby, E. Grave, G. Izacard,
    A. Joulin, G. Synnaeve, J. Verbeek, et al., “Resmlp: Feedforward networks for image
    classification with data-efficient training,” IEEE Transactions on Pattern Analysis and Machine
    Intelligence, vol. 45, no. 4, pp. 5314–5321, 2022.
    [33] J. Ran, R. Lin, J. C. L. Li, J. Zhou, and N. Wong, “Pecan: A product-quantized content
    addressable memory network,” in 2023 Design, Automation & Test in Europe Conference
    & Exhibition (DATE), pp. 1–6, IEEE, 2023.
    [34] T. Chen, L. Li, and Y. Sun, “Differentiable product quantization for end-to-end embedding
    compression,” in International Conference on Machine Learning, pp. 1617–1626, PMLR,
    2020.
    [35] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, “Differentiable soft
    quantization: Bridging full-precision and low-bit neural networks,” in Proceedings of the
    IEEE/CVF international conference on computer vision, pp. 4852–4861, 2019.
    [36] A. Fan, P. Stock, B. Graham, E. Grave, R. Gribonval, H. Jegou, and A. Joulin,
    “Training with quantization noise for extreme model compression,” arXiv preprint
    arXiv:2004.07320, 2020.
    [37] V. Markovtsev, “Kmcuda.” https://github.com/src-d/kmcuda, 2020.
    [38] Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz, “Yinyang k-means: A drop-in
    replacement of the classic k-means with consistent speedup,” in International conference
    on machine learning, pp. 579–587, PMLR, 2015.
    [39] H. Touvron, M. Cord, and H. Jégou, “Deit iii: Revenge of the vit,” in European Conference
    on Computer Vision, pp. 516–533, Springer, 2022.
