
Author: 周家興 (Zhou, Jia-Xing)
Thesis Title: Differentiable Lookup-Based Matrix Multiplication for Compressing Transformer Network (可微分查找矩陣乘法用於壓縮Transformer網路)
Advisor: 林永隆 (Lin, Young-Long)
Committee Members: 王廷基 (Wang, Ting-Chi); 吳凱強 (Wu, Kai-Chiang)
Degree: Master
Department: Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Academic Year of Graduation: 112 (2023-2024)
Language: English
Number of Pages: 34
Keywords (Chinese): 基於查找的矩陣乘法、壓縮、transformer 網路
Keywords (English): lookup-based matrix multiplication, compression, transformer network
    摘要 (Chinese Abstract):
    In recent years, researchers have pursued more efficient deep neural networks, focusing in particular on reducing the computational cost of multiply-accumulate operations. Traditional strategies such as knowledge distillation, pruning, and quantization have been explored in depth. Because multiplication is energy-intensive, new strategies such as AdderNet and ShiftCNN have emerged, which aim to replace the original operations and thereby save energy.
    More recently, MADDNESS proposed an entirely new strategy that directly replaces multiply-accumulate operations with a lookup-accumulate approach. Subsequent works such as PECAN and LUT-NN have followed this direction. Our work further improves LUT-NN and proposes an end-to-end training method. Results on the ImageNet dataset show that our method raises the accuracy of the LUT-NN baseline by up to 11%.


    Abstract:
    In recent years, the quest for efficient Deep Neural Networks (DNNs) has centered on reducing the computational burden of multiply-accumulate (MAC) operations. Traditional avenues such as Knowledge Distillation (KD), pruning, and quantization have been explored extensively. With the energy cost of multiplication operations being a significant concern, alternative methodologies like AdderNet and ShiftCNN have emerged, focusing on the direct substitution of operations to save energy.
    Recently, a novel approach called MADDNESS took this further by entirely replacing MAC operations with lookup-accumulate (LAC) operations. Several subsequent works, including PECAN and LUT-NN, have followed suit. Our research builds on and notably improves the latest of these methods, LUT-NN, introducing an end-to-end training procedure. Tested on the ImageNet dataset, our proposed method significantly enhances the efficiency of DNNs, improving upon the baseline LUT-NN model’s accuracy by up to 11%.
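
    The lookup-accumulate idea summarized above can be sketched briefly: input rows are split into subvectors, each subvector is encoded as the index of its nearest prototype, and the matrix product is then assembled by summing precomputed table entries instead of performing multiply-accumulates. The following is a minimal NumPy sketch of this general MADDNESS/LUT-NN-style scheme, not the thesis's actual implementation; the function names, shapes, and random prototypes are illustrative assumptions (in practice the prototypes would be learned, e.g. with k-means on activations).

import numpy as np

def build_tables(W, prototypes):
    """Precompute lookup tables: tables[c, k, n] = <prototype k of codebook c, rows of W in sub-block c, column n>.

    W:          (D, N) weight matrix, split along D into C sub-blocks of size d = D // C.
    prototypes: (C, K, d) prototypes for the input subvectors.
    """
    C, K, d = prototypes.shape
    W_split = W.reshape(C, d, -1)                              # (C, d, N)
    return np.einsum('ckd,cdn->ckn', prototypes, W_split)      # (C, K, N)

def encode(X, prototypes):
    """Replace each input subvector by the index of its nearest prototype."""
    C, K, d = prototypes.shape
    X_split = X.reshape(X.shape[0], C, 1, d)                   # (B, C, 1, d)
    dists = ((X_split - prototypes) ** 2).sum(-1)              # (B, C, K) squared distances
    return dists.argmin(-1)                                    # (B, C) integer codes

def lookup_accumulate(codes, tables):
    """Approximate X @ W by summing the table rows selected by the codes (only lookups and additions)."""
    C = codes.shape[1]
    return sum(tables[c, codes[:, c]] for c in range(C))       # (B, N)

# Usage: approximate a (B, D) x (D, N) product.
rng = np.random.default_rng(0)
B, D, N, C, K = 4, 32, 8, 8, 16
X, W = rng.normal(size=(B, D)), rng.normal(size=(D, N))
protos = rng.normal(size=(C, K, D // C))                       # placeholder; normally learned prototypes
tables = build_tables(W, protos)
Y_approx = lookup_accumulate(encode(X, protos), tables)
print(Y_approx.shape, np.abs(Y_approx - X @ W).mean())

    The accuracy of the approximation depends entirely on how well the prototypes cover the input subvectors, which is why prototype learning is central to MADDNESS, PECAN, and LUT-NN.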

    Table of Contents
    Acknowledgements
    摘要 (Chinese Abstract)
    Abstract
    1 Introduction
    2 Background
      2.1 Scalar Quantization
      2.2 Product Quantization
        2.2.1 Lookup-Based Matrix Multiplication
        2.2.2 Time and Space Complexity Analysis
      2.3 Vision Transformer
        2.3.1 Multi-head Self Attention
        2.3.2 Transformer Block for Images
        2.3.3 The Class Token
      2.4 ResMLP
    3 Related Work
      3.1 MADDNESS
        3.1.1 MADDNESSHASH
        3.1.2 Prototypes Optimization
      3.2 PECAN
        3.2.1 Angle-Based Similarity
        3.2.2 L1 Norm-Based Similarity
      3.3 LUT-NN
    4 Proposed Methods
      4.1 Differentiable Product Quantization
        4.1.1 Differentiable Encoding
        4.1.2 Learned Temperature
        4.1.3 Updating Prototypes via Gradient Descent
      4.2 Scalar Quantization-Aware Training at Table Level
      4.3 K-Means Clustering for Initialization on More Samples
      4.4 Self-Knowledge Distillation
        4.4.1 Soft Distillation
        4.4.2 Hard Distillation
    5 Experimental Results
      5.1 Layer Compression
      5.2 Two Types of Knowledge Distillation
      5.3 The Impact of Data Types of Table
      5.4 Comparison between Our Work and LUT-NN
      5.5 Ablation Study
      5.6 Evaluating Our Method's Impact on the ResMLP-S12
    6 Conclusion and Future Work
    References
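
    Sections 4.1.1 (Differentiable Encoding) and 4.1.2 (Learned Temperature) of the outline indicate that the hard nearest-prototype assignment is relaxed so that prototypes and temperature can be trained end to end by gradient descent. The sketch below shows one common form of such a relaxation, a temperature-scaled softmax over negative subvector-to-prototype distances; the class name, tensor shapes, and the single learned log-temperature per codebook are hypothetical assumptions for illustration, not the thesis's actual formulation.

import torch
import torch.nn.functional as F

class SoftEncoder(torch.nn.Module):
    """Soft nearest-prototype encoding: a differentiable stand-in for argmin.

    Prototypes (C, K, d) and a per-codebook log-temperature are learned jointly;
    as the temperature shrinks, the softmax approaches the hard one-hot code
    used by a lookup-based layer at inference time.
    """
    def __init__(self, num_codebooks, num_prototypes, subvec_dim):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_codebooks, num_prototypes, subvec_dim))
        self.log_temperature = torch.nn.Parameter(torch.zeros(num_codebooks))

    def forward(self, x):                                          # x: (B, C, d) input subvectors
        d2 = ((x.unsqueeze(2) - self.prototypes) ** 2).sum(-1)     # (B, C, K) squared distances
        t = self.log_temperature.exp().view(1, -1, 1)              # (1, C, 1) positive temperatures
        return F.softmax(-d2 / t, dim=-1)                          # (B, C, K) soft assignment weights

encoder = SoftEncoder(num_codebooks=8, num_prototypes=16, subvec_dim=4)
weights = encoder(torch.randn(2, 8, 4))
print(weights.shape, weights.sum(-1))                              # torch.Size([2, 8, 16]); each row sums to 1

    During training, the soft assignment weights can multiply the precomputed table rows, keeping the whole pipeline differentiable; inference can fall back to the hard argmax lookup.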

    [1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image
    recognition,” arXiv preprint arXiv:1409.1556, 2014.
    [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional
    neural networks,” Advances in neural information processing systems, vol. 25,
    2012.
    [3] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection
    with region proposal networks,” Advances in neural information processing systems,
    vol. 28, 2015.
    [4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,”
    in Proceedings of the IEEE conference on computer vision and pattern recognition,
    pp. 3431–3440, 2015.
    [5] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted
    residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer
    vision and pattern recognition, pp. 4510–4520, 2018.
    [6] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer,
    “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,”
    arXiv preprint arXiv:1602.07360, 2016.
    [7] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,”
    in International conference on machine learning, pp. 6105–6114, PMLR, 2019.
    [8] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for
    the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern
    recognition, pp. 11976–11986, 2022.
    [9] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2:
    Co-designing and scaling convnets with masked autoencoders,” in Proceedings of the
    IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16133–16142,
    2023.
    [10] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,”
    arXiv preprint arXiv:1503.02531, 2015.
    [11] D. Blalock, J. J. Gonzalez Ortiz, J. Frankle, and J. Guttag, “What is the state of neural
    network pruning?,” Proceedings of machine learning and systems, vol. 2, pp. 129–146,
    2020.
    [12] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, “Sparsity in deep learning:
    Pruning and growth for efficient inference and training in neural networks,” The Journal
    of Machine Learning Research, vol. 22, no. 1, pp. 10882–11005, 2021.
    [13] V. Natesh, A. Sabot, H. Kung, and M. Ting, “Rosko: Row skipping outer products for
    sparse matrix multiplication kernels,” arXiv preprint arXiv:2307.03930, 2023.
    [14] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and
    D. Kalenichenko, “Quantization and training of neural networks for efficient
    integer-arithmetic-only inference,” in Proceedings of the IEEE conference on computer vision
    and pattern recognition, pp. 2704–2713, 2018.
    [15] M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, “Up or down?
    adaptive rounding for post-training quantization,” in International Conference on Machine
    Learning, pp. 7197–7206, PMLR, 2020.
    [16] Y. Bhalgat, J. Lee, M. Nagel, T. Blankevoort, and N. Kwak, “Lsq+: Improving low-bit
    quantization through learnable offsets and better initialization,” in Proceedings of the
    IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 696–
    697, 2020.
    [17] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in 2014
    IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC),
    pp. 10–14, 2014.
    [18] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification
    using binary convolutional neural networks,” in European conference on computer vision,
    pp. 525–542, Springer, 2016.
    [19] H. Chen, Y. Wang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu, “Addernet: Do we really
    need multiplications in deep learning?,” in Proceedings of the IEEE/CVF conference on
    computer vision and pattern recognition, pp. 1468–1477, 2020.
    [20] Y. Xu, C. Xu, X. Chen, W. Zhang, C. Xu, and Y. Wang, “Kernel based progressive distillation
    for adder neural networks,” Advances in Neural Information Processing Systems,
    vol. 33, pp. 12322–12333, 2020.
    [21] D. A. Gudovskiy and L. Rigazio, “Shiftcnn: Generalized low-precision architecture for
    inference of convolutional neural networks,” arXiv preprint arXiv:1706.02393, 2017.
    [22] D. Blalock and J. Guttag, “Multiplying matrices without multiplying,” in International
    Conference on Machine Learning, pp. 992–1004, PMLR, 2021.
    [23] X. Tang, Y. Wang, T. Cao, L. L. Zhang, Q. Chen, D. Cai, Y. Liu, and M. Yang,
    “Lut-nn: Towards unified neural network inference by table lookup,” arXiv preprint
    arXiv:2302.03213, 2023.
    [24] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,”
    IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 1, pp. 117–
    128, 2010.
    [25] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,”
    in Proceedings of the fifth Berkeley symposium on mathematical statistics and
    probability, pp. 281–297, Oakland, CA, USA, 1967.
    [26] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
    M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers
    for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
    [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
    I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems,
    vol. 30, 2017.
    [28] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint
    arXiv:1606.08415, 2016.
    [29] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint
    arXiv:1607.06450, 2016.
    [30] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence
    to sequence learning,” in International conference on machine learning, pp. 1243–1252,
    PMLR, 2017.
    [31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional
    transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
    [32] H. Touvron, P. Bojanowski, M. Caron, M. Cord, A. El-Nouby, E. Grave, G. Izacard,
    A. Joulin, G. Synnaeve, J. Verbeek, et al., “Resmlp: Feedforward networks for image
    classification with data-efficient training,” IEEE Transactions on Pattern Analysis and Machine
    Intelligence, vol. 45, no. 4, pp. 5314–5321, 2022.
    [33] J. Ran, R. Lin, J. C. L. Li, J. Zhou, and N. Wong, “Pecan: A product-quantized content
    addressable memory network,” in 2023 Design, Automation & Test in Europe Conference
    & Exhibition (DATE), pp. 1–6, IEEE, 2023.
    [34] T. Chen, L. Li, and Y. Sun, “Differentiable product quantization for end-to-end embedding
    compression,” in International Conference on Machine Learning, pp. 1617–1626, PMLR,
    2020.
    [35] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, “Differentiable soft
    quantization: Bridging full-precision and low-bit neural networks,” in Proceedings of the
    IEEE/CVF international conference on computer vision, pp. 4852–4861, 2019.
    [36] A. Fan, P. Stock, B. Graham, E. Grave, R. Gribonval, H. Jegou, and A. Joulin,
    “Training with quantization noise for extreme model compression,” arXiv preprint
    arXiv:2004.07320, 2020.
    [37] V. Markovtsev, “Kmcuda.” https://github.com/src-d/kmcuda, 2020.
    [38] Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz, “Yinyang k-means: A drop-in
    replacement of the classic k-means with consistent speedup,” in International conference
    on machine learning, pp. 579–587, PMLR, 2015.
    [39] H. Touvron, M. Cord, and H. Jégou, “Deit iii: Revenge of the vit,” in European Conference
    on Computer Vision, pp. 516–533, Springer, 2022.
