
Graduate student: Yuan, Cheng-Yang (袁晟洋)
Thesis title: Neural Palettes: Lightweight Editable Representations for Semantic Segmentation (指馬為鹿:易於編修之語意切割表示法)
Advisor: Chen, Hwann-Tzong (陳煥宗)
Committee members: Liu, Tyng-Luh (劉庭祿); Lai, Shang-Hong (賴尚宏)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2023
Academic year of graduation: 111 (ROC calendar)
Language: English
Number of pages: 43
Keywords: Machine Learning, Image semantic segmentation (機器學習, 語意分割)
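  • Modern semantic segmentation models typically require a large amount of GPU
    memory during training. Even a minor change to the segmentation results,
    such as fine-tuning on a subset of the dataset or fitting to a single video,
    still requires a large amount of memory because the optimization relies on
    back-propagation. This thesis proposes a new method, the Neural Palette,
    which replaces the conventional classification head with a lightweight
    module that projects high-dimensional feature embeddings onto a
    two-dimensional space and generates predictions with 2D radial basis
    function (RBF) kernels. The points projected onto the 2D space form an
    editable "palette" that provides interpretable semantics and, more
    importantly, allows model predictions to be refined with a single forward
    pass, with no additional memory for back-propagation. The proposed method
    can be easily integrated into any pre-trained semantic segmentation model
    at a low training cost, improving the model's interpretability and its
    flexibility for fine-tuning. We show that models equipped with the Neural
    Palette achieve comparable results on the original tasks and perform better
    when fine-tuned on subsequent tasks, while consuming less GPU memory than
    the original models.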


    Modern semantic segmentation models often require large GPU memory
    footprints during training. Even a slight modification to the segmentation
    results, such as fine-tuning on a subset of the dataset or fitting to a
    single video, requires a large amount of memory for back-propagation-based
    optimization. This thesis presents a new method, Neural Palette, which
    replaces the conventional classification head with a lightweight module
    that projects high-dimensional feature embeddings onto a 2D space and uses
    2D radial basis function (RBF) kernels to generate predictions. The
    projected 2D points form an editable map that provides interpretable
    semantics and, more importantly, enables the refinement of model
    predictions with a single forward pass, without needing additional memory
    for back-propagation. The proposed method can be readily incorporated into
    any pre-trained semantic segmentation model at a low training cost to
    enhance the model's interpretability and its flexibility for fine-tuning.
    We show that the Neural-Palette-equipped model achieves comparable results
    on the original tasks and performs better when fine-tuned on subsequent
    tasks, while consuming less GPU memory than the original model.
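
    The idea summarized in the abstract can be illustrated with a small sketch.
    It is not the thesis's implementation: the linear projection, the Gaussian
    kernel, the single center per class, and every name in it (project_to_2d,
    rbf_scores, predict, gamma, the "horse" and "deer" centers) are assumptions
    made only for illustration. It shows a feature vector being projected to a
    2D point, scored against 2D class centers with an RBF kernel, and then
    reclassified after a center is moved by hand, using only forward
    computation.

```python
import math


def project_to_2d(feature, weights, bias):
    """Project a D-dimensional feature to a 2D point.

    Assumption: a plain linear map stands in for the thesis's 2D space
    converter, whose actual architecture is not given in this record.
    """
    return tuple(
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weights, bias)
    )


def rbf_scores(point, centers, gamma=1.0):
    """Gaussian RBF response of a 2D point to each class center.

    Assumption: one center per class and a shared bandwidth `gamma`.
    """
    return {
        label: math.exp(-gamma * ((point[0] - cx) ** 2 + (point[1] - cy) ** 2))
        for label, (cx, cy) in centers.items()
    }


def predict(feature, weights, bias, centers, gamma=1.0):
    """Return the class whose center gives the largest RBF response."""
    point = project_to_2d(feature, weights, bias)
    scores = rbf_scores(point, centers, gamma)
    return max(scores, key=scores.get)


# Hypothetical 4-D feature and a projection that keeps its first two entries.
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0]]
bias = [0.0, 0.0]
feature = [1.0, 0.0, 5.0, -2.0]   # projects to the 2D point (1.0, 0.0)

# Hypothetical class centers on the 2D map.
centers = {"horse": (0.0, 0.0), "deer": (3.0, 0.0)}
before = predict(feature, weights, bias, centers)

# "Editing" the map: move the deer center next to the projected point.
# No gradients are computed; the prediction changes on the next forward pass.
edited_centers = dict(centers)
edited_centers["deer"] = (1.0, 0.5)
after = predict(feature, weights, bias, edited_centers)

print(before, "->", after)   # horse -> deer
```

    In this toy setting the edit touches only a pair of 2D coordinates, which
    is why no back-propagation memory is involved; how the thesis actually
    performs and constrains such edits is described in its Sections 3.5 and
    4.9.1, not here.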

    List of Tables
    List of Figures
    Abstract (Chinese)
    Abstract
    1 Introduction
    2 Related Work
    3 Approach
      3.1 Preliminary
      3.2 Overview
      3.3 2D space converter
      3.4 RBF predictor
      3.5 Editable map
      3.6 Flexibility
    4 Experiments
      4.1 Datasets and evaluations
      4.2 Implementation details
      4.3 Comparison with original models
      4.4 Fine-tuning with a subset
      4.5 Fine-tuning for videos
      4.6 Visualization of the editable map
      4.7 Robustness
      4.8 Testing on more videos
      4.9 Ablation study
        4.9.1 Another editing method
        4.9.2 Push loss ablation
        4.9.3 Upper and lower bounds for novel class
        4.9.4 Limitation
    5 Conclusion and Future Work
    Bibliography

