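指馬為鹿：易於編修之語意切割表示法 (Neural Palettes: Lightweight Editable Representations for Semantic Segmentation) | National Tsing Hua University Theses and Dissertations Repository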


Graduate student:	袁晟洋 Yuan, Cheng-Yang
Thesis title:	指馬為鹿：易於編修之語意切割表示法 Neural Palettes: Lightweight Editable Representations for Semantic Segmentation
Advisor:	陳煥宗 Chen, Hwann-Tzong
Committee members:	劉庭祿 Liu, Tyng-Luh; 賴尚宏 Lai, Shang-Hong
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2023
Academic year of graduation:	111 (ROC calendar)
Language:	English
Number of pages:	43
Keywords:	Machine Learning, Image Semantic Segmentation


Modern semantic segmentation models often require large GPU memory
footprints during training. Even a slight modification to the segmentation
results, such as fine-tuning on a subset of the dataset or fitting to a single
video, requires a large amount of memory for back-propagation-based
optimization. This thesis presents a new method, Neural Palette, which replaces
the conventional classification head with a lightweight module that projects
high-dimensional feature embeddings onto a 2D space and uses 2D radial basis
function (RBF) kernels to generate predictions. The projected 2D points form
an editable map that provides interpretable semantics and, more importantly,
enables the refinement of model predictions with a single forward pass, without
needing additional memory for back-propagation. The proposed method can
be incorporated into any pre-trained semantic segmentation model at a low
training cost to enhance the model's interpretability and its flexibility for
fine-tuning. We show that the Neural-Palette-equipped model achieves
comparable results on the original tasks and performs better when fine-tuned on
subsequent tasks, while consuming less GPU memory than the original model.
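
The prediction step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis's implementation: the linear projection, the single Gaussian RBF center per class, the `gamma` bandwidth, and the names `project_to_2d` and `rbf_predict` are all assumptions made for illustration. The sketch only shows the general idea that moving a 2D center changes the predictions with a forward pass and no gradient computation.

```python
import numpy as np

def project_to_2d(features, weight, bias):
    """Hypothetical linear projection: (N, D) features -> (N, 2) points."""
    return features @ weight + bias

def rbf_predict(points, centers, gamma=1.0):
    """Score each 2D point against one Gaussian RBF center per class.

    points:  (N, 2) projected features
    centers: (C, 2) one editable 2D anchor per class (an assumption)
    returns: (N, C) kernel responses normalized to sum to 1 per point
    """
    diff = points[:, None, :] - centers[None, :, :]   # (N, C, 2)
    sq_dist = np.sum(diff ** 2, axis=-1)              # (N, C)
    kernel = np.exp(-gamma * sq_dist)                 # Gaussian RBF response
    return kernel / kernel.sum(axis=1, keepdims=True)

# Toy data: 5 feature vectors of dimension 16, 3 classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 16))
weight = rng.normal(size=(16, 2)) * 0.1
bias = np.zeros(2)
centers = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

points = project_to_2d(features, weight, bias)
probs = rbf_predict(points, centers)
labels = probs.argmax(axis=1)

# "Editing" the map: move the class-2 center onto the first projected point.
# Only a forward pass is re-run; no gradients are computed.
edited = centers.copy()
edited[2] = points[0]
edited_labels = rbf_predict(points, edited).argmax(axis=1)
```

Under these assumptions, the first point is assigned to class 2 after the edit because its distance to the moved center is zero, which gives the largest possible kernel response.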

List of Tables
List of Figures
Abstract (in Chinese)
Abstract
Introduction
Related Work
Approach
1 Preliminary
2 Overview
3 2D space converter
4 RBF predictor
5 Editable map
6 Flexibility
Experiments
1 Datasets and evaluations
2 Implementation details
3 Comparison with original models
4 Fine-tuning with a subset
5 Fine-tuning for videos
6 Visualization of the editable map
7 Robustness
8 Testing on more videos
9 Ablation study
9.1 Another editing method
9.2 Push loss ablation
9.3 Upper and lower bounds for novel class
9.4 Limitation
Conclusion and Future Work
Bibliography

