| 研究生 (Student) | 賴冠穎 Lai, Kuan-Ying |
|---|---|
| 論文名稱 (Thesis Title) | Improve Upsampling in Semantic Segmentation with Pixel Shuffle (結合 Pixel Shuffle 來改善語意分割中的上採樣) |
| 指導教授 (Advisor) | 林永隆 Lin, Youn-Long |
| 口試委員 (Committee Members) | 王廷基 Wang, Ting-Chi; 黃俊達 Huang, Juinn-Dar |
| 學位類別 (Degree) | Master |
| 系所名稱 (Department) | College of Electrical Engineering and Computer Science, Department of Computer Science |
| 論文出版年 (Year of Publication) | 2022 |
| 畢業學年度 (Academic Year) | 111 |
| 語文別 (Language) | English |
| 論文頁數 (Pages) | 22 |
| 中文關鍵詞 (Chinese Keywords) | Semantic Segmentation, Bilinear Upsampling |
| 外文關鍵詞 (English Keywords) | Pixel Shuffle, Bilinear Upsample |
Bilinear upsampling is a tensor operation used to enlarge the feature maps computed by convolutional neural networks or to align tensors of different sizes. By studying the spatial properties of features and the cost of bilinear upsampling in semantic segmentation, we combine Pixel Shuffle with bilinear upsampling and apply the resulting method to SegFormer. Experiments on ADE20K and Cityscapes show that the proposed method reduces inference time by more than 10% while maintaining comparable accuracy. We further experiment with FCN and FCHarDNet to demonstrate the generality of the proposed method.
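
As an illustration of how Pixel Shuffle can share the upsampling work with bilinear interpolation, the PyTorch sketch below lets a sub-pixel (Pixel Shuffle) step cover part of the scale factor so that only a smaller bilinear step remains. This is a minimal, hypothetical example rather than the thesis's actual SegFormer modification; the class name `PixelShuffleUpsampleHead`, the `shuffle_factor` parameter, and the tensor shapes in the usage lines are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelShuffleUpsampleHead(nn.Module):
    """Toy segmentation head: PixelShuffle handles a cheap 2x step, bilinear does the rest.

    Hypothetical sketch, not the thesis implementation.
    """

    def __init__(self, in_channels: int, num_classes: int, shuffle_factor: int = 2):
        super().__init__()
        # Project to num_classes * r^2 channels so PixelShuffle can rearrange
        # channel blocks into an r-times larger spatial grid.
        self.proj = nn.Conv2d(in_channels, num_classes * shuffle_factor ** 2, kernel_size=1)
        self.shuffle = nn.PixelShuffle(shuffle_factor)

    def forward(self, x: torch.Tensor, out_size) -> torch.Tensor:
        # Sub-pixel upsampling by the shuffle factor: only a channel-to-space rearrangement.
        x = self.shuffle(self.proj(x))
        # The remaining, now smaller, scale change is handled by bilinear interpolation.
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)


# Usage with hypothetical shapes: a 32x32 decoder feature map upsampled to 128x128.
feat = torch.randn(1, 256, 32, 32)
head = PixelShuffleUpsampleHead(in_channels=256, num_classes=19)
logits = head(feat, out_size=(128, 128))   # -> (1, 19, 128, 128)
print(logits.shape)
```

The intended point of such a split is that Pixel Shuffle only rearranges existing channel values into spatial positions, so the remaining bilinear interpolation runs at a smaller magnification than a single large bilinear upsample would.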