| 研究生 (Student) | 賴冠穎 Lai, Kuan-Ying |
|---|---|
| 論文名稱 (Thesis Title) | Improve Upsampling in Semantic Segmentation with Pixel Shuffle (結合 Pixel Shuffle 來改善語意分割中的上採樣) |
| 指導教授 (Advisor) | 林永隆 Lin, Youn-Long |
| 口試委員 (Committee Members) | 王廷基 Wang, Ting-Chi; 黃俊達 Huang, Juinn-Dar |
| 學位類別 (Degree) | Master |
| 系所名稱 (Department) | College of Electrical Engineering and Computer Science, Department of Computer Science |
| 論文出版年 (Year of Publication) | 2022 |
| 畢業學年度 (Academic Year) | 111 |
| 語文別 (Language) | English |
| 論文頁數 (Pages) | 22 |
| 中文關鍵詞 (Chinese Keywords) | Semantic Segmentation, Bilinear Upsampling |
| 外文關鍵詞 (English Keywords) | Pixel Shuffle, Bilinear Upsample |
Bilinear upsampling is a tensor operation used to enlarge the feature maps computed by convolutional neural networks or to align tensors of different sizes. By studying the spatial properties of features and the cost of bilinear upsampling in semantic segmentation, we combine Pixel Shuffle with bilinear upsampling and apply the resulting method to SegFormer. Experiments on ADE20K and Cityscapes show that the proposed method reduces inference time by more than 10% while maintaining comparable accuracy. We further experiment with FCN and FCHarDNet to demonstrate the generality of the proposed method.
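
As an illustration of how Pixel Shuffle can share the upsampling work with bilinear interpolation, the PyTorch sketch below lets a sub-pixel (Pixel Shuffle) step cover part of the scale factor so that only a smaller bilinear step remains. This is a minimal, hypothetical example rather than the thesis's actual SegFormer modification; the class name `PixelShuffleUpsampleHead`, the `shuffle_factor` parameter, and the tensor shapes in the usage lines are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelShuffleUpsampleHead(nn.Module):
    """Toy segmentation head: PixelShuffle handles a cheap 2x step, bilinear does the rest.

    Hypothetical sketch, not the thesis implementation.
    """

    def __init__(self, in_channels: int, num_classes: int, shuffle_factor: int = 2):
        super().__init__()
        # Project to num_classes * r^2 channels so PixelShuffle can rearrange
        # channel blocks into an r-times larger spatial grid.
        self.proj = nn.Conv2d(in_channels, num_classes * shuffle_factor ** 2, kernel_size=1)
        self.shuffle = nn.PixelShuffle(shuffle_factor)

    def forward(self, x: torch.Tensor, out_size) -> torch.Tensor:
        # Sub-pixel upsampling by the shuffle factor: only a channel-to-space rearrangement.
        x = self.shuffle(self.proj(x))
        # The remaining, now smaller, scale change is handled by bilinear interpolation.
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)


# Usage with hypothetical shapes: a 32x32 decoder feature map upsampled to 128x128.
feat = torch.randn(1, 256, 32, 32)
head = PixelShuffleUpsampleHead(in_channels=256, num_classes=19)
logits = head(feat, out_size=(128, 128))   # -> (1, 19, 128, 128)
print(logits.shape)
```

The intended point of such a split is that Pixel Shuffle only rearranges existing channel values into spatial positions, so the remaining bilinear interpolation runs at a smaller magnification than a single large bilinear upsample would.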