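Specialize and Fuse: Pyramidal Representation for Semantic Segmentation | National Tsing Hua University Master's and Doctoral Thesis Repository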

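Author:	蕭棋薇 Hsiao, Chi-Wei
Thesis title:	Specialize and Fuse: Pyramidal Representation for Semantic Segmentation (Chinese title: 應用於語義分割之金字塔狀輸出表徵)
Advisor:	朱宏國 Chu, Hung-Kuo
Committee members:	陳煥宗 Chen, Hwann-Tzong; 孫民 Sun, Min; 彭文孝 Peng, Wen-Hsiao
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication:	2020
Academic year of graduation:	109
Language:	English
Number of pages:	42
Keywords (Chinese):	語義分割 (semantic segmentation), 深度學習 (deep learning)
Keywords (English):	semantic segmentation, deep learning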

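In this thesis, we propose a pyramidal output representation for semantic segmentation. First, we define a "semantic pyramid" as a set of semantic maps at different spatial scales. Each semantic map consists of cells arranged in a grid; a cell whose pixels all belong to a single semantic class is called a "unit-cell". To encourage parsimony, we assign each pixel to the unit-cell at the coarsest scale that contains it, and construct a "unity pyramid" to record this assignment. We train a model end-to-end to predict both the semantic pyramid and the unity pyramid. At inference, we use the unity pyramid to fuse the semantic maps at each scale of the semantic pyramid into the final semantic map. Our output representation reduces the effective number of outputs, which favors parsimony, because the number of unit-cells that actually take part in the fusion is far smaller than the number of per-pixel outputs (i.e., the output space of standard semantic segmentation). In addition, our model learns to specialize in the prediction at each scale, reflecting the natural distribution of unit-cells for each semantic class (e.g., the class sky is usually assigned to coarser scales). Finally, we propose a coarse-to-fine contextual module that not only further improves model performance but is also consistent with the nature of our pyramidal output representation. We validate the effectiveness of each proposed key module through thorough ablation experiments. Our method achieves state-of-the-art performance on the ADE20K and COCO-Stuff 10K datasets.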

We present a novel pyramidal representation for semantic segmentation that takes advantage of the typical scales of semantic classes (e.g., a road segment is typically larger than a car segment). First, we define a "semantic pyramid" comprising semantic maps at various scales. Each map consists of a grid of cells, and a "unit-cell" is a cell whose pixels all belong to a single class. To encourage parsimony, we assign each pixel to the unit-cell at the coarsest scale that contains it and construct a "unity pyramid" to indicate the assignment. We train a joint model end-to-end to predict both pyramids. At inference, the predicted unity pyramid fuses the semantic pyramid into the final per-pixel semantic map. Our representation reduces the effective number of predictions in favor of parsimony, since the number of unit-cells to be fused is significantly smaller than the number of pixels (i.e., the standard output space). Moreover, our model learns to specialize in the prediction at each scale, reflecting the natural distribution of unit-cells for each semantic class (e.g., skies are typically assigned at coarser scales). Finally, we propose a coarse-to-fine contextual module that accords with the essence of our pyramidal representation and brings further improvements. We validate the effectiveness of each key module in our method through extensive ablation studies. Our approach achieves state-of-the-art performance on the ADE20K and COCO-Stuff 10K datasets.
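The pyramid construction and fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: it works on a ground-truth label map rather than on network predictions, and the function names `build_pyramids` and `fuse`, the ignore value `-1` for mixed cells, and the assumption that cell sizes are nested and end at 1 pixel are all hypothetical choices made for illustration.

```python
import numpy as np


def upsample(grid, s):
    """Repeat every cell of a coarse grid s times along both axes."""
    return np.repeat(np.repeat(grid, s, axis=0), s, axis=1)


def build_pyramids(labels, cell_sizes):
    """Build (semantic_map, unity_map) pairs from coarse to fine.

    labels: (H, W) integer class map.
    cell_sizes: cell side lengths from coarse to fine, e.g. (2, 1).
        Assumption: each size divides H, W and the previous size.
    """
    H, W = labels.shape
    assigned = np.zeros((H, W), dtype=bool)  # pixels already owned by a coarser unit-cell
    pyramid = []
    for s in cell_sizes:
        cells = labels.reshape(H // s, s, W // s, s)
        cell_min = cells.min(axis=(1, 3))
        cell_max = cells.max(axis=(1, 3))
        pure = cell_min == cell_max  # every pixel in the cell has the same class
        covered = assigned.reshape(H // s, s, W // s, s).all(axis=(1, 3))
        unity = pure & ~covered  # unit-cells first claimed at this scale
        semantic = np.where(pure, cell_min, -1)  # -1 marks mixed cells (assumed ignore value)
        assigned |= upsample(unity, s)
        pyramid.append((semantic, unity))
    return pyramid


def fuse(pyramid, cell_sizes):
    """Fuse the semantic maps into one per-pixel map, guided by the unity maps."""
    fused = None
    for (semantic, unity), s in zip(pyramid, cell_sizes):
        sem_up, uni_up = upsample(semantic, s), upsample(unity, s)
        if fused is None:
            fused = np.full(sem_up.shape, -1, dtype=semantic.dtype)
        fused[uni_up] = sem_up[uni_up]  # each pixel takes the label of the cell that owns it
    return fused


# Toy 4x4 label map: three homogeneous 2x2 blocks and one mixed block.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 2],
                   [3, 3, 3, 3],
                   [3, 3, 3, 3]])
cell_sizes = (2, 1)
pyramid = build_pyramids(labels, cell_sizes)
fused = fuse(pyramid, cell_sizes)
n_cells = sum(int(unity.sum()) for _, unity in pyramid)
print(n_cells, labels.size)  # 7 unit-cells versus 16 pixels
```

In this toy case the three homogeneous 2x2 blocks are each represented by a single coarse unit-cell and only the mixed block falls through to the pixel scale, so 7 unit-cells reproduce all 16 pixel labels. This is the sense in which the representation reduces the effective number of predictions.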

List of Tables
List of Figures
Abstract (Chinese)
Abstract
Introduction
Related work
1 Contextual Modules
2 Hierarchical Semantic Segmentation Prediction
Approach
1 Overview
2 Semantic Pyramid and Unity Pyramid
2.1 Pyramid Structure
2.2 Notation
2.3 Pyramidal Ground Truth
2.4 The Training Phase
2.5 Fuser: Fusing Semantic Pyramid Based on Unity Pyramid
3 Predicting the Unity Pyramid
4 Predicting the Semantic Pyramid with the Coarse-to-Fine Contextual Module
4.1 Coarse-to-Fine Context Updating and Aggregation
4.2 Context Aggregation
4.3 Context Updating
5 Computation efficiency
6 Implementation detail
Experiments
1 Datasets and Metric
2 Comparison with State-of-the-Arts
3 Ablation Study
3.1 Detailed settings of ablation experiments
3.2 The Effectiveness of the Pyramidal Representation
3.3 The Effectiveness of the Contextual Module
3.4 Does the Improvement Stem from Auxiliary Supervision?
3.5 The Training Procedure
3.6 Does Our Pyramidal Representation Improve More on Boundary Regions?
4 Performance analysis
5 Qualitative results
Conclusion and Future Work
Bibliography
                                

