| Graduate Student: | Li, Kuang-Chien (黎光健) |
|---|---|
| Thesis Title: | Semantic Segmentation via Enhancing Context Information by Fusing Multiple High-Level Features (基於利用整合多重高語意層來強化語意訊息之語意分割) |
| Advisor: | Chiu, Ching-Te (邱瀞德) |
| Committee Members: | Soo, Von-Wun (蘇豐文); Chang, Long-Wen (張隆紋) |
| Degree: | Master |
| Department: | Department of Computer Science, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2019 |
| Graduation Academic Year: | 108 |
| Language: | English |
| Number of Pages: | 60 |
| Keywords: | Semantic Segmentation, Convolutional Neural Network, Feature Fusion, Blur Pooling |
Semantic segmentation has long been one of the central research topics in computer vision. It aims to parse a 2D image into a pixel-wise classification map of the objects it contains. Deep learning has seen many breakthroughs in recent years, and many studies have applied it to semantic segmentation with good results.
In semantic segmentation, a model's ability to integrate spatial information (such as an object's original shape and precise location) and context information (abstract semantics such as pedestrians, cars, and sky) are the two key factors that determine its performance. Many models follow the encoder-decoder design of FCN [1]: a series of downsampling stages gradually embeds context information into the output feature maps, and a decoder then produces the desired object label map. However, spatial information is progressively lost during downsampling, so the decoder cannot accurately restore object shapes. Many recent models therefore adopt the U-shape architecture proposed in [3], fusing high-level and low-level feature maps to recover spatial information and improve prediction accuracy. Our architecture instead optimizes the ability to integrate context information.
We propose a module (Multiple Up-Sampling Blocks and a Concatenated DB) that strengthens the integration of context information: by fusing features from different convolutional layers on the decoder side, it enriches the context information and thereby improves the model's accuracy. We also apply the blur pooling architecture proposed by R. Zhang in [4], which makes the model more shift-invariant during downsampling and thus strengthens the encoder's ability to embed context information. In addition, we further improve the model with the soft-IoU loss, which is better suited than the cross-entropy loss for optimizing mean-IoU in semantic segmentation.
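To make the blur-pooling idea concrete, below is a minimal PyTorch sketch of the anti-aliased downsampling proposed in [4]: take the max densely (stride 1), smooth with a fixed binomial low-pass filter, then subsample. The 3x3 kernel, stride 2, and the module name MaxBlurPool2d are illustrative assumptions, not the exact configuration used in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxBlurPool2d(nn.Module):
    """Max pooling followed by blur and subsampling (sketch after [4])."""

    def __init__(self, channels, stride=2):
        super().__init__()
        self.stride = stride
        self.channels = channels
        # Fixed 3x3 binomial kernel ([1,2,1] outer [1,2,1], normalized),
        # applied depthwise (one copy per channel), never trained.
        k = torch.tensor([1.0, 2.0, 1.0])
        k = (k[:, None] * k[None, :]) / 16.0
        self.register_buffer("kernel", k.expand(channels, 1, 3, 3).contiguous())

    def forward(self, x):
        # Dense max (stride 1) keeps the max response at every position.
        x = F.max_pool2d(x, kernel_size=2, stride=1)
        # Low-pass filter before subsampling to reduce aliasing, which is
        # what makes the downsampled features more shift-invariant.
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)

# Usage: halves the spatial resolution of a 64-channel feature map.
pool = MaxBlurPool2d(channels=64)
y = pool(torch.randn(1, 64, 64, 64))  # -> (1, 64, 32, 32)
```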
Finally, we evaluate our architecture on the urban street-scene dataset (CamVid) built by Brostow et al. [2]. We obtain a mean-IoU of 70.534%, surpassing the 65.8% reported for FC-DenseNet67 in [6] and the 66.9% achieved by FC-DenseNet103, the best architecture in the same paper.
Semantic Segmentation has been one of the most important research areas in computer vision. The goal is pixel-wise classification of a 2D RGB image. In recent years, convolutional neural networks (CNNs) have achieved success in many research domains, and many researchers have also tested the limits of this technology on Semantic Segmentation tasks.
In the Semantic Segmentation field, Context Information (abstract semantic meaning such as pedestrians, cars, and sky) and Spatial Information (such as the primitive shape of an object or its precise location) are the key factors that determine the performance of a CNN model. Many state-of-the-art Semantic Segmentation models follow the encoder-decoder framework of the Fully Convolutional Network (FCN) [1]. Context Information is gradually embedded into the output feature map of each CNN layer, and the network then generates the final object label map with the decoder. However, a CNN gradually loses Spatial Information during continuous downsampling, which makes it impossible for the decoder to accurately restore the shape of an object. Many recent models therefore adopt the U-shape structure proposed in [3], improving precision by fusing high-level and low-level feature maps. Our work focuses instead on optimizing the Context Information.
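To illustrate the high/low-level fusion that such U-shape decoders perform, here is a generic PyTorch sketch of one skip-connection step in the spirit of [3]: the context-rich deep map is upsampled to the encoder map's resolution, concatenated with it, and fused by a convolution. The module name, channel sizes, and the 3x3 fusion layer are assumptions for illustration, not the exact blocks proposed in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusionUp(nn.Module):
    """One U-shape decoder step: upsample, concatenate, fuse."""

    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(deep_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, skip):
        # Recover the resolution lost to downsampling...
        deep = F.interpolate(deep, size=skip.shape[-2:], mode="bilinear",
                             align_corners=False)
        # ...then re-inject the encoder's spatial detail by channel-wise
        # concatenation before fusing with a convolution.
        return self.fuse(torch.cat([deep, skip], dim=1))

# Usage: fuse a 256-channel deep map with a 64-channel encoder map.
up = SkipFusionUp(deep_ch=256, skip_ch=64, out_ch=128)
out = up(torch.randn(1, 256, 16, 16), torch.randn(1, 64, 32, 32))  # (1, 128, 32, 32)
```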
We propose a structure consisting of Multiple Up-Sampling Blocks and a Concatenated DB, which enhances the Context Information on the decoder side by fusing multiple high-level features. We also apply Blur Pooling [4], a better downsampling method proposed by R. Zhang that makes a CNN more shift-invariant to the input image. Finally, we identify a drawback of cross-entropy on Semantic Segmentation tasks, namely its tendency to perform poorly on small objects, and further boost the performance of our model by applying the soft-IoU loss.
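A minimal sketch of a soft-IoU loss of the kind referred to above, computed on softmax probabilities so it remains differentiable; the function name and the omission of ignore-label handling and class weighting are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def soft_iou_loss(logits, target, eps=1e-6):
    """logits: (N, C, H, W) raw scores; target: (N, H, W) integer labels."""
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    # One-hot ground truth reshaped to match probs: (N, C, H, W).
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    # Soft intersection and union per class, summed over batch and space.
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = (probs + onehot - probs * onehot).sum(dim=(0, 2, 3))
    # 1 - mean per-class IoU: unlike per-pixel cross entropy, every class
    # contributes equally, so small objects are not drowned out.
    return 1.0 - (inter / (union + eps)).mean()

# Usage with random data: 11 classes, as in CamVid.
loss = soft_iou_loss(torch.randn(2, 11, 32, 32),
                     torch.randint(0, 11, (2, 32, 32)))
```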
Lastly, we test our model on the urban street-scene dataset established by Brostow et al. [2] (CamVid). Our method achieves a mean-IoU of 70.534%, outperforming the 65.8% reported for FC-DenseNet67 [6].
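For reference, the mean-IoU reported above is the per-class intersection-over-union, TP / (TP + FP + FN), averaged over classes. Below is a standard NumPy sketch of the metric (not the thesis's exact evaluation script); treating labels outside [0, num_classes) as void is an assumption.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label arrays of the same shape."""
    mask = (gt >= 0) & (gt < num_classes)  # drop void / ignore labels
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(num_classes * gt[mask] + pred[mask],
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp  # TP + FP + FN per class
    iou = tp / np.maximum(union, 1)  # avoid division by zero
    return iou.mean()

# Usage: perfect predictions give mean-IoU = 1.0.
labels = np.random.randint(0, 11, size=(360, 480))
print(mean_iou(labels, labels, num_classes=11))  # 1.0
```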
[1] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[2] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in European conference on computer vision. Springer, 2008, pp. 44–57.
[3] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[4] R. Zhang, “Making convolutional networks shift-invariant again,” arXiv preprint arXiv:1904.11486, 2019.
[5] G. Mattyus, W. Luo, and R. Urtasun, “Deeproadmapper: Extracting road topology from aerial images,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3438–3446.
[6] S. Jegou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 11–19.
[7] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325–341.
[8] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[9] T. Tieleman and G. Hinton, “Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.
[10] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
[11] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, vol. 1, no. 10, p. e3, 2016.
[12] S. Mannor, D. Peleg, and R. Rubinstein, “The cross entropy method for classification,” in Proceedings of the 22nd international conference on Machine learning. ACM, 2005, pp. 561–568.
[13] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[14] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
[15] Y. Yuan and J. Wang, “Ocnet: Object context network for scene parsing,” arXiv preprint arXiv:1809.00916, 2018.
[16] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 405–420.
[17] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.
[18] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
[19] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[20] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
[21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[22] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010, pp. 807–814.
[23] G. Ghiasi and C. C. Fowlkes, “Laplacian pyramid reconstruction and refinement for semantic segmentation,” in European Conference on Computer Vision. Springer, 2016, pp. 519–534.
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[25] H. Wu and X. Gu, “Max-pooling dropout for regularization of convolutional neural networks,” pp. 46–54, 2015.
[26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[28] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[29] M. Amirul Islam, M. Rochan, N. D. Bruce, and Y. Wang, “Gated feedback refinement network for dense image labeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3751–3759.
[30] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters–improve semantic segmentation by global convolutional network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4353–4361.
[31] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[32] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun, “Exfuse: Enhancing feature fusion for semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 269–284.
[33] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
[34] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 1451–1460.
[35] A. Mordvintsev, C. Olah, and M. Tyka, “Inceptionism: Going deeper into neural networks,” 2015.
[36] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.
[37] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
[38] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.