
Graduate Student: 許毓軒 (Xu, Yu-Syuan)
Thesis Title: 可動態調整影片語義分割網路 (Dynamic Video Segmentation Network)
Advisor: 李濬屹 (Lee, Chun-Yi)
Committee Members: 陳煥宗 (Chen, Hwann-Tzong), 黃稚存 (Huang, Chih-Tsun)
Degree: Master
Department:
Year of Publication: 2018
Graduation Academic Year: 106 (ROC calendar, 2017-2018)
Language: English
Number of Pages: 37
Chinese Keywords: 電腦視覺、深度學習、語意分割、機器學習 (computer vision, deep learning, semantic segmentation, machine learning)
Foreign Keywords: Computer Vision, Deep Learning, Machine Learning, Semantic Segmentation
    In recent years, semantic image segmentation has reached unprecedented accuracy on a variety of datasets through the use of deep convolutional neural networks (DCNNs). Accurate semantic segmentation enables a wide range of applications, such as autonomous vehicles, surveillance cameras, and drones. These applications typically demand real-time responsiveness and therefore high frame rates, yet the inference time of deep convolutional neural networks is far too long to meet real-time requirements.
    The dynamic video segmentation network (DVSNet) is proposed to achieve fast and accurate video semantic segmentation. DVSNet consists of two convolutional neural networks: a segmentation network and a flow network. The former produces accurate segmentation results, but is deep and slow; the latter is much faster, but its output requires further processing and is less accurate. DVSNet uses a decision network to dynamically assign frame regions to the two networks according to an expected confidence score: frame regions with a higher expected confidence score are handled by the flow network, while frame regions with a lower expected confidence score must pass through the segmentation network. Experimental results show that DVSNet achieves 70.4% mIoU at 19.8 fps on the Cityscapes dataset, a high-speed version of DVSNet delivers 30.4 fps with 63.2% mIoU on the same dataset, and DVSNet reduces the computational workload by up to 95%.


    In this paper, we present a detailed design of the dynamic video segmentation network (DVSNet) for fast and efficient video semantic segmentation. DVSNet consists of two convolutional neural networks: a segmentation network and a flow network. The former generates highly accurate semantic segmentations, but is deeper and slower. The latter is much faster, but its output requires further processing and yields less accurate segmentations. We explore the use of a decision network to adaptively assign different frame regions to different networks based on a metric called the expected confidence score. Frame regions with a higher expected confidence score traverse the flow network, while frame regions with a lower expected confidence score have to pass through the segmentation network. We have performed extensive experiments on various configurations of DVSNet and investigated a number of variants of the proposed decision network. The experimental results show that our DVSNet is able to achieve 70.4% mIoU at 19.8 fps on the Cityscapes dataset. A high-speed version of DVSNet delivers 30.4 fps with 63.2% mIoU on the same dataset. DVSNet is also able to reduce the computational workload by up to 95%.
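    The dispatch scheme the abstract describes maps naturally onto a short inference loop. The sketch below is only an illustration of that idea, not the thesis' implementation: the callables segmentation_net, flow_net, and decision_net, the 2x2 region grid, the 0.9 threshold, and the nearest-neighbour warp are all assumed placeholders, and the decision network is reduced here to a callable that scores the estimated flow.

```python
import numpy as np

def split_into_regions(frame, grid=2):
    """Split a frame of shape (H, W, ...) into grid*grid regions, row major."""
    return [r for row in np.array_split(frame, grid, axis=0)
            for r in np.array_split(row, grid, axis=1)]

def merge_regions(regions, grid=2):
    """Reassemble the list produced by split_into_regions."""
    rows = [np.concatenate(regions[i * grid:(i + 1) * grid], axis=1)
            for i in range(grid)]
    return np.concatenate(rows, axis=0)

def warp(labels, flow):
    """Nearest-neighbour backward warp of an (H, W) label map along an
    (H, W, 2) optical flow field of (dx, dy) displacements."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, w - 1)
    return labels[src_y, src_x]

def dvsnet_inference(frames, segmentation_net, flow_net, decision_net,
                     threshold=0.9, grid=2):
    """Per-region dispatch: a region whose expected confidence score
    clears the threshold takes the fast flow path; otherwise it takes
    the slow segmentation path and becomes its own new key frame."""
    key_regions, key_segs = None, None
    outputs = []
    for frame in frames:
        regions = split_into_regions(frame, grid)
        if key_regions is None:
            # First frame: every region must pass through the accurate
            # but slow segmentation network.
            key_regions = list(regions)
            key_segs = [segmentation_net(r) for r in regions]
            outputs.append(merge_regions(key_segs, grid))
            continue
        out_segs = []
        for i, region in enumerate(regions):
            flow = flow_net(key_regions[i], region)
            score = decision_net(flow)  # expected confidence score
            if score >= threshold:
                # Fast path: warp the key frame's segmentation
                # forward along the estimated optical flow.
                out_segs.append(warp(key_segs[i], flow))
            else:
                # Slow path: re-segment this region and refresh
                # its key frame.
                seg = segmentation_net(region)
                key_regions[i], key_segs[i] = region, seg
                out_segs.append(seg)
        outputs.append(merge_regions(out_segs, grid))
    return outputs
```

    Raising the threshold pushes more regions through the segmentation network, trading fps for mIoU; that knob is what separates the 19.8 fps and 30.4 fps operating points quoted above. Refreshing key frames per region rather than per frame corresponds to the adaptive key frame scheduling and frame-region-based execution listed in Sections 3.2 and 3.3.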

    Chinese Abstract i
    Abstract ii
    Acknowledgements iii
    Contents iv
    List of Figures vi
    List of Tables vii
    List of Algorithms viii
    1 Introduction 1
    2 Background 7
    2.1 Image Semantic Segmentation 7
    2.2 Optical Flow 8
    2.3 Video Semantic Segmentation 8
    3 DVSNet 9
    3.1 Dynamic Video Segmentation Network 9
    3.2 Adaptive Key Frame Scheduling 11
    3.3 Frame Region Based Execution 12
    3.4 DVSNet Inference Algorithm 13
    3.5 DN and Its Training Methodology 14
    4 Experiments 16
    4.1 Experimental Setup 16
    4.2 Validation of DVSNet 18
    4.3 Validation of DVSNet's Adaptive Key Frame Scheduling Policy 19
    4.4 Computation Time Analysis 21
    4.5 Comparison of DN Configurations 21
    4.6 Impact of Frame Division Schemes 22
    4.7 Impact of Overlapped Regions on Accuracy 23
    4.8 Results 23
    5 Conclusion and Future Work 25
    5.1 Conclusion 25
    5.2 Future Work 25

