
Author: Ting, Sheng (丁盛)
Title: Predicting future video frames by including optical flow and semantic segmentation into deep neural network
Advisor: Lai, Shang-Hong (賴尚宏)
Committee Members: Liu, Tyng-Luh (劉庭祿); Lee, Che-Rung (李哲榮); Huang, Szu-Hao (黃思皓)
Degree: Master
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Academic Year of Graduation: 108
Language: English
Number of Pages: 41
Keywords: Video Prediction, Optical Flow, Semantic Segmentation, Deep Neural Network


Abstract:
Video frame prediction is a compelling research topic for learning video representations. For example, if a self-driving car can forecast road conditions ahead of time, it becomes safer and more trustworthy.
To predict the next video frame, a deep learning model must learn features that are also useful for other computer vision problems, such as object detection, motion prediction, and image segmentation. Unlike related work that focuses only on next-frame prediction, this thesis takes on the challenge of predicting farther into the future, that is, continuing past the next frame to predict the frames that follow.
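As a rough illustration of what predicting beyond the next frame involves, the sketch below rolls a generic next-frame predictor forward by feeding each prediction back into its input window; the model interface, shapes, and names are illustrative assumptions, not the thesis code.

    import torch

    def rollout(model, past, n_future):
        # past: (B, T, 3, H, W) window of observed frames.
        # model: any next-frame predictor mapping (B, T, 3, H, W) -> (B, 3, H, W).
        preds = []
        window = past
        for _ in range(n_future):
            nxt = model(window)  # predict one step ahead
            preds.append(nxt)
            # slide the window: drop the oldest frame, append the prediction
            window = torch.cat([window[:, 1:], nxt.unsqueeze(1)], dim=1)
        return preds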
To achieve this goal, we first propose a model that acquires future information, namely the optical flow and the semantic segmentation of the next frame.
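A minimal sketch of such a future-information generator, assuming a plain convolutional encoder-decoder with one output head per modality; the layer sizes, the window length n_past, and the class count n_classes are illustrative assumptions, not the architecture used in the thesis.

    import torch
    import torch.nn as nn

    class FutureInfoGenerator(nn.Module):
        def __init__(self, n_past=4, n_classes=19):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3 * n_past, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            )
            self.flow_head = nn.Conv2d(64, 2, 3, padding=1)         # (dx, dy) per pixel
            self.seg_head = nn.Conv2d(64, n_classes, 3, padding=1)  # class logits

        def forward(self, past_frames):
            # past_frames: (B, n_past, 3, H, W) -> stack frames along channels
            b, t, c, h, w = past_frames.shape
            feat = self.decoder(self.encoder(past_frames.reshape(b, t * c, h, w)))
            return self.flow_head(feat), self.seg_head(feat)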
We then introduce an image generator with a two-stage architecture that produces better future frames step by step.
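One way to read the two-stage idea, sketched under the assumption that the first stage backward-warps the last observed frame with the predicted flow and the second stage refines the warped result using the predicted segmentation; the split and the layer sizes are assumptions for illustration only.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def warp(frame, flow):
        # Backward-warp frame (B, 3, H, W) with pixel-space flow (B, 2, H, W).
        b, _, h, w = frame.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().to(frame.device)  # (2, H, W)
        coords = base.unsqueeze(0) + flow                             # (B, 2, H, W)
        # normalize to [-1, 1] as grid_sample expects (x first, then y)
        gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
        gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)                          # (B, H, W, 2)
        return F.grid_sample(frame, grid, align_corners=True)

    class RefineNet(nn.Module):
        # Stage two: refine the warped frame, conditioned on segmentation logits.
        def __init__(self, n_classes=19):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3 + n_classes, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
            )

        def forward(self, warped, seg_logits):
            return self.net(torch.cat([warped, seg_logits], dim=1))

Under this reading, a prediction would be obtained as refine(warp(last_frame, flow), seg), with the warp supplying coarse motion and the refinement recovering details the warp cannot, such as disoccluded regions.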
In addition, we bring a conditional GAN into our model, so that the generator has a more explicit target when generating each object in the future frame.
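A minimal conditional-GAN sketch: the discriminator scores the candidate future frame together with its conditioning input (here the last observed frame), so that "real" means "a plausible future for this particular past". The patch-style network and losses below follow the standard conditional-GAN recipe and are assumptions rather than the thesis design.

    import torch
    import torch.nn as nn

    class CondDiscriminator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(128, 1, 4, padding=1),  # patch-level real/fake logits
            )

        def forward(self, frame, condition):
            # score the frame jointly with its conditioning frame (3 + 3 channels)
            return self.net(torch.cat([frame, condition], dim=1))

    bce = nn.BCEWithLogitsLoss()

    def d_loss(disc, real, fake, cond):
        r, f = disc(real, cond), disc(fake.detach(), cond)
        return bce(r, torch.ones_like(r)) + bce(f, torch.zeros_like(f))

    def g_loss(disc, fake, cond):
        f = disc(fake, cond)
        return bce(f, torch.ones_like(f))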
Moreover, we demonstrate the effect of adding reversed training samples, which keeps the generator from being lazy about learning the correct features.
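The reversed-sample augmentation can be as simple as the following sketch; the 50% flip probability is an assumption, since the thesis may use a different sampling policy.

    import random
    import torch

    def augment_with_reversal(clip, p=0.5):
        # clip: (T, C, H, W) frame sequence; possibly return it time-reversed,
        # so the model also sees the sequence played backwards.
        if random.random() < p:
            return torch.flip(clip, dims=[0])
        return clip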
Finally, we evaluate our model on two well-known datasets, KITTI and Caltech Pedestrian, and compare our results with several other video prediction models. Our predicted frames are sharper and retain more object detail, even when predicting farther into the future.

Table of Contents:
1 Introduction
  1.1 Motivation
  1.2 Problem Description
  1.3 Contributions
  1.4 Thesis Organization
2 Related Work
  2.1 Optical Flow and Semantic Segmentation
  2.2 Generative Adversarial Network
  2.3 Video Prediction
3 Methods
  3.1 Future Information Generators
  3.2 Two-Stage Image Generator
  3.3 Conditional GAN
  3.4 Reversed Training Samples
  3.5 Loss Function
4 Experiments
  4.1 Settings
  4.2 KITTI Dataset
    4.2.1 Qualitative Results
    4.2.2 Quantitative Results
  4.3 Caltech Pedestrian Dataset
    4.3.1 Qualitative Results
    4.3.2 Quantitative Results
  4.4 Ablation Study
    4.4.1 Sharing Skip Features
    4.4.2 Two-Stage Architecture
    4.4.3 Reversed Training Samples
    4.4.4 Conditional GAN
    4.4.5 Swapping Two Stages
  4.5 Experiment with a Different Prediction Time Step
5 Conclusions
References

