
Author: 陳書屏 (Chen, Shu-Ping)
Title: 基於時間特徵融合之連續影像物件偵測
(Video Object Detection with Temporal Feature Fusion)
Advisor: 賴尚宏 (Lai, Shang-Hong)
Committee Members: 邱瀞德 (Chiu, Ching-Te); 許秋婷 (Hsu, Chiu-Ting)
Degree: Master
Department:
Year of Publication: 2018
Graduation Academic Year: 107
Language: English
Number of Pages: 36
Keywords (Chinese): 物件偵測、深度學習
Keywords (English): object detection, deep learning

    Object detection is a classical problem in computer vision, and it has improved substantially in recent years thanks to deep learning techniques. However, extending state-of-the-art static-image object detection to video is challenging: conventional detectors operate on one frame at a time, do not exploit the rich temporal information in video, and may fail on frames degraded in ways that rarely occur in static images.

    In this thesis, we propose a ConvNet architecture that exploits temporal information and is trained jointly, end to end, for video object detection. Our model uses optical flow to guide the fusion of features between the current frame and the previous frame. We also apply dense recursive aggregation to integrate features computed from past frames, so that historical temporal information is used effectively. Experiments on the ImageNet dataset and the ITRI dataset show that the proposed architecture achieves competitive detection accuracy without significant additional time cost.
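    The two mechanisms named above, flow-guided feature fusion and recursive aggregation, can be sketched concretely. The following PyTorch-style listing is an illustration only, not the thesis implementation: the function warp_features, the module FlowGuidedFusion, the small weight-prediction head, and all layer sizes are assumptions introduced here for clarity.

    # Hypothetical sketch of flow-guided feature fusion with recursive
    # aggregation (PyTorch). Not the thesis code; names and sizes assumed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def warp_features(feat_prev, flow):
        # Warp the previous frame's feature map to the current frame using
        # a dense optical flow field of shape (B, 2, H, W) in pixel units.
        _, _, h, w = feat_prev.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, device=feat_prev.device),
            torch.arange(w, device=feat_prev.device),
            indexing="ij")
        base = torch.stack((xs, ys), dim=0).float()   # (2, H, W), x first
        coords = base.unsqueeze(0) + flow             # (B, 2, H, W)
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
        gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
        return F.grid_sample(feat_prev, grid, align_corners=True)

    class FlowGuidedFusion(nn.Module):
        # Blends flow-warped memory features with current features using a
        # per-pixel weight map predicted from the flow itself.
        def __init__(self):
            super().__init__()
            self.weight_head = nn.Sequential(
                nn.Conv2d(2, 16, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(16, 1, kernel_size=3, padding=1),
                nn.Sigmoid())

        def forward(self, feat_cur, feat_memory, flow):
            warped = warp_features(feat_memory, flow)
            w = self.weight_head(flow)                # (B, 1, H, W), in [0, 1]
            # The return value becomes the next frame's memory, so the
            # aggregation over past frames is recursive.
            return w * warped + (1.0 - w) * feat_cur

    # Usage over a short clip: per-frame features and frame-to-frame flows.
    fusion = FlowGuidedFusion()
    feats = [torch.randn(1, 256, 32, 32) for _ in range(3)]
    flows = [torch.randn(1, 2, 32, 32) for _ in range(2)]
    memory = feats[0]
    for feat, flow in zip(feats[1:], flows):
        memory = fusion(feat, memory, flow)  # run the detector on `memory`

    Because each fused output is fed back in as the memory for the next frame, every step implicitly aggregates all earlier frames while the per-frame cost stays constant, which matches the abstract's claim of competitive accuracy without significant additional time cost.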

    Table of Contents
    摘要 (Abstract in Chinese)
    Abstract
    1 Introduction
      1.1 Motivation
      1.2 Proposed method
      1.3 Contribution
      1.4 Thesis Organization
    2 Related Work
      2.1 Object detection in static images
      2.2 Object detection in videos
    3 Method
      3.1 Overview
      3.2 Feature Pyramid Network
      3.3 Optical Flow Network
      3.4 Weight Map from Optical Flow
      3.5 Recursive Aggregation
    4 Experiments
      4.1 Datasets
        4.1.1 ImageNet Dataset
        4.1.2 ITRI Dataset
      4.2 Training and Testing
        4.2.1 Feature Pyramid Network
        4.2.2 FlowNetS
        4.2.3 Joint Training
      4.3 Results
        4.3.1 ImageNet dataset
        4.3.2 ITRI dataset
        4.3.3 Ablation Study
        4.3.4 Comparison with the state-of-the-art
        4.3.5 Demo Results
    5 Conclusion
    References

