
Author: 陳書屏 (Chen, Shu-Ping)
Title: 基於時間特徵融合之連續影像物件偵測
(Video Object Detection with Temporal Feature Fusion)
Advisor: 賴尚宏 (Lai, Shang-Hong)
Committee Members: 邱瀞德 (Chiu, Ching-Te); 許秋婷 (Hsu, Chiu-Ting)
Degree: Master
Department:
Year of Publication: 2018
Graduation Academic Year: 107
Language: English
Number of Pages: 36
Keywords (Chinese): 物件偵測、深度學習
Keywords (English): object detection, deep learning

    Object detection is a classical problem in computer vision, and it has improved substantially in recent years thanks to deep learning techniques. However, extending state-of-the-art static-image object detection to video is challenging: conventional detectors operate on one frame at a time, do not exploit the rich temporal information in video, and may fail on frames degraded in ways that rarely occur in static images.

    In this thesis, we propose a ConvNet architecture that exploits temporal information and is trained jointly, end to end, for video object detection. Our model uses optical flow to guide the fusion of features between the current frame and the previous frame. We also apply dense recursive aggregation to integrate features computed from past frames, so that historical temporal information is used effectively. Experiments on the ImageNet dataset and the ITRI dataset show that the proposed architecture achieves competitive detection accuracy without significant additional time cost.
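    The two mechanisms named above, flow-guided feature fusion and recursive aggregation, can be sketched concretely. The following PyTorch-style listing is an illustration only, not the thesis implementation: the function warp_features, the module FlowGuidedFusion, the small weight-prediction head, and all layer sizes are assumptions introduced here for clarity.

    # Hypothetical sketch of flow-guided feature fusion with recursive
    # aggregation (PyTorch). Not the thesis code; names and sizes assumed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def warp_features(feat_prev, flow):
        # Warp the previous frame's feature map to the current frame using
        # a dense optical flow field of shape (B, 2, H, W) in pixel units.
        _, _, h, w = feat_prev.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, device=feat_prev.device),
            torch.arange(w, device=feat_prev.device),
            indexing="ij")
        base = torch.stack((xs, ys), dim=0).float()   # (2, H, W), x first
        coords = base.unsqueeze(0) + flow             # (B, 2, H, W)
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
        gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
        return F.grid_sample(feat_prev, grid, align_corners=True)

    class FlowGuidedFusion(nn.Module):
        # Blends flow-warped memory features with current features using a
        # per-pixel weight map predicted from the flow itself.
        def __init__(self):
            super().__init__()
            self.weight_head = nn.Sequential(
                nn.Conv2d(2, 16, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(16, 1, kernel_size=3, padding=1),
                nn.Sigmoid())

        def forward(self, feat_cur, feat_memory, flow):
            warped = warp_features(feat_memory, flow)
            w = self.weight_head(flow)                # (B, 1, H, W), in [0, 1]
            # The return value becomes the next frame's memory, so the
            # aggregation over past frames is recursive.
            return w * warped + (1.0 - w) * feat_cur

    # Usage over a short clip: per-frame features and frame-to-frame flows.
    fusion = FlowGuidedFusion()
    feats = [torch.randn(1, 256, 32, 32) for _ in range(3)]
    flows = [torch.randn(1, 2, 32, 32) for _ in range(2)]
    memory = feats[0]
    for feat, flow in zip(feats[1:], flows):
        memory = fusion(feat, memory, flow)  # run the detector on `memory`

    Because each fused output is fed back in as the memory for the next frame, every step implicitly aggregates all earlier frames while the per-frame cost stays constant, which matches the abstract's claim of competitive accuracy without significant additional time cost.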

    Table of Contents
    摘要 (Abstract in Chinese)
    Abstract
    1 Introduction
      1.1 Motivation
      1.2 Proposed method
      1.3 Contribution
      1.4 Thesis Organization
    2 Related Work
      2.1 Object detection in static images
      2.2 Object detection in videos
    3 Method
      3.1 Overview
      3.2 Feature Pyramid Network
      3.3 Optical Flow Network
      3.4 Weight Map from Optical Flow
      3.5 Recursive Aggregation
    4 Experiments
      4.1 Datasets
        4.1.1 ImageNet Dataset
        4.1.2 ITRI Dataset
      4.2 Training and Testing
        4.2.1 Feature Pyramid Network
        4.2.2 FlowNetS
        4.2.3 Joint Training
      4.3 Results
        4.3.1 ImageNet dataset
        4.3.2 ITRI dataset
        4.3.3 Ablation Study
        4.3.4 Comparison with the state-of-the-art
        4.3.5 Demo Results
    5 Conclusion
    References

