
Graduate Student: CHUANG, TA-CHING (莊大慶)
Thesis Title: Real-time Object Detection via Green Channel Enhanced Convolution and Category-based Bounding Box with Loss Regression
(Chinese title: 基於類別的物件框定位法及強化小物件偵測與綠通道卷積網路之即時物件偵測系統)
Advisor: Chiu, Ching-Te (邱瀞德)
Committee Members: Soo, Von-Wun (蘇豐文); Chang, Long-Wen (張隆紋)
Degree: Master
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Graduation Academic Year: 108 (ROC calendar)
Language: English
Number of Pages: 54
Keywords (Chinese): Object Detection, Convolutional Neural Network, Real-time, Green Channel Enhanced Convolutional Network, Category-based Bounding Box Localization Strategy, Class Imbalance, Box Weighting
Keywords (English): Green Channel Enhanced Convolution, Category-based Bounding Box Localization Strategy, Small Object Detection, Category-balanced Loss Function, Box Weighting
  • This thesis studies a real-time object detection system. Since real-time operation requires a detection speed of at least 30 fps, we take YOLO v2 as our base, first reproduce it with the TensorFlow framework, and then propose several novel improvements to raise both its accuracy and its detection speed.
    For the convolutional neural network (CNN) architecture, our proposed methods include Green Channel Enhanced Convolution and the Category-based Bounding Box Localization Strategy, which improve prediction accuracy. To strengthen the system's detection of small objects, we also add a mechanism during training that handles the problem of small objects never being counted in the loss. Beyond that, we observe that in the widely used datasets the number of ground-truth boxes differs greatly from category to category, which causes a severe class-imbalance problem, so we revise the original YOLO v2 loss function to avoid one-sided classification. Finally, for post-processing, instead of the commonly used non-maximum suppression (NMS), we propose a method that merges bounding boxes by weights, raising the overlap, measured as intersection over union (IoU), between predicted and ground-truth boxes.
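    The abstract names Box Weighting but does not give its formula, so the sketch below is one plausible reading: overlapping predictions are merged by confidence-weighted averaging of their corner coordinates instead of being discarded as in NMS. The iou_threshold value and the score-based weights are illustrative assumptions, not the thesis's published rule.

        import numpy as np

        def iou(a, b):
            """Intersection over union of two boxes in (x1, y1, x2, y2) form."""
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_a = (a[2] - a[0]) * (a[3] - a[1])
            area_b = (b[2] - b[0]) * (b[3] - b[1])
            return inter / (area_a + area_b - inter + 1e-9)

        def weighted_box_merge(boxes, scores, iou_threshold=0.5):
            """Merge overlapping boxes by confidence-weighted averaging.
            NMS would keep only the top-scoring box of each group; here the
            whole group votes on the final coordinates (sketch only)."""
            order = np.argsort(scores)[::-1]
            boxes = np.asarray(boxes, dtype=float)[order]
            scores = np.asarray(scores, dtype=float)[order]
            merged, used = [], np.zeros(len(boxes), dtype=bool)
            for i in range(len(boxes)):
                if used[i]:
                    continue
                group = [j for j in range(i, len(boxes))
                         if not used[j] and iou(boxes[i], boxes[j]) >= iou_threshold]
                used[group] = True
                w = scores[group][:, None]  # one confidence weight per box
                merged.append((boxes[group] * w).sum(axis=0) / w.sum())
            return np.array(merged)

        # Two detections of one object merge into a single box whose corners
        # lean toward the higher-confidence prediction.
        print(weighted_box_merge([[10, 10, 50, 50], [12, 14, 56, 54]], [0.9, 0.6]))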
    Finally, we run experiments on several major, widely used object detection datasets. The results show that with our proposed methods, on the PASCAL VOC 2007 dataset the detection speed sustains the 30 fps real-time standard while accuracy reaches 77.553% mAP (input image size: 416×416), a gain of 0.753% mAP over our baseline YOLO v2 (input image size: 416×416) and 3.253% mAP over SSD 300, another real-time detector; on the PASCAL VOC 2012 dataset the system likewise reaches 75.7% mAP. Among all CNN-based real-time object detection systems we have seen, ours has the highest accuracy.


    In this paper, we introduce a real-time object detection system. We adopt YOLO v2 as our baseline model. To let the system run on multiple OS environments, we first reproduced the model with the TensorFlow framework. We then propose several novel methods to improve the detection accuracy and speed of the system.
    Different from YOLO v2, we propose two improvements to the convolutional neural network (CNN) model to achieve higher accuracy: Green Channel Enhanced Convolution and the Category-based Bounding Box Localization Strategy. To perform better on small objects, we also provide a training strategy that prevents small objects from never being matched by a predicted box. Furthermore, most of the widely used standard detection datasets have a serious class-imbalance problem, so we add category weights to the original loss function of YOLO v2. Lastly, distinct from the well-known non-maximum suppression (NMS) adopted in most of the latest state-of-the-art methods, we propose Box Weighting, which enlarges the intersection over union (IoU) between the predicted box and the ground-truth box.
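    As a concrete illustration of such category weighting, the sketch below assigns each class a loss weight inversely proportional to its ground-truth box count, normalized so the mean weight is 1. The inverse-frequency formula and the example counts are assumptions made for illustration; the abstract does not state the thesis's exact scheme.

        import numpy as np

        def category_weights(box_counts):
            """Per-class loss weights from ground-truth box counts: rarer
            classes get larger weights (an assumed inverse-frequency scheme,
            normalized so the average weight stays 1)."""
            inv = 1.0 / np.asarray(box_counts, dtype=np.float64)
            return inv * len(inv) / inv.sum()

        # Hypothetical counts for three categories; in PASCAL VOC, "person"
        # boxes vastly outnumber those of most other classes.
        w = category_weights([15576, 1141, 484])
        print(w)  # the common class gets weight < 1, rare classes > 1

        # Each object's classification loss would then be scaled by its
        # class weight, e.g. loss = w[class_id] * cross_entropy(pred, target).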
    Finally, the experimental results on PASCAL VOC 2007 show that our method (input image size: 416×416) reaches 77.553% mAP at real-time speed, 0.753% mAP better than the baseline model, YOLO v2. Besides, the mAP of our proposed architecture is 3.253% higher than that of SSD. For PASCAL VOC 2012, we also surpass most object detectors. Among all deep-learning-based real-time object detectors we are aware of, ours achieves the highest accuracy.
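    For the small-object training strategy mentioned above, the abstract only states that small objects can end up excluded from the loss. One common remedy, sketched here purely as an assumption rather than the thesis's actual mechanism, is to force-assign every ground-truth box to its best-overlapping predictor even when all IoUs fall below the usual matching threshold.

        import numpy as np

        def match_ground_truth(gt_boxes, pred_boxes, iou_fn, threshold=0.5):
            """Assign each ground-truth box to one predicted box. A purely
            threshold-based matcher drops ground truths whose best IoU is
            below `threshold`, which is exactly what tends to happen to
            small objects; forcing the best match keeps them in the loss."""
            assignments = {}
            for g, gt in enumerate(gt_boxes):
                ious = np.array([iou_fn(gt, p) for p in pred_boxes])
                assignments[g] = int(ious.argmax())  # forced, never skipped
            return assignments

    Called with the iou helper from the Box Weighting sketch above, e.g. match_ground_truth(gts, preds, iou), every ground truth, however small, is paired with a predictor against which its localization and classification losses can be computed.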

    Chapter 1. Introduction
      1.1 Background
      1.2 Motivation
      1.3 Contributions
      1.4 Thesis Organization
    Chapter 2. Related Works
      2.1 Two-stage Object Detection
        2.1.1 R-CNN
        2.1.2 Fast R-CNN
        2.1.3 Faster R-CNN
      2.2 One-stage Object Detection
        2.2.1 YOLO
        2.2.2 YOLO v2
        2.2.3 SSD
    Chapter 3. Proposed Method
      3.1 System Architecture
      3.2 Green Channel Enhanced Convolution
      3.3 Category-based Bounding Box Localization Strategy
      3.4 Small Object Detection Training Strategy
      3.5 Loss Function
      3.6 Box Weighting
    Chapter 4. Experimental Results
      4.1 Datasets
      4.2 Data Augmentation and Hyperparameter Settings
      4.3 System Environment
      4.4 Ablation Study
        4.4.1 Green Channel Enhanced Convolution
        4.4.2 Small Object Detection Strategy
        4.4.3 Category-balanced Loss Function
        4.4.4 Box Weighting
        4.4.5 The Path to the Best Solution
      4.5 Detection on PASCAL VOC 2007 Dataset
      4.6 Detection on PASCAL VOC 2012 Dataset
    Chapter 5. Conclusion
    Chapter 6. Bibliography

