
Graduate Student: 林哲聰 (Lin, Che-Tsung)
Thesis Title: 車輛偵測:從偵測模型設計到跨領域適應 (Vehicle Detection: From Detector Design to Cross-Domain Adaptation)
Advisor: 賴尚宏 (Lai, Shang-Hong)
Committee Members: 邱瀞德 (Chiu, Ching-Te), 陳煥宗 (Chen, Hwann-Tzong), 李哲榮 (Lee, Che-Rung), 杭學鳴 (Hang, Hsueh-Ming), 邱維辰 (Chiu, Wei-Chen), 劉庭祿 (Liu, Tyng-Luh), 王鈺強 (Wang, Yu-Chiang)
Degree: Doctoral
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Academic Year of Graduation: 108 (2019-2020)
Language: English
Number of Pages: 84
Keywords (Chinese): 深度學習, 車輛偵測, 跨領域適應, 生成式對抗網路
Keywords (English): deep learning, vehicle detection, cross-domain adaptation, generative adversarial network
    Vehicle detection has long been one of the most essential functions in advanced driver assistance systems and autonomous vehicles. Extensive prior research has shown that state-of-the-art deep learning models achieve excellent performance on various public object detection datasets. However, these models are mostly two-stage designs, which demand substantial computing resources and are difficult to run in real time on embedded systems. In this thesis, we propose a single-stage model that achieves real-time vehicle detection on the NVIDIA DrivePX2 embedded platform. In addition, we propose a multi-stage, image-based hard example mining training strategy: the detector is first trained on the full dataset until the improvement in accuracy converges, and is then fine-tuned with hard examples and with training samples whose IOU is slightly insufficient, which further improves detection accuracy.

    We expect a vehicle detector to produce accurate results in both daytime and nighttime scenarios, yet a vehicle's appearance differs drastically between day and night. Data augmentation is common practice when training deep-learning-based object detectors, and it can be used to improve a model's robustness and cross-domain adaptability. Previous augmentation methods typically consist of general image-processing operations, and the images they produce offer limited diversity. In recent years, generative adversarial networks have been shown to generate diverse images; however, earlier models often fail to keep object structure consistent before and after translation. This thesis therefore proposes AugGAN, a generative adversarial network that achieves better object preservation in translations with drastically different image styles, such as day to night. However, given one daytime image, the nighttime style AugGAN produces is fixed. We therefore further propose a multimodal version of AugGAN, which can translate a single daytime image into multiple nighttime images with different ambient light levels and different degrees of rear-lamp brightness, while keeping the vehicle types, colors, and locations consistent with the source image.


    Vehicle detection is a fundamental function required for advanced driver assistance systems and autonomous vehicles. Extensive research has shown that various state-of-the-art approaches, especially deep learning methods, obtain good performance on public datasets. However, those methods are mostly two-stage approaches that require extensive computing resources and are hard to deploy on an embedded computing platform with real-time performance. In this thesis, we introduce a single-stage vehicle detector that runs in real time on the NVIDIA DrivePX2 platform, and we propose a multi-stage image-based online hard example mining framework that fine-tunes the detector on hard examples and on true positives with slightly insufficient IOU.
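
    To make the fine-tuning stage concrete, the sketch below shows one way the image-based hard example selection could be written in PyTorch-style Python. It is a minimal illustration rather than the thesis's actual implementation: the detector and loader interfaces, the box_iou helper, and the thresholds score_thresh, iou_fp and iou_tp are assumptions introduced only for this example.

```python
import torch

def box_iou(a, b):
    """IOU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_hard_images(detector, loader, score_thresh=0.5, iou_fp=0.3, iou_tp=0.7):
    """After training on the full dataset has converged, scan it again and keep
    whole images that still contain hard examples: confident detections with
    almost no ground-truth overlap (hard false positives), or true positives
    whose IOU is only slightly insufficient.  The returned image ids form the
    pool used by the later fine-tuning stage."""
    hard_image_ids = []
    detector.eval()
    with torch.no_grad():
        for image_id, image, gt_boxes in loader:        # assumed loader format
            for box, score in detector(image):          # assumed detector interface
                if score < score_thresh:
                    continue
                best_iou = max((box_iou(box, g) for g in gt_boxes), default=0.0)
                if best_iou < iou_fp:
                    # confident detection with almost no overlap: hard false positive
                    hard_image_ids.append(image_id)
                    break
                if best_iou < iou_tp:
                    # overlaps a ground truth, but the IOU is slightly insufficient
                    hard_image_ids.append(image_id)
                    break
    return hard_image_ids
```

    Once such a pool is selected, a further fine-tuning pass would be run only on those images, which is the multi-stage aspect of the strategy.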

    One expects vehicles around the host vehicle to be detected as accurately as possible around the clock, both day and night. However, a vehicle's appearance in the daytime is quite different from its appearance at nighttime. Data augmentation plays a crucial role in training a CNN-based detector and enhancing its cross-domain robustness. Most previous augmentation approaches are based on a combination of general image-processing operations and can only produce limited plausible image variations. Recently, GAN (generative adversarial network) based methods have shown compelling visual results; however, they are prone to failing to preserve image objects and translation consistency when faced with large and complex domain shifts, such as day-to-night. We propose AugGAN, a GAN-based data augmenter that transforms on-road driving images to a desired domain while preserving image objects well. Although this model transforms on-road images from daytime to nighttime with better object preservation, all transformed vehicles appear under the same ambient light level. Therefore, we further propose Multimodal AugGAN, a multimodal structure-consistent GAN capable of transforming daytime vehicles into their nighttime counterparts with different ambient light levels and rear lamp conditions (on/off) while keeping the same vehicle types, colors, and locations.
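
    As a rough illustration of the structure-consistency idea behind AugGAN, the following sketch combines a least-squares adversarial loss, a CycleGAN-style cycle-consistency loss, and a segmentation subtask that shares the generator's encoder. The generator and discriminator names (G_d2n, D_night, etc.), the assumption that each generator also returns segmentation logits, and the loss weights lambda_cyc and lambda_seg are placeholders; the exact loss composition in the thesis may differ.

```python
import torch
import torch.nn.functional as F

def auggan_generator_loss(real_day, real_night, seg_label_day, seg_label_night,
                          G_d2n, G_n2d, D_night, D_day,
                          lambda_cyc=10.0, lambda_seg=1.0):
    """One generator-side objective for a structure-consistent day/night GAN.
    Each generator is assumed to return (translated_image, seg_logits), where
    the segmentation head shares the generator's encoder."""
    fake_night, seg_day_pred = G_d2n(real_day)        # day   -> night
    fake_day, seg_night_pred = G_n2d(real_night)      # night -> day
    rec_day, _ = G_n2d(fake_night)                    # day   -> night -> day
    rec_night, _ = G_d2n(fake_day)                    # night -> day   -> night

    # Least-squares adversarial loss: fool the target-domain discriminators.
    p_fn = D_night(fake_night)
    p_fd = D_day(fake_day)
    adv = F.mse_loss(p_fn, torch.ones_like(p_fn)) + F.mse_loss(p_fd, torch.ones_like(p_fd))

    # Cycle consistency: translating to the other domain and back should
    # reconstruct the original image.
    cyc = F.l1_loss(rec_day, real_day) + F.l1_loss(rec_night, real_night)

    # Segmentation subtask: the shared encoder must still predict the source
    # image's semantic layout, which keeps objects from being distorted.
    seg = F.cross_entropy(seg_day_pred, seg_label_day) + \
          F.cross_entropy(seg_night_pred, seg_label_night)

    return adv + lambda_cyc * cyc + lambda_seg * seg
```

    In the multimodal variant, the day-to-night generator would additionally be conditioned on a sampled latent code so that one daytime image can map to several nighttime renderings with different ambient light levels; that conditioning is omitted from this sketch.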

Table of Contents:
1 Introduction
  1.1 Motivation
2 Related Work
  2.1 Object Detection
  2.2 Domain Adaptation
3 Proposed Object Detection Model
  3.1 Data Augmentation
  3.2 Multi-Scale Feature Maps and Bounding Box Priors
  3.3 Non-Maximal Suppression
  3.4 Multi-Stage Image-based Online Hard Example Mining
  3.5 Loss Functions of Training and Fine-tuning
4 Proposed GAN Models
  4.1 AugGAN
    4.1.1 Structure-Aware Encoding and Segmentation Subtask
    4.1.2 Adversarial Learning
    4.1.3 Weight-Sharing for Multi-Task Network
    4.1.4 Cycle Consistency
    4.1.5 Network Learning
  4.2 Multimodal AugGAN
    4.2.1 Adversarial Learning
    4.2.2 Image-Translation-Structure Consistency
    4.2.3 Cycle-Structure Consistency
    4.2.4 Network Learning
5 Experimental Results
  5.1 Single-Stage Vehicle Detector
    5.1.1 PASCAL VOC Dataset
    5.1.2 KITTI Dataset
    5.1.3 CarSim-Generated Data
    5.1.4 iROADS Dataset
    5.1.5 Recall Analysis of the Proposed Data Augmentation Strategy
    5.1.6 The Benefit of Multi-Scale Feature and Bounding Box Prior
    5.1.7 IOU Fine-Tuning Analysis
  5.2 AugGAN
    5.2.1 Synthetic Datasets
    5.2.2 KITTI and ITRI-Night Datasets
    5.2.3 ITRI Daytime and Nighttime Datasets
    5.2.4 On-Road Nighttime Vehicle Detection Result Analysis
    5.2.5 Training Detectors with Real Night Images & AugGAN-Generated Night Images
    5.2.6 Transformations other than Daytime & Nighttime
    5.2.7 Loss Function Analysis
    5.2.8 Subjective Evaluation of AugGAN
  5.3 Multimodal AugGAN
    5.3.1 Synthetic Datasets
    5.3.2 BDD100k Dataset
    5.3.3 Training Detectors with Real Night Images & Multimodal AugGAN-Generated Night Images
    5.3.4 Image Quality and Diversity Evaluation of Multimodal AugGAN
    5.3.5 Detector Training Strategy Discussion Given Both Unimodal & Multimodal Data
    5.3.6 Transformations other than Daytime & Nighttime
    5.3.7 Semantic Segmentation Across Domains
6 Conclusions and Future Work
References
