| Field | Value |
|---|---|
| Author | 盧冠維 Lu, Kuan-Wei |
| Thesis Title | UnGap Real-time Detectors by Orchestrating the Self-Training of Teacher and Students (實時偵測器之在不同場域的效果差距) |
| Advisor | 孫民 Sun, Min |
| Committee Members | 邱維辰 Chiu, Wei-Chen; 林彥宇 Lin, Yen-Yu |
| Degree | Master |
| Department | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2022 |
| Graduation Academic Year | 110 |
| Language | English |
| Pages | 29 |
| Keywords | Object detection, unsupervised domain adaptation, semi-supervised learning |
State-of-the-art object detectors encounter various domain gaps in real-world video streams, leading to significant performance degradation in surveillance systems. The degradation becomes even more severe because large models with better generalizability are not applicable to most surveillance applications, which require real-time performance on resource-limited edge devices. In this work, we propose a novel training framework to ungap (i.e., close the domain gaps of) a set of real-time object detectors without additional human supervision. Our framework orchestrates the self-training of one large teacher model and the transfer of its knowledge to a set of small student models (one per target domain). First, we close the domain gaps by refining the teacher via self-training on a set of highly confident pseudo labels drawn from all target domains. Next, given the refined teacher with better generalizability, we transfer its knowledge to each student using the improved, high-confidence pseudo labels the teacher produces in that student's target domain. To evaluate performance, we use an in-house indoor human detection dataset collected from nursing homes, whose unique bird's-eye viewing angle covers more activities. We further construct another scenario, street-view vehicle detection, from the public CityCam dataset. Experimental results show that our proposed framework is effective: it relatively improves the baseline by 120.6% and 44% in mean average precision (mAP) for indoor human detection and street-view vehicle detection, respectively.
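To make the two-stage procedure concrete, below is a minimal, framework-agnostic Python sketch of the orchestration described in the abstract. Everything here is a hypothetical illustration rather than the thesis's actual interface: the `Detection`/`Detector` types, the `pseudo_label` and `ungap` helpers, the 0.8 confidence cutoff, and the `train_teacher`/`train_student` callbacks are all assumptions introduced for exposition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class Detection:
    """One detected object; a hypothetical stand-in for a detector's output."""
    box: tuple          # (x1, y1, x2, y2) in pixels
    label: int          # class index
    confidence: float   # detector score in [0, 1]

# An object detector is modeled as a callable: image -> detections.
Detector = Callable[[object], List[Detection]]

# Assumed cutoff; the abstract only says "highly confident" pseudo labels.
CONF_THRESHOLD = 0.8

def pseudo_label(teacher: Detector, images: Sequence[object]) -> List[tuple]:
    """Keep only the teacher's high-confidence detections as pseudo labels."""
    labeled = []
    for img in images:
        dets = [d for d in teacher(img) if d.confidence >= CONF_THRESHOLD]
        if dets:  # skip frames with no confident detections
            labeled.append((img, dets))
    return labeled

def ungap(teacher: Detector,
          train_teacher: Callable[[List[tuple]], None],
          students: Dict[str, Detector],
          train_student: Callable[[Detector, List[tuple]], None],
          domains: Dict[str, Sequence[object]]) -> None:
    # Stage 1: refine the one large teacher by self-training on
    # high-confidence pseudo labels pooled from ALL target domains.
    pooled = []
    for images in domains.values():
        pooled += pseudo_label(teacher, images)
    train_teacher(pooled)

    # Stage 2: transfer the refined teacher's knowledge to one small
    # student per target domain, using that domain's improved pseudo labels.
    for name, images in domains.items():
        train_student(students[name], pseudo_label(teacher, images))
```

The ordering is the point of the sketch: the teacher first self-trains on pseudo labels pooled across all target domains, so that when stage 2 runs, each student is distilled from pseudo labels produced by a teacher that already generalizes better to that domain. Note also that the reported 120.6% and 44% gains read as relative improvements, i.e., (mAP_ours − mAP_baseline) / mAP_baseline, rather than absolute mAP differences.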