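Author: 許煜正
Xu, Yu-zheng
Thesis title: 改進從單張影像進行人群計數之深度學習方法
An Improved Deep Learning Approach to Crowd Counting from an Image
Advisor: 賴尚宏
Lai, Shang-Hong
Committee members: 李哲榮
Lee, Che-Rung
黃思皓
Huang, Szu-Hao
劉庭祿
Liu, Tyng-Luh
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2019
Academic year of graduation: 108 (ROC calendar)
Language: English
Number of pages: 32
Keywords (Chinese): 人群計數 (crowd counting)
Keywords (English): Crowd Counting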
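  • [Chinese abstract, translated] In today's society, security in public spaces receives great attention, and a range of emerging technologies has made monitoring and control more effective. Crowd counting is one important such technology: it helps estimate how densely crowded a scene currently is, so that dangerous incidents can be prevented.
    This thesis uses a model based on a fully convolutional network for crowd counting and explores how the size of the output map affects model performance. We try a cropping approach to increase the number of images per training batch and improve the annotation-based heatmap generation method, finally achieving state-of-the-art results on datasets including ShanghaiTech Part A and UCF-QNRF.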


    In modern society, security in public scenes has attracted great interest. With rapid advances in computer vision and deep learning, video surveillance technology has made significant progress. Crowd counting, which estimates the total number of people in a specific scene from a single image, is one of the important problems in this area and supports better security control.
    This thesis uses a simple and efficient model based on a fully convolutional network and explores the effect of the output size on performance. We enlarge the batch size by cropping the images and improve the heatmap generation method. Finally, we achieve state-of-the-art results on ShanghaiTech Part A and UCF-QNRF.
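    The two data-preparation ideas named in the abstract, turning head annotations into a heatmap and cropping an image to obtain more samples per batch, can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration: the function names `density_map` and `crop_quadrants` are hypothetical, the fixed-width Gaussian stands in for the thesis's own heatmap generation method (which this record does not describe), and the four-quadrant split is only one possible cropping scheme.

```python
import numpy as np


def density_map(points, height, width, sigma=4.0):
    """Build a heatmap with one unit-mass Gaussian per annotated head (x, y).

    Assumption: a fixed sigma is used for every person; this is a common
    baseline, not the kernel proposed in the thesis.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float64)
    for x, y in points:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        total = g.sum()
        if total > 0:
            # Normalise so that each person contributes exactly 1 to the sum.
            heatmap += g / total
    return heatmap


def crop_quadrants(image, heatmap):
    """Split an image and its heatmap into four equal crops on a batch axis.

    Assumption: a plain 2x2 split; odd trailing rows or columns are dropped.
    """
    h, w = heatmap.shape
    h2, w2 = h // 2, w // 2
    corners = [(0, 0), (0, w2), (h2, 0), (h2, w2)]
    images = np.stack([image[r:r + h2, c:c + w2] for r, c in corners])
    heatmaps = np.stack([heatmap[r:r + h2, c:c + w2] for r, c in corners])
    return images, heatmaps


# Hypothetical toy input: three annotated heads in a blank 64x64 RGB image.
points = [(10.0, 12.0), (40.0, 20.0), (33.0, 50.0)]
image = np.zeros((64, 64, 3), dtype=np.float32)

heatmap = density_map(points, 64, 64)
images, heatmaps = crop_quadrants(image, heatmap)

print(images.shape, heatmaps.shape)      # batch of four crops
print(heatmap.sum(), heatmaps.sum())     # estimated counts before and after cropping
```

    Because each Gaussian is normalised to unit mass, the sum of the heatmap equals the number of annotated people, and for even image dimensions the four crops together preserve that count while yielding four training samples from one image.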

    1 Introduction
      1.1 Motivation
      1.2 Problem Description
      1.3 Contributions
      1.4 Thesis Organization
    2 Related Work
      2.1 Detection based approach
      2.2 Density estimation based approach
        2.2.1 Heatmap Generation
        2.2.2 Regression Approach
      2.3 Datasets for Crowd Counting
        2.3.1 Shanghaitech Dataset
        2.3.2 UCF-QNRF Dataset
    3 Proposed Method
      3.1 Hierarchical geometry-adaptive kernel
      3.2 Network
      3.3 Scale Factor
      3.4 Large Batch Size
    4 Experimental Results
      4.1 Training Setting
      4.2 Data Augmentation
      4.3 Loss Function
      4.4 Evaluation Method in Crowd Counting Task
        4.4.1 Dataset Splitting
        4.4.2 Evaluation Metrics
      4.5 Experiment on Shanghaitech A
      4.6 Ablation Study on Shanghaitech A
      4.7 Experiment on UCF-QNRF
      4.8 Case Analysis
        4.8.1 Cases for Relative Error
        4.8.2 Cases for Absolute Error
    5 Conclusions
    References

