
Graduate Student: 褚世杰 (Chu, Shi-Jie)
Thesis Title: 基於 RT-DETR 法則的密集行人多目標檢測與跟蹤輕量分析
Study of Light-weight Multi-object Detection and Tracking on Dense Pedestrians Based on the RT-DETR Algorithm
Advisor: 王培仁 (Wang, Pei-Jen)
Committee Members: 劉晉良 (Liu, Jinn-Liang); 王俊傑 (Wang, Chun-Chieh)
Degree: Master
Department: Department of Power Mechanical Engineering, College of Engineering (工學院 - 動力機械工程學系)
Year of Publication: 2024
Graduation Academic Year: 113
Language: Chinese
Number of Pages: 129
Chinese Keywords: 目標檢測、目標跟蹤、輕量化
Foreign Keywords: Object Detection, Object Tracking, Lightweight
    密集行人檢測及跟蹤是實現自動駕駛、城市智能交通、犯罪偵破、智能安防監控、無人機跟拍及視頻錄製中聲音跟蹤等應用的關鍵技術,故而系統輕量化及實時性更為重要。本論文提出一種基於Transformer架構的RT-DETR輕量化計算模型;相較於YOLO系列,RT-DETR具備全局視野,且在視野受限、遮擋及背景雜亂環境下具有優勢,缺點是參數量大、計算量高及即時性不足。本論文針對輕量化和速度優化進行研究,期以降低計算資源及部署成本,並提升運算效率。
    首先在RT-DETR模型輕量化及推理速度優化上,本文基於CrowdHuman數據集,將原ResNet18主幹更換為YOLOv9的GELAN主幹,並用其RepNCSPELAN4算子優化CCFF模組。為進一步提升性能,採用FasterNet的PConv、Faster Block和TransNeXt的CGLU,重新設計RepNCSPELAN4結構,進而借助WaveletPool改進CCFF的上採樣及下採樣,並用可學習的位置編碼(LPE)優化AIFI的編碼生成。成果是參數量減少63.3%,GFLOPs降低64.2%,推理時間縮短22.1%,而兩項精度指標僅分別下降1.8與1.4,成功平衡精度、效率和速度。
    為驗證RT-DETR(Our)的提升效果,在CrowdHuman數據集上進行四組性能對比實驗,結論如下:1)RT-DETR(ResNet18)優於YOLOv8m和YOLOv9m,與YOLOv10m相當;2)RT-DETR(Our)在精度、輕量化和即時性之間平衡良好,具備高實用性;3)RT-DETR(Our)以低參數量和計算成本達到與中等規模模型相近的精度及高效幀率,適合高精度和即時性應用;4)相比小規模YOLO模型,RT-DETR(Our)在參數量、計算量、精度和後處理時間上更優,僅實時性稍低。綜上所述,RT-DETR(Our)能成功兼顧精度、輕量化及即時性,且性能優勢顯著。
    前述實驗均基於CrowdHuman數據集。為驗證RT-DETR(Our)並非僅在特定場景或數據分佈下表現出色,而是在其他數據集上亦具一致性能,本文在WiderPerson數據集上重複上述實驗。結果顯示RT-DETR(Our)在不同數據集上具備高一致性和良好泛化能力。
    為進一步評估RT-DETR(Our)在目標跟蹤中的實際應用效果,本文在MOT16和MOT17數據集上進行實驗,並以MOT指標評估。結果顯示RT-DETR(Our)具備良好泛化能力,搭配BoT-SORT時的總體性能超越YOLOv8s,僅在即時性上略遜,並與YOLOv8m和RT-DETR(ResNet18)表現相近。此外,實驗結果顯示在BoT-SORT中加入行人重識別(ReID)成本過高,收益有限。


    Dense pedestrian detection and tracking are key technologies for autonomous driving, intelligent urban transportation, crime investigation, security monitoring, drone filming, and audio tracking in video recording, so lightweight design and real-time performance are critical for deployment. This thesis studies a lightweight RT-DETR model based on the Transformer architecture and compares it with YOLO models. Compared with YOLO, RT-DETR offers a global receptive field and handles occlusion, cluttered backgrounds, and a limited field of view better, but it suffers from a large parameter count, heavy computation, and slower inference. The main objective is therefore to optimize RT-DETR for lightweight, fast operation, reducing computational resources and deployment costs while improving efficiency.
    To reduce model size and improve inference speed on the CrowdHuman dataset, the original ResNet18 backbone was replaced with the GELAN backbone from YOLOv9, and the CCFF module was optimized with GELAN's RepNCSPELAN4 operator. To further improve performance, the RepNCSPELAN4 structure was redesigned with PConv and the Faster Block from FasterNet together with the CGLU from TransNeXt, WaveletPool was used to enhance upsampling and downsampling in the CCFF, and learnable position encoding (LPE) was used to improve the position encoding in AIFI. The result is a 63.3% reduction in parameters, a 64.2% reduction in GFLOPs, and a 22.1% reduction in inference time, while the two accuracy metrics drop by only 1.8 and 1.4 points, a good balance among accuracy, efficiency, and speed.
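    As a rough illustration of the partial convolution (PConv) and Faster Block ideas borrowed from FasterNet, a minimal PyTorch sketch is given below; the module layout, channel split ratio (n_div=4), and layer choices are illustrative assumptions rather than the exact configuration used in the thesis:

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: apply a 3x3 conv to only a fraction of the
    channels and pass the remaining channels through untouched."""
    def __init__(self, dim: int, n_div: int = 4):
        super().__init__()
        self.dim_conv = dim // n_div           # channels that get convolved
        self.dim_untouched = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, 1, 1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

class FasterBlock(nn.Module):
    """PConv followed by a 1x1 expand / 1x1 project MLP with a residual,
    roughly following the FasterNet block layout."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.pconv = PConv(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, dim, 1, bias=False),
        )

    def forward(self, x):
        return x + self.mlp(self.pconv(x))

if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)   # e.g. one CCFF feature map
    print(FasterBlock(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```

    Because only a fraction of the channels pass through the 3x3 convolution while the rest are forwarded unchanged, such a block needs fewer parameters and FLOPs than a full convolution of the same width, which is the kind of saving the redesigned RepNCSPELAN4 relies on.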
    To verify the improvements of RT-DETR (Our), we conducted four performance comparison experiments on the CrowdHuman dataset, reaching the following conclusions: 1) RT-DETR (ResNet18) outperformed YOLOv8m and YOLOv9m and was comparable to YOLOv10m; 2) RT-DETR (Our) achieved an optimal balance between accuracy, lightweight design, and real-time performance, making it highly practical; 3) RT-DETR (Our) achieved accuracy close to medium-sized models with low parameters and computational costs while maintaining a high frame rate, making it suitable for applications requiring high accuracy and real-time processing; 4) Compared to small YOLO models, RT-DETR (Our) demonstrated advantages in parameter count, computation, accuracy, and post-processing time, with only slightly lower real-time performance. Overall, RT-DETR (Our) successfully balances accuracy, lightweight design, and real-time processing, showing significant performance advantages.
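    Comparisons of parameter count and real-time performance like those above can be reproduced with a simple measurement script. The sketch below is a generic PyTorch benchmark; the placeholder model (a torchvision ResNet18 standing in for a detector), the 640x640 input size, and the warm-up/run counts are assumptions for illustration only:

```python
import time
import torch
import torchvision

def count_parameters(model: torch.nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def measure_latency(model, input_size=(1, 3, 640, 640), warmup=10, runs=100):
    """Average forward-pass time in milliseconds on the current device."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):          # warm-up iterations (not timed)
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0

if __name__ == "__main__":
    # Placeholder network; substitute the detector under test.
    model = torchvision.models.resnet18(weights=None)
    print(f"Params: {count_parameters(model) / 1e6:.2f} M")
    ms = measure_latency(model)
    print(f"Latency: {ms:.2f} ms  ({1000.0 / ms:.1f} FPS)")
```

    Measured FPS depends heavily on hardware, batch size, and whether post-processing (e.g., NMS) is included, so such comparisons are only meaningful when all models are benchmarked under identical conditions.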
    The previous experimental results are based on the CrowdHuman dataset. To verify that RT-DETR (Our) maintains consistent performance across datasets rather than excelling only in specific scenarios or data distributions, we repeated the experiments on the WiderPerson dataset. Results indicate that RT-DETR (Our) demonstrates high consistency and good generalization across different datasets.
    To assess the practical performance of RT-DETR (Our) in object tracking, we conducted experiments on the MOT16 and MOT17 datasets and evaluated them with MOT metrics. Results show that RT-DETR (Our) generalizes well: its overall performance with BoT-SORT surpasses YOLOv8s, is only slightly lower in real-time performance, and is comparable to YOLOv8m and RT-DETR (ResNet18). Additionally, the experiments indicate that incorporating pedestrian re-identification (ReID) into BoT-SORT incurs high computational cost for limited benefit.
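    MOT-style evaluation accumulates per-frame matches between ground-truth and tracker output and then computes summary scores such as MOTA and IDF1. A minimal sketch using the py-motmetrics package is shown below; the frame data are made up for illustration, and the thesis experiments follow the standard MOT16/MOT17 protocol rather than this toy example:

```python
import motmetrics as mm
import numpy as np

# One accumulator per sequence; auto_id assigns frame ids automatically.
acc = mm.MOTAccumulator(auto_id=True)

# Per frame: ground-truth ids/boxes and tracker ids/boxes in (x, y, w, h).
gt_ids, gt_boxes = [1, 2], np.array([[10, 10, 50, 120], [200, 40, 45, 110]])
tr_ids, tr_boxes = [7, 8], np.array([[12, 11, 48, 118], [205, 42, 44, 108]])

# IoU-based distance matrix; pairs below the overlap threshold count as no match.
dists = mm.distances.iou_matrix(gt_boxes, tr_boxes, max_iou=0.5)
acc.update(gt_ids, tr_ids, dists)

# Compute summary metrics (MOTA, IDF1, ...) over the accumulated frames.
mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["num_frames", "mota", "idf1"], name="seq")
print(mm.io.render_summary(summary, formatters=mh.formatters,
                           namemap=mm.io.motchallenge_metric_names))
```

    The 0.5 threshold mirrors the common MOTChallenge practice of requiring at least 0.5 IoU for a detection to count as a match to a ground-truth track.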

    Abstract (Chinese) I
    Abstract (English) III
    Acknowledgements V
    Glossary of Technical Terms VI
    Table of Contents X
    List of Figures XIV
    List of Tables XVII
    Chapter 1: Introduction 1
        1-1 Research Background 1
        1-2 Research Objectives 2
        1-3 Literature Review 3
            1-3-1 Object Detection 3
            1-3-2 Multi-Object Tracking 6
    Chapter 2: Theoretical Background 9
        2-1 Detector Components and Model Classification 9
            2-1-1 Components of a Detection Model 9
            2-1-2 Backbone 9
            2-1-3 Neck 9
            2-1-4 Head 9
        2-2 Transformer Theory 10
            2-2-1 Overview of the Transformer 10
            2-2-2 Attention Mechanism 10
            2-2-3 Multi-Head Attention 12
            2-2-4 Position-wise Feed-Forward Network 12
            2-2-5 Residual Connections and Normalization 13
        2-3 Transformers in Computer Vision 13
            2-3-1 Vision Transformer (ViT) 14
            2-3-2 Detection Transformer (DETR) 14
            2-3-3 Real-time Detection Transformer (RT-DETR) 17
        2-4 GELAN and ADown in YOLOv9 21
            2-4-1 GELAN Backbone 21
            2-4-2 ADown 22
        2-5 Lightweight Convolution Modules 23
            2-5-1 DWConv 24
            2-5-2 Ghost-Conv 24
            2-5-3 PConv 25
        2-6 Gating Mechanisms 26
            2-6-1 Gated Linear Unit 26
            2-6-2 Convolutional Gated Linear Unit 29
        2-7 Learnable Position Encoding 30
        2-8 Wavelet Pooling (WaveletPool) 31
            2-8-1 Forward Propagation 32
            2-8-2 Backward Propagation 33
        2-9 Multi-Object Tracking Algorithms 33
            2-9-1 SORT 35
            2-9-2 DeepSORT 36
            2-9-3 BoT-SORT 37
    Chapter 3: Lightweight RT-DETR Design and Object Tracking 51
        3-1 Lightweight Backbone Design for RT-DETR 51
        3-2 CCFF Architecture Optimization 53
            3-2-1 CCFF Optimization Based on RepNCSPELAN4 53
            3-2-2 CCFF Optimization Based on WaveletPool 54
        3-3 Optimized RepNCSPELAN4 Architecture Design 55
            3-3-1 Choice of Lightweight Convolution Blocks 55
            3-3-2 PConv-RepNCSPELAN4 57
            3-3-3 Faster-RepNCSPELAN4 57
            3-3-4 CGLU Optimization of Faster-RepNCSPELAN4 58
        3-4 AIFI with Learnable Position Encoding 59
        3-5 BoT-SORT Multi-Object Tracking Based on the Lightweight RT-DETR 60
        3-6 Summary 62
    Chapter 4: Experimental Results 76
        4-1 Experimental Environment and Datasets 76
            4-1-1 Experimental Environment 76
            4-1-2 Datasets 77
        4-2 Evaluation Metrics 79
            4-2-1 Detector Metrics 79
            4-2-2 Tracker Metrics 81
        4-3 Progressive RT-DETR Optimization Experiments on CrowdHuman 85
        4-4 Object Detection Results on CrowdHuman 86
            4-4-1 RT-DETR (ResNet18) vs. Medium-Scale YOLO Models 87
            4-4-2 RT-DETR Variants with Different Lightweight Backbones 89
            4-4-3 RT-DETR (Our) vs. Medium-Scale Models 92
            4-4-4 RT-DETR (Our) vs. Small-Scale YOLO Models 94
            4-4-5 Summary 95
        4-5 Object Detection Results on WiderPerson 97
            4-5-1 RT-DETR (ResNet18) vs. Medium-Scale YOLO Models 97
            4-5-2 RT-DETR (Our) vs. Medium-Scale Models 98
            4-5-3 RT-DETR (Our) vs. Small-Scale YOLO Models 99
            4-5-4 Summary 101
        4-6 Tracking Experiments with Different Detectors on MOT16 101
            4-6-1 Detector Comparison on MOT16 101
            4-6-2 Effect of Adding ReID to the Tracker on MOT16 103
            4-6-3 Tracker Comparison on MOT16 105
            4-6-4 Summary 107
        4-7 Tracking Experiments with Different Detectors on MOT17 108
            4-7-1 Detector Comparison on MOT17 108
            4-7-2 Effect of Adding ReID to the Tracker on MOT17 109
            4-7-3 Tracker Comparison on MOT17 112
            4-7-4 Summary 113
    Chapter 5: Conclusions and Discussion 121
        5-1 Conclusions 121
        5-2 Future Work 124
    References 126

