
Graduate Student: 蕭翔 (Hsiao, Hsiang)
Thesis Title: 基於優化版YOLO架構之輕量化物件檢測系統設計與實現
(Design and Implementation of a Lightweight Object Detection System Based on an Optimized YOLO Architecture)
Advisor: 馬席彬 (Ma, Hsi-Pin)
Committee Members: 黃稚存 (Huang, Chih-Tsun), 蔡佩芸 (Tsai, Pei-Yun)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2025
Academic Year of Graduation: 113 (ROC calendar)
Language: Chinese
Number of Pages: 89
Chinese Keywords: 物件偵測 (object detection), 人工智慧 (artificial intelligence), 神經網路 (neural network)
English Keywords: Object Detection, Artificial Intelligence, Neural Network
With the advent of the Industry 4.0 era, automated production lines in smart factories place increasingly demanding requirements on the real-time performance and accuracy of machine vision systems, and in robotic arm applications, recognizing target objects promptly and accurately has become critical. In practice, however, existing object detection systems often deliver unstable detection performance because of limited computational resources, variable lighting conditions, and object occlusion or reflection.
Building on the lightweight object detection model YOLOv7-Tiny, this study proposes a deeply streamlined and performance-enhanced object detection method. Depthwise separable convolutions are used to reduce computational complexity, and a squeeze-and-excitation attention mechanism is introduced to strengthen feature extraction; the resulting model, which combines depthwise separable convolutions with squeeze-and-excitation attention, is named DW-SE-YOLO. In addition, diverse data augmentation strategies are adopted so that the model maintains stable detection performance under varying lighting and noise conditions.
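To make the two architectural ingredients concrete, the following is a minimal PyTorch-style sketch of a depthwise separable convolution followed by a squeeze-and-excitation (SE) block. The channel sizes, reduction ratio, activation choice, and module names are illustrative assumptions for this sketch and are not taken from the thesis's actual DW-SE-YOLO configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()  # illustrative choice; YOLO variants use SiLU or LeakyReLU

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling -> bottleneck MLP -> channel gates."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight each channel by its learned importance

# Example: one lightweight block as it might appear inside a backbone stage.
block = nn.Sequential(DepthwiseSeparableConv(64, 128), SEBlock(128))
y = block(torch.randn(1, 64, 80, 80))   # -> torch.Size([1, 128, 80, 80])
```

The depthwise plus pointwise split is what cuts the multiply-accumulate count relative to a standard 3x3 convolution, while the SE gate adds only a small fully connected bottleneck per stage.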
Experimental results from four-fold cross-validation and ablation studies confirm the effectiveness of the depthwise separable convolutions and the attention mechanism. The model reaches a mean average precision of 0.976 at an IoU threshold of 0.5, a real-time detection speed of about 109.5 frames per second, and a parameter count reduced to roughly 5.42 million. In addition, a cloud deployment architecture based on a lightweight web framework is designed to combine the detection system with a robotic arm, enabling real-time object detection and manipulation control.
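For reference, the mAP@0.5 figure counts a predicted box as a true positive only when its intersection over union (IoU) with a matching ground-truth box is at least 0.5. The sketch below shows that matching criterion; the (x1, y1, x2, y2) box format and the 0.5 threshold are the standard conventions rather than thesis-specific details.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts toward mAP@0.5 only if IoU >= 0.5 with a
# not-yet-matched ground-truth box of the same class.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333..., so not a match at 0.5
```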
Future work will focus on optimizing the network structure, exploring more efficient feature extraction methods, and integrating multimodal sensing techniques to meet the more diverse application requirements of industrial scenarios, offering new directions for the lightweight and practical development of industrial vision systems.


With the advent of Industry 4.0, smart factory production lines have placed increasingly stringent demands on the real-time performance and accuracy of machine vision systems, particularly in robotic arm applications. However, existing detection frameworks often exhibit unstable performance due to limited computational resources, varying lighting conditions, and object occlusions or reflections.
This study proposes an enhanced lightweight object detection method based on YOLOv7-Tiny. The model, named DW-SE-YOLO, incorporates depthwise separable convolutions to reduce computational complexity and integrates a squeeze-and-excitation attention mechanism to enhance feature extraction efficiency. Diverse data augmentation strategies are employed to maintain stable detection performance under various environmental conditions.
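As an illustration of lighting- and noise-oriented augmentation, the snippet below applies random brightness/contrast jitter and additive Gaussian noise with torchvision. These particular transforms and parameter values are assumptions made for demonstration; the thesis's actual augmentation pipeline may use different operations (for example, YOLO-style mosaic or mixup).

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to a tensor image in [0, 1]."""
    def __init__(self, std=0.02):
        self.std = std

    def __call__(self, img):
        return torch.clamp(img + torch.randn_like(img) * self.std, 0.0, 1.0)

# Photometric augmentation meant to mimic changing lighting and sensor noise.
# Bounding-box coordinates are unaffected because no geometry is altered.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3, hue=0.05),
    transforms.ToTensor(),            # PIL image -> float tensor in [0, 1]
    AddGaussianNoise(std=0.02),
])

# Usage: augmented = augment(pil_image)  # re-sampled independently each epoch
```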

Experimental results from four-fold cross-validation and ablation studies validate the effectiveness of our approach. The model achieves a mean average precision of 0.976 at an IoU threshold of 0.5, processes 109.5 frames per second, and reduces the parameter count to 5.42 million. A lightweight web-based deployment architecture integrates the detection system with robotic arms for real-time object detection and control.
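The deployment description suggests a small HTTP service sitting in front of the detector. Below is a minimal Flask-style sketch of such a service; the abstract only says "a lightweight web framework," so Flask, the /detect route, the JSON schema, and the run_detection stub are all illustrative assumptions, and in a real system the stub would invoke the trained DW-SE-YOLO model and forward results to the robotic arm controller.

```python
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

def run_detection(image):
    """Placeholder for model inference. A real deployment would run the
    trained detector here and return (label, confidence, box) entries."""
    return [{"label": "part_a", "confidence": 0.97, "box": [120, 80, 260, 210]}]

@app.route("/detect", methods=["POST"])
def detect():
    # Expect a multipart upload with the image under the "image" field.
    if "image" not in request.files:
        return jsonify({"error": "no image uploaded"}), 400
    image = Image.open(request.files["image"].stream).convert("RGB")
    detections = run_detection(image)
    # The caller (e.g., the robot-arm controller) picks its target from this list.
    return jsonify({"detections": detections})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```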
Future research will focus on optimizing the network architecture and exploring efficient feature extraction methods to adapt to diverse industrial applications, providing new insights for industrial vision system development.

Table of Contents

Acknowledgements
Abstract (Chinese)
Abstract
Chapter 1  Introduction
  1.1 Research Background
  1.2 Research Motivation
  1.3 Thesis Organization
Chapter 2  Literature Review
  2.1 Background of Object Detection
  2.2 Deep Learning Object Detection Architectures
    2.2.1 Two-Stage Detectors
    2.2.2 Architectural Evolution of One-Stage Detectors
  2.3 Foundations of the YOLO Series
    2.3.1 YOLOv2 and YOLOv3
  2.4 Modern YOLO Architectures
    2.4.1 YOLOv4
    2.4.2 YOLOv7
  2.5 Lightweight Model Design and Optimization
    2.5.1 MobileNets
  2.6 YOLOv7-tiny
  2.7 Attention Mechanisms in Object Detection
    2.7.1 Squeeze-and-Excitation Attention
  2.8 Literature Analysis
    2.8.1 Choice of Task Type
    2.8.2 Discussion of Object Detection Algorithms
Chapter 3  Methodology
  3.1 System Architecture
  3.2 System Workflow
  3.3 Data Processing
    3.3.1 Data Annotation
    3.3.2 Data Augmentation
  3.4 DW-SE-YOLO Neural Network Model
    3.4.1 Training Procedure
    3.4.2 Cross-Validation
    3.4.3 Network Architecture
    3.4.4 Loss Function
    3.4.5 Object Detection Inference
Chapter 4  Experimental Results and Discussion
  4.1 Experimental Environment
    4.1.1 Hardware Environment
    4.1.2 Software Environment
    4.1.3 Data Collection
    4.1.4 Hyperparameter Settings
  4.2 Model Evaluation Metrics
  4.3 Comparison and Analysis of Model Loss Curves
  4.4 Analysis of Model Parameter Count and Computational Cost
  4.5 Ablation Study and Performance Analysis of the DW-SE-YOLO Architecture
  4.6 Comparison with Other Works on the In-House Dataset
  4.7 Comparison with Other Works on the COCO Dataset
  4.8 Comparison of Inference Performance with Other Works
  4.9 Practical Deployment and Application
Chapter 5  Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References

