
Graduate Student: 謝濬安 (Hsieh, Chun-An)
Thesis Title: 整合深度學習與大型語言模型辨識人類意圖之人機協作系統研究開發
Development of a Human-Robot Collaboration System Based on Integrated Deep Learning and Large Language Models for Human Intentions Recognition
Advisor: 張禎元 (Chang, Jen-Yuan)
Committee Members: 張賢廷 (Chang, Hsien-Ting); 馮國華 (Feng, Guo-Hua)
Degree: Master's
Department: Department of Power Mechanical Engineering, College of Engineering
Year of Publication: 2024
Graduation Academic Year: 113 (ROC calendar)
Language: Chinese
Number of Pages: 100
Chinese Keywords: 深度學習、人類意圖辨識、人機協作、大型語言模型
English Keywords: deep learning, human intent recognition, human-robot collaboration, large language models
    With the continuous development of Industry 4.0, automation technology has made remarkable advances in manufacturing and industrial applications, and human-robot collaboration is expected to play a key role in achieving efficient production and flexible manufacturing. Against this background, this research develops a deep learning system that recognizes human intentions in order to improve the fluency of human-robot collaboration, responding to the Industry 4.0 vision of combining automation technology with the collaborative abilities of human workers.
    The research employs computer vision and deep learning so that the robot can perceive the workpieces and humans in its workspace. YOLOv7 (You Only Look Once version 7) and LSTM (Long Short-Term Memory) deep learning models recognize the object held in the human hand, hand gestures, and body actions, allowing the robot to infer the human's intention. For example, when a human intends to hand a workpiece to the robot, the system, once the cooperative intention is confirmed, switches to collaboration mode, disables collision avoidance, and uses human-robot communication to make the handover seamless. If the recognized intention is human-robot interaction, a command decomposition model built in this research on an OpenAI model lets the robot understand and execute the human's command. The developed human intention model achieves an average accuracy of 91% at a recognition rate above 40 frames per second, making interaction more adaptive and flexible and improving the fluency and efficiency of human-robot collaboration by 28%.
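    To make the detection-plus-sequence fusion concrete, the following is a minimal Python sketch of the idea, not the thesis implementation: per-frame YOLOv7 detections are assumed to arrive as labeled boxes, pose keypoints are stacked into a fixed-length window for an LSTM classifier, and the two cues are fused into an intent decision. The class IntentLSTM, the helper holds_workpiece, the label set, feature sizes, and thresholds are all illustrative assumptions.

    import torch
    import torch.nn as nn

    INTENTS = ["no_intent", "handover", "interaction"]   # assumed label set

    class IntentLSTM(nn.Module):
        """LSTM classifier over a window of per-frame pose-keypoint features."""
        def __init__(self, feat_dim=34, hidden=64, n_classes=len(INTENTS)):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x):                  # x: (batch, frames, feat_dim)
            out, _ = self.lstm(x)              # out: (batch, frames, hidden)
            return self.head(out[:, -1])       # classify from the last time step

    def iou(a, b):
        """Intersection over union of two (x1, y1, x2, y2) boxes."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def holds_workpiece(detections, hand_box, iou_thresh=0.3):
        """Heuristic cue: a detected 'workpiece' box overlaps the hand box."""
        return any(d["label"] == "workpiece" and iou(d["box"], hand_box) > iou_thresh
                   for d in detections)

    def decide_intent(model, keypoint_window, detections, hand_box):
        """Fuse the LSTM class with the hand-object cue: reaching while holding
        a workpiece is treated as a handover, otherwise as interaction."""
        with torch.no_grad():
            logits = model(keypoint_window.unsqueeze(0))   # add batch dimension
            label = INTENTS[int(logits.argmax(dim=1))]
        if label == "handover" and not holds_workpiece(detections, hand_box):
            label = "interaction"   # empty-handed reach -> interactive intent
        return label

    if __name__ == "__main__":
        model = IntentLSTM().eval()
        window = torch.randn(30, 34)   # 30 frames of 17 (x, y) keypoints, stand-ins here
        dets = [{"label": "workpiece", "box": (100, 100, 180, 160)}]   # mock YOLOv7 output
        print(decide_intent(model, window, dets, hand_box=(120, 110, 200, 170)))

    In the thesis pipeline such a decision would run once per frame window, with the intent label gating the switch between collision-avoidance and collaboration modes.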


    With the continuous development of Industry 4.0, automation technology in manufacturing and industrial fields has seen significant advancements, and Human-Robot Collaboration is expected to play a crucial role in achieving efficient production and flexible manufacturing. This research focuses on developing a deep learning system to recognize human intentions and enhance the fluidity of human-robot collaboration. Employing computer vision and deep learning techniques, specifically YOLOv7 and LSTM models, the system enables robots to perceive objects, gestures, and actions within the workspace, allowing them to understand human intentions. For instance, when a human intends to hand over an object, the robot switches to collaboration mode, disabling collision avoidance for seamless interaction. Additionally, using a command parsing model developed with OpenAI's language model, robots can understand and execute human commands. The developed human intention model achieves an average accuracy of 91% with a recognition frame rate above 40, enhancing the fluidity of human-robot collaboration and efficiency by 28%.
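    The command-decomposition step can likewise be pictured with a short sketch, again an assumption-laden illustration rather than the thesis code: a natural-language instruction is sent to OpenAI's chat completions API with a prompt that constrains the reply to an ordered list of robot primitives. The primitive vocabulary, prompt wording, and model name (gpt-4o-mini) are assumed here; a real system would map each returned primitive onto the arm's motion controller.

    import json
    from openai import OpenAI   # official openai Python SDK (v1+); requires OPENAI_API_KEY

    # Assumed primitive vocabulary for the robot arm; not taken from the thesis.
    PRIMITIVES = ["move_to(object)", "grasp(object)", "release()", "follow(hand)", "home()"]

    SYSTEM_PROMPT = (
        "You decompose an operator's command for a collaborative robot arm into an "
        "ordered JSON array of primitive calls chosen only from: "
        + ", ".join(PRIMITIVES)
        + ". Reply with the JSON array only, no extra text."
    )

    def parse_command(command: str, model: str = "gpt-4o-mini") -> list:
        """Ask the LLM to split one natural-language command into primitives."""
        client = OpenAI()
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": command},
            ],
        )
        return json.loads(resp.choices[0].message.content)

    if __name__ == "__main__":
        # Expected output shape, e.g.:
        # ["move_to(red_workpiece)", "grasp(red_workpiece)", "move_to(operator_hand)", "release()"]
        print(parse_command("Pick up the red workpiece and hand it to me."))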

    Abstract (Chinese) I
    Abstract (English) II
    Acknowledgements III
    Table of Contents IV
    List of Figures VII
    List of Tables X
    Chapter 1  Introduction 1
      1.1 Preface 1
      1.2 Research Motivation 2
      1.3 Literature Review 5
        1.3.1 Human-Robot Relationships 5
        1.3.2 Devices for Recognizing Human Intentions 9
        1.3.3 Methods for Recognizing Human Intentions 11
        1.3.4 Fluency in Human-Robot Collaboration 17
        1.3.5 Deep Learning Models for Recognizing Human Intentions 22
      1.4 Research Problems and Objectives 33
      1.5 Research Methods and Procedure 34
      1.6 Chapter Organization 36
      1.7 Expected Outcomes 37
      1.8 Research Sponsorship and Outputs 37
    Chapter 2  Architecture of the Human Intention Recognition System 38
      2.1 Preface 38
      2.2 Design Method of the Human Intention Recognition System 38
        2.2.1 Determining Human Approach 39
        2.2.2 Determining Whether the Human Hand Is Grasping a Target Object 39
        2.2.3 Determining Whether the Human Is Reaching Out 39
      2.3 Analysis and Comparison of Human Intention Recognition Systems 40
      2.4 Human-Robot Collaboration System Architecture 45
      2.5 Chapter Summary 46
    Chapter 3  Construction of the Human Intention Recognition Model 47
      3.1 Preface 47
      3.2 Hardware 47
      3.3 Selected Deep Learning Models 48
        3.3.1 YOLOv7 (You Only Look Once version 7) 48
        3.3.2 LSTM (Long Short-Term Memory) 51
      3.4 Building the Human Intention Recognition Deep Learning Model 53
        3.4.1 Workpiece Recognition Model 54
        3.4.2 Action Recognition Model 56
        3.4.3 Workpiece Pick-up Recognition Mechanism 59
        3.4.4 Refinement of the Recognition Models 60
      3.5 Integrating Deep Model Outputs into Human Intentions 61
      3.6 Chapter Summary 64
    Chapter 4  Human-Robot Dynamic Interaction Based on Language Command Parsing 65
      4.1 Preface 65
      4.2 Bidirectional Human-Robot Communication 66
        4.2.1 Parsing Human Language Commands 66
        4.2.2 Robot Voice Guidance and Responses 68
      4.3 Human-Robot Dynamic Interaction 69
        4.3.1 Object Recognition 70
        4.3.2 Relative Distance Calculation 73
        4.3.3 Reaching a Specified Position 74
        4.3.4 Robot Arm Motion Control (Following, Grasping, etc.) 75
      4.4 Chapter Summary 76
    Chapter 5  Human-Robot Collaboration Experiments and Analysis 77
      5.1 Preface 77
      5.2 Human Intention Recognition Results 77
        5.2.1 Workpiece Recognition Model 77
        5.2.2 Human Intention Recognition Model 84
      5.3 Human-Robot Dynamic Interaction Results 86
      5.4 Human-Robot Collaboration System Results 89
      5.5 Chapter Summary 91
    Chapter 6  Conclusion 92
      6.1 Summary 92
      6.2 Contributions 93
      6.3 Future Work 94
    References 96

