| 研究生 (Graduate Student) | 謝濬安 Hsieh, Chun-An |
|---|---|
| 論文名稱 (Thesis Title) | 整合深度學習與大型語言模型辨識人類意圖之人機協作系統研究開發 (Development of a Human-Robot Collaboration System Based on Integrated Deep Learning and Large Language Models for Human Intentions Recognition) |
| 指導教授 (Advisor) | 張禎元 Chang, Jen-Yuan |
| 口試委員 (Committee Members) | 張賢廷 Chang, Hsien-Ting; 馮國華 Feng, Guo-Hua |
| 學位類別 (Degree) | 碩士 Master |
| 系所名稱 (Department) | 工學院 動力機械工程學系 (College of Engineering, Department of Power Mechanical Engineering) |
| 論文出版年 (Year of Publication) | 2024 |
| 畢業學年度 (Academic Year of Graduation) | 113 |
| 語文別 (Language) | 中文 (Chinese) |
| 論文頁數 (Pages) | 100 |
| 中文關鍵詞 (Chinese Keywords) | 深度學習、人類意圖辨識、人機協作、大型語言模型 |
| 外文關鍵詞 (English Keywords) | deep learning, human intent recognition, human-robot collaboration, large language models |
With the continuing development of Industry 4.0, the application of automation technology in manufacturing and industry has advanced dramatically, and human-robot collaboration is expected to play a key role in achieving efficient production and flexible manufacturing. Against this background, this research develops a deep learning system that recognizes human intentions in order to improve the fluency of human-robot collaboration, responding to the vision of Industry 4.0 by combining automation technology with the collaborative abilities of human workers.

This research uses computer vision and deep learning to give the robot perception of the workpieces and humans in its workspace. YOLOv7 (You Only Look Once version 7) and LSTM (Long Short-Term Memory) models recognize the object in the operator's hand, hand gestures, and actions, allowing the robot to infer the human's intention. For example, when a human intends to hand a workpiece to the robot, once the robot confirms this cooperative intention the system switches to collaboration mode, disables collision avoidance, and communicates with the human so that the handover proceeds seamlessly. If the recognized intention is human-robot interaction, a command decomposition model developed in this research on top of an OpenAI model lets the robot understand and execute the human's command. The human intention model developed here reaches an average accuracy of 91% at a recognition frame rate above 40, making human-robot interaction more adaptive and flexible and improving the fluency and efficiency of human-robot collaboration by 28%.
With the continuous development of Industry 4.0, automation technology in manufacturing and industrial fields has seen significant advancements, and human-robot collaboration is expected to play a crucial role in achieving efficient production and flexible manufacturing. This research focuses on developing a deep learning system to recognize human intentions and enhance the fluidity of human-robot collaboration. Employing computer vision and deep learning techniques, specifically YOLOv7 and LSTM models, the system enables robots to perceive objects, gestures, and actions within the workspace, allowing them to understand human intentions. For instance, when a human intends to hand over a workpiece, the robot switches to collaboration mode, disabling collision avoidance for seamless interaction. Additionally, using a command-parsing model developed with OpenAI's language model, the robot can understand and execute human commands. The developed human intention model achieves an average accuracy of 91% at a recognition frame rate above 40, enhancing the fluidity of human-robot collaboration and improving its efficiency by 28%.
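The perception-to-mode-switching loop described in the abstract can be outlined in a short Python sketch. Everything here is illustrative rather than the thesis implementation: the `camera`, `robot`, `detector` (standing in for the YOLOv7 front end), and `pose_extractor` interfaces, the intent label set, and the 30-frame window are assumptions; only the pattern of feeding a sliding window of per-frame features to an LSTM classifier and switching the robot's mode on the predicted intent follows the abstract.

```python
# Minimal sketch of intent-driven mode switching (assumed interfaces, not the thesis code).
from collections import deque

import torch
import torch.nn as nn


class IntentLSTM(nn.Module):
    """Classifies a short window of hand/pose features into intent classes."""

    def __init__(self, feature_dim: int = 63, hidden_dim: int = 128, num_intents: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_intents)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])          # logits over the intent classes


INTENTS = ["idle", "handover", "interaction"]   # assumed label set
WINDOW = 30                                     # assumed frame window


def control_loop(camera, robot, detector, pose_extractor, model: IntentLSTM):
    """Per-frame loop: perceive -> classify intent -> switch robot mode."""
    window = deque(maxlen=WINDOW)
    model.eval()
    for frame in camera:                              # hypothetical frame source
        detections = detector(frame)                  # e.g. YOLOv7 boxes + labels (assumed dicts)
        window.append(pose_extractor(frame))          # e.g. hand-keypoint feature vector
        if len(window) < WINDOW:
            continue
        x = torch.tensor([list(window)], dtype=torch.float32)
        with torch.no_grad():
            intent = INTENTS[model(x).argmax(dim=1).item()]

        holding_workpiece = any(d.get("label") == "workpiece" for d in detections)
        if intent == "handover" and holding_workpiece:
            robot.set_mode("collaboration")           # collision avoidance relaxed for the handover
        elif intent == "interaction":
            robot.set_mode("command")                 # hand the request to the LLM command parser
        else:
            robot.set_mode("avoidance")               # default safety behaviour
```

The design point mirrored from the abstract is that collision avoidance is only relaxed once a handover intention has been positively recognized; any other prediction keeps the robot in its default avoidance behaviour.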
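The command-decomposition step for the human-robot interaction case can likewise be sketched against the OpenAI chat-completions API. The prompt wording, model name, and action vocabulary below are illustrative assumptions, not the prompt design used in the thesis.

```python
# Hypothetical sketch: decompose a natural-language instruction into primitive robot actions.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Decompose the user's instruction into a JSON list of robot actions. "
    'Allowed actions: {"move_to": <target>}, {"pick": <object>}, '
    '{"place": <target>}, {"handover": <object>}. Reply with JSON only.'
)


def decompose_command(instruction: str) -> list[dict]:
    """Ask the language model to split one instruction into primitive steps."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",                     # assumed model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
    )
    # Assumes the model returns bare JSON as instructed by the system prompt.
    return json.loads(response.choices[0].message.content)


# Example (actual output depends on the model):
# decompose_command("Pick up the screwdriver and hand it to me")
# -> [{"pick": "screwdriver"}, {"handover": "screwdriver"}]
```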