| Field | Value |
|---|---|
| Graduate Student | 陳彥伃 Chen, Yan-Yu |
| Thesis Title | 多物件無監督強化學習中針對物件之技能發現 / Towards Object-Specific Skill Discovery in Multi-Object Unsupervised Reinforcement Learning |
| Advisor | 金仲達 King, Chung-Ta |
| Committee Members | 邱德泉 Chiu, Te-Chuan; 李皇辰 Lee, Huang-Chen |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Publication Year | 2024 |
| Graduation Academic Year | 113 (ROC calendar) |
| Language | English |
| Number of Pages | 35 |
| Keywords (Chinese) | 強化學習 (reinforcement learning), 技能發現 (skill discovery), 機器人控制 (robot control) |
| Keywords (English) | reinforcement learning, skill discovery, robot control |
In multi-object robotic manipulation tasks, the agent must learn how to interact with the various objects in the environment to accomplish the assigned task. However, as the number and variety of objects involved in the task grow and the dimensionality of the state space increases, learning an interaction policy for each object becomes very difficult, posing a major challenge for learning object-interaction skills. Unsupervised reinforcement learning explores the environment in advance through an intrinsic reward mechanism, and the resulting pretrained model can effectively reduce the complexity and time of downstream task learning. A common unsupervised reinforcement learning approach learns basic skills from the environment by maximizing the mutual information (MI) between skill latent variables and the state space. However, the solution to MI maximization can be overly simple and is not unique, so the learned skills tend to be overly simplistic. To improve skill diversity, prior work proposed Lipschitz-constrained Skill Discovery (LSD), which imposes a Lipschitz condition on the state representation function and maximizes the state change corresponding to each skill latent variable, thereby increasing the skills' coverage of the state space. However, we find that in multi-object manipulation tasks LSD struggles to attend sufficiently to the state changes of a single object. To this end, this thesis proposes a mask-based improvement that separates agent states from object states, enabling the agent to focus on learning skills for interacting with a single object. In addition, we introduce a weighting mechanism that adjusts the intrinsic reward to encourage the agent to explore and master object manipulation skills. Experimental results show that, although a larger number of masks lengthens training time, our method successfully learns object-interaction skills in a kitchen scene that LSD fails to acquire. This method provides a new direction for skill discovery in complex multi-object manipulation tasks. In the future, deep-learning-based adaptive learning of masks and weights can be explored to automatically adjust the mask distribution and dynamically tune the weight values, adapting to the needs of different scenarios.
In multi-object robotic manipulation tasks, an agent must learn diverse interaction strategies with the various objects in the environment to achieve specific goals. As the number of object types grows and the dimensionality of the state space expands, learning an interaction strategy for each object becomes more challenging, increasing the complexity of learning such tasks with reinforcement learning (RL). Unsupervised RL aims to reduce this complexity by first pretraining a model to explore the environment through an intrinsic reward and then adapting the model to downstream tasks, e.g., multi-object robotic manipulation, with dramatically reduced effort. A common strategy for unsupervised RL is to discover low-level primitive skills for the agent by maximizing the mutual information (MI) between skill latent variables and the state space. However, since the solution to MI maximization may be overly simplistic and is not unique, the resulting skills are often not diverse. To enhance skill diversity, Lipschitz-constrained Skill Discovery (LSD) employs a state representation function with a Lipschitz constraint and maximizes state-space variation as its objective, thereby increasing the skills' coverage of the state space. However, we find that LSD tends to ignore single-object state changes in multi-object manipulation tasks. Therefore, in this thesis we propose a masking-based improvement to LSD that separates agent and object states in the environment, enabling the agent to focus on learning interaction skills with individual objects in a multi-object setting. For this idea to work, we additionally propose a weighting mechanism within the mask that emphasizes object state changes over agent state changes, making the learned skills more object-centric. We also adjust the intrinsic reward by weighting object state changes more heavily to promote the exploration of object manipulation skills. Experimental results show that, while training time increases with the number of masks, our method successfully acquires object interaction skills in a kitchen environment that LSD struggles to learn. This approach offers a new direction for skill discovery in complex multi-object manipulation tasks. Future work will explore learnable masks and weights to further improve skill diversity and training efficiency in varying environments.
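To make the objectives above concrete, the following is a minimal NumPy sketch of the kind of intrinsic reward involved. It assumes the LSD reward is the change in a Lipschitz-constrained state representation projected onto the skill vector (per [12]), and it shows one plausible reading of the masked, weighted variant described in this abstract; the function names, the 0/1/2 mask encoding, and the weight value are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def lsd_intrinsic_reward(phi, s, s_next, z):
    """LSD-style intrinsic reward: the change in the state representation phi,
    projected onto the skill vector z. phi is assumed to be trained under a
    1-Lipschitz constraint (e.g., via spectral normalization)."""
    return float(np.dot(phi(s_next) - phi(s), z))

def masked_intrinsic_reward(phi, s, s_next, z, mask, w_obj=5.0):
    """Hypothetical masked, weighted variant sketched from the abstract:
    `mask` marks each state dimension as dropped (0), agent (1), or the target
    object (2); object dimensions are up-weighted by `w_obj` so that object
    state changes dominate agent state changes in the reward."""
    weights = np.where(mask == 2, w_obj, 1.0) * (mask > 0)
    return float(np.dot(phi(weights * s_next) - phi(weights * s), z))

# Illustrative usage: a 6-D state with 3 agent dims and 3 dims for one object.
phi = lambda x: x[:2]                 # stand-in for a learned representation
mask = np.array([1, 1, 1, 2, 2, 2])   # one mask per target object
z = np.array([0.6, 0.8])              # unit-norm skill vector
s, s_next = np.zeros(6), 0.1 * np.ones(6)
r = masked_intrinsic_reward(phi, s, s_next, z, mask)
```

In this reading, one mask is maintained per target object, which is consistent with the abstract's note that training time grows with the number of masks.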
[1] Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms, 2018.
[2] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019.
[3] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019.
[4] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning, 2021.
[5] Pierre-Alexandre Kamienny, Jean Tarbouriech, Alessandro Lazaric, and Ludovic Denoyer. Direct then diffuse: Incremental unsupervised skill discovery for state covering and goal reaching. In International Conference on Learning Representations, 2022.
[6] Saurabh Kumar, Aviral Kumar, Sergey Levine, and Chelsea Finn. One solution is not all you need: Few-shot extrapolation via structured maxent RL. Advances in Neural Information Processing Systems, 33:8198–8210, 2020.
[7] Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, and Pieter Abbeel. URLB: Unsupervised reinforcement learning benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[8] Hao Liu and Pieter Abbeel. Behavior from the void: Unsupervised active pre-training. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
[9] Andrei Lupu, Brandon Cui, Hengyuan Hu, and Jakob Foerster. Trajectory diversity for zero-shot coordination. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7204–7213. PMLR, 18–24 Jul 2021.
[10] A Rupam Mahmood, Dmytro Korenkevych, Brent J Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4635–4640. IEEE, 2018.
[11] Taewook Nam, Shao-Hua Sun, Karl Pertsch, Sung Ju Hwang, and Joseph J Lim. Skill-based meta-reinforcement learning. In International Conference on Learning Representations, 2022.
[12] Seohong Park, Jongwook Choi, Jaekyeom Kim, Honglak Lee, and Gunhee Kim. Lipschitz-constrained unsupervised skill discovery. In International Conference on Learning Representations, 2022.
[13] Seohong Park, Kimin Lee, Youngwoon Lee, and Pieter Abbeel. Controllability-aware unsupervised skill discovery. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
[14] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pages 2778–2787. PMLR, 2017.
[15] Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations, 2020.
[16] Zhihui Xie, Zichuan Lin, Junyou Li, Shuai Li, and Deheng Ye. Pretraining in deep reinforcement learning: A survey. arXiv preprint arXiv:2211.03959, 2022.
[17] Zichun Xu, Yuntao Li, Xiaohang Yang, Zhiyuan Zhao, Lei Zhuang, and Jingdong Zhao. Open-source reinforcement learning environments implemented in MuJoCo with Franka manipulator. In 2024 IEEE International Conference on Advanced Intelligent Mechatronics (AIM), pages 709–714. IEEE, 2024.
[18] Jiachen Yang, Igor Borovikov, and Hongyuan Zha. Hierarchical cooperative multi-agent reinforcement learning with skill discovery. CoRR, abs/1912.03558, 2019.
[19] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Reinforcement learning with prototypical representations. In International Conference on Machine Learning, pages 11920–11931. PMLR, 2021.
[20] Jesse Zhang, Haonan Yu, and Wei Xu. Hierarchical reinforcement learning by discovering intrinsic options. arXiv preprint arXiv:2101.06521, 2021.
[21] Zihan Zhou, Wei Fu, Bingliang Zhang, and Yi Wu. Continuously discovering novel strategies via reward-switching policy optimization. In Deep RL Workshop NeurIPS 2021, 2021.