
Graduate Student: 莫雅雯 (Mo, Ya-Wen)
Thesis Title: Managing Shaping Complexity in Reinforcement Learning with State Machines -- Using Robotic Tasks with Unspecified Repetition as an Example
Advisor: 金仲達 (King, Chung-Ta)
Oral Defense Committee: 朱宏國 (Chu, Hung-Kuo), 江振瑞 (Jiang, Jehn-Ruey)
Degree: Master
Department: Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 110
Language: English
Number of Pages: 42
Keywords: machine learning, reinforcement learning, reward design, reward shaping, robotic control
    In recent years, reinforcement learning has been widely applied in robotics. The goal of reinforcement learning is to obtain a policy that maximizes the cumulative reward, and the agent selects better actions under the guidance of the reward function; the reward function is therefore a key element of reinforcement learning. Sparse rewards are easy to design, but for more complex tasks, such as robotic manipulation tasks with an unspecified number of repetitions, training with sparse rewards means the agent rarely receives useful reward signals. Although this problem can be addressed by adding useful reward signals through reward shaping, most shaped reward functions proposed in prior work target only one specific task. In this thesis, we propose a set of systematic reward-design principles for robotic tasks with unspecified repetition and use the reward machine structure to design the reward functions. Taking repetitive block placement as a representative task, we conduct block-stacking and block-lining-up experiments. The results show that our reward functions achieve success rates of 66% and 81% on the block-stacking and block-lining-up tasks, respectively. Compared with sparser reward functions, which eventually converge to undesired policies, the designed reward functions perform well.
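    The objective stated above, finding a policy that maximizes the cumulative reward, is conventionally written as follows (a standard formulation, assuming an episodic setting with horizon T and discount factor \gamma):

        \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r_{t} \right]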


    Reinforcement learning has been widely used in robotics in recent years. The goal of reinforcement learning is to find a policy that maximizes the cumulative reward, and the agent chooses better actions under the guidance of the reward function. Designing the reward function is therefore critical in reinforcement learning. Sparse rewards are easy to design, but for complex tasks, such as robotic manipulation with an unspecified number of repetitions, the agent rarely receives useful reward signals. Although the problem may be mitigated by dense rewards obtained through reward shaping, common shaping practices are mostly ad hoc. In this thesis, we show that reward shaping can be done more systematically for robotic tasks with unspecified repetition by combining a set of reward-design principles with the reward machine. We applied the principles to block-stacking and block-lining-up tasks. The experimental results show that the designed reward functions achieve success rates of 66% and 81% on the block-stacking and block-lining-up tasks, respectively, far better than sparse rewards, which often converge to unintended policies.
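    A reward machine is, in essence, a finite-state machine whose transitions are triggered by high-level events and emit rewards, which is what makes it convenient for tasks that repeat an unspecified number of times. The sketch below illustrates how such a machine could encode one grasp-and-place cycle of a repetitive block-placement task; the state names, event labels ("grasped", "placed", "all_placed"), and reward values are illustrative assumptions for this example, not the exact design used in the thesis.

        # Minimal reward-machine sketch; states, events, and reward values are
        # illustrative assumptions, not the thesis's actual design.
        class RewardMachine:
            def __init__(self, transitions, initial_state):
                # transitions maps (state, event) -> (next_state, reward)
                self.transitions = transitions
                self.initial_state = initial_state
                self.state = initial_state

            def reset(self):
                self.state = self.initial_state

            def step(self, event):
                # Advance on a high-level event and return the emitted reward;
                # events with no matching transition yield zero reward.
                if (self.state, event) in self.transitions:
                    self.state, reward = self.transitions[(self.state, event)]
                    return reward
                return 0.0

        # One grasp-and-place cycle that can repeat an unspecified number of
        # times; "all_placed" ends the episode with a terminal bonus.
        rm = RewardMachine(
            transitions={
                ("idle", "grasped"): ("holding", 0.1),
                ("holding", "placed"): ("idle", 1.0),   # loops back, allowing repetition
                ("idle", "all_placed"): ("done", 10.0),
            },
            initial_state="idle",
        )

        rm.reset()
        for event in ["grasped", "placed", "grasped", "placed", "all_placed"]:
            print(event, "->", rm.step(event), rm.state)

    In training, the low-level observation is typically augmented with the current machine state, so the agent can distinguish the grasping and placing phases of each repetition while the machine supplies the shaped reward.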

    Chinese Abstract I
    Abstract II
    Acknowledgements III
    Table of Contents IV
    Chapter 1 Introduction 1
    Chapter 2 Related Works 4
        2.1 Block placement tasks 4
        2.2 Reward Design 5
        2.3 Reward Machine 5
    Chapter 3 Preliminary 7
        3.1 Deep Deterministic Policy Gradient 7
        3.2 Combining RL with Demonstration Data 8
    Chapter 4 Method 9
        4.1 Reward design principle 9
        4.2 Repetitive block placement tasks 10
        4.3 Design reward for repetitive block placement tasks 12
    Chapter 5 Experiment 19
        5.1 Environment 19
        5.2 Data Collection and Preprocessing 20
        5.3 Training Details 20
        5.4 Evaluation 21
            5.4.1 Comparison between reward design with and without principle 1 23
            5.4.2 Comparison between reward design with and without principle 2 26
            5.4.3 Comparison between reward design with and without principle 3 30
            5.4.4 Comparison between reward design with and without principle 4 34
    Chapter 6 Conclusions and Future Works 39
    References 41

