
Graduate Student: 蘇維哲 (Su, Wei-Che)
Thesis Title: 改進多目標強化學習CURIOUS方法的新模組選擇策略
A New Module-Selection Policy to Enhance CURIOUS for Modular Multi-Goal Reinforcement Learning
Advisor: 金仲達 (King, Chung-Ta)
Oral Examination Committee: 張貴雲 (Chang, Guey-Yun), 朱宏國 (Chu, Hung-Kuo)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Publication Year: 2023
Graduation Academic Year: 111
Language: English
Number of Pages: 28
Chinese Keywords: 強化學習 (reinforcement learning), 多臂吃角子老虎機 (multi-armed bandit)
English Keywords: Reinforcement learning, Multi-armed bandit
    In open-ended environments, it is important that autonomous agents can keep
    learning and refining their skills to adapt to gradually changing conditions. In
    general, as the environment changes, some goals may become easier for the agent
    while others become harder or even impossible to learn. The agent therefore needs
    to know its learning progress at any time and automatically choose which goal to
    practice. A recent work, CURIOUS, proposes a framework for continual modular
    multi-goal reinforcement learning that uses a proportional probability-matching
    method as the module-selection policy to automatically determine the training
    order of the modules. While it considers the relative relationship among the
    learning progresses of the modules, CURIOUS overlooks the importance of their
    actual magnitudes. In this thesis, we propose a new module-selection policy based
    on reinforcement learning. Our method improves on CURIOUS by using a neural
    network to predict the changing trend of the learning progress, while taking into
    account both the relative relationship and the actual magnitude of the learning
    progress of the different modules. Experiments show that our method is more
    stable than the original CURIOUS method and can effectively accelerate the
    training process.
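
    The module-selection rule that the abstract attributes to CURIOUS, selecting each
    module with probability proportional to its recent learning progress, can be
    illustrated with a short sketch. This is not the author's code; the uniform mixing
    weight and the toy learning-progress values are assumptions made for the example.

    import numpy as np

    def module_selection_probs(learning_progress, epsilon=0.1):
        """Proportional probability matching over per-module learning progress.

        Each module is chosen with probability proportional to the absolute value
        of its recent learning progress, mixed with a uniform term so that modules
        with zero measured progress are still occasionally revisited. The epsilon
        mixing weight is an illustrative assumption, not a value from the thesis.
        """
        lp = np.abs(np.asarray(learning_progress, dtype=float))
        n = len(lp)
        if lp.sum() == 0.0:
            return np.full(n, 1.0 / n)        # no signal yet: choose uniformly
        proportional = lp / lp.sum()          # only the relative sizes matter here
        return epsilon * (1.0 / n) + (1.0 - epsilon) * proportional

    # Example: three modules with recent learning-progress estimates.
    lp_estimates = [0.02, 0.10, 0.00]
    probs = module_selection_probs(lp_estimates)
    chosen_module = np.random.choice(len(probs), p=probs)

    Because only the ratios between the values matter in this rule, scaling every
    module's learning progress by the same factor leaves the selection probabilities
    unchanged; this insensitivity to the absolute magnitudes is exactly what the
    thesis sets out to address.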


    In open-ended environments, it is important for autonomous agents to continually
    learn to master their skills in order to adapt to changing environments. In
    general, as the environments change, some of the goals for the agents to achieve
    may become easier and some become more difficult or even impossible. The agents
    thus have to know their current learning progresses and autonomously select which
    goal to practice at any moment. A recent work, CURIOUS (Continual Universal
    Reinforcement learning with Intrinsically mOtivated sUbstitutionS), proposes a
    framework for continual modular multi-goal reinforcement learning (RL) that uses
    a proportional probability-matching method as the module-selection policy to
    determine the training order automatically. While considering the relative
    learning progresses of the modules, CURIOUS overlooks the importance of the
    absolute learning progresses. In this thesis, a new module-selection policy using
    reinforcement learning is proposed. Our method improves CURIOUS by predicting the
    changing trend of the learning progresses with a neural network and considering
    the proportional as well as absolute learning progresses of the different
    modules. Experiments show that our method is more stable and can accelerate the
    training process more effectively than CURIOUS.
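
    The thesis's module-selection network itself is not reproduced in this record, so
    the following is only a rough sketch of the idea described above: a small neural
    network scores each module from a window of its recent learning-progress values
    and is updated with a REINFORCE-style rule. The layer sizes, window length, and
    reward definition are illustrative assumptions, not the author's design.

    import torch
    import torch.nn as nn

    # Assumed sizes: n_modules modules, with the last `window` learning-progress
    # values of every module fed to the network.
    n_modules, window = 3, 5

    # Small MLP mapping the recent learning-progress history of all modules to one
    # score per module; module selection samples from the softmax over the scores.
    policy = nn.Sequential(
        nn.Linear(n_modules * window, 32),
        nn.ReLU(),
        nn.Linear(32, n_modules),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def select_module(lp_history):
        """lp_history: tensor of shape (n_modules, window) with recent LP values."""
        logits = policy(lp_history.flatten())
        dist = torch.distributions.Categorical(logits=logits)
        module = dist.sample()
        return module.item(), dist.log_prob(module)

    def update(log_prob, reward):
        """REINFORCE-style update: reinforce choices that paid off, with the reward
        taken here to be the observed competence gain of the chosen module."""
        loss = -log_prob * reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Toy usage: module 1 is currently improving fastest and should be favoured.
    lp_history = torch.tensor([[0.00, 0.01, 0.01, 0.02, 0.02],
                               [0.05, 0.07, 0.10, 0.12, 0.15],
                               [0.00, 0.00, 0.00, 0.00, 0.00]])
    module, log_prob = select_module(lp_history)
    observed_gain = float(lp_history[module, -1] - lp_history[module, 0])
    update(log_prob, observed_gain)

    Feeding a history window rather than a single scalar is what allows such a
    network to respond to the changing trend of the learning progress as well as to
    its absolute level, which is the property the abstract emphasizes.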

    Acknowledgements
    Abstract (Chinese)
    Abstract
    1 Introduction
    2 Related Work
      2.1 Multi-Armed Bandit Problem
      2.2 CURIOUS
    3 Method
      3.1 Problem Formulation
      3.2 Training Flow
      3.3 Module-Selection Network
      3.4 Dealing with the Non-stationary Environment
    4 Experiments
      4.1 Experimental Setup
      4.2 Inner Working of the Proposed Method
      4.3 Comparison with CURIOUS
      4.4 Sensory Perturbation
      4.5 Distracting Modules
    5 Conclusion
    References

    [1] C. Colas, P. Fournier, M. Chetouani, O. Sigaud, and P.-Y. Oudeyer, “CURIOUS: Intrinsically motivated modular multi-goal reinforcement learning,” in International Conference on Machine Learning, pp. 1331–1340, 2019.
    [2] A. Slivkins et al., “Introduction to multi-armed bandits,” Foundations and Trends® in Machine Learning, vol. 12, no. 1-2, pp. 1–286, 2019.
    [3] S. Forestier, R. Portelas, Y. Mollard, and P.-Y. Oudeyer, “Intrinsically motivated goal exploration processes with automatic curriculum learning,” arXiv preprint arXiv:1708.02190, 2017.
    [4] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
    [5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.
    [6] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, IEEE, 2012.
    [7] V. Kuleshov and D. Precup, “Algorithms for multi-armed bandit problems,” arXiv preprint arXiv:1402.6028, 2014.
    [8] S. Agrawal and N. Goyal, “Analysis of Thompson sampling for the multi-armed bandit problem,” in Conference on Learning Theory, pp. 39–1, JMLR Workshop and Conference Proceedings, 2012.
    [9] D. E. Koulouriotis and A. Xanthopoulos, “Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems,” Applied Mathematics and Computation, vol. 196, no. 2, pp. 913–922, 2008.
    [10] A. Garivier and E. Moulines, “On upper-confidence bound policies for non-stationary bandit problems,” arXiv preprint arXiv:0805.3415, 2008.
    [11] D. Thierens, “An adaptive pursuit strategy for allocating operator probabilities,” in 7th Conference on Genetic and Evolutionary Computation, pp. 1539–1546, 2005.
    [12] C. Hartland, N. Baskiotis, S. Gelly, M. Sebag, and O. Teytaud, “Change point detection and meta-bandits for online learning in dynamic environments,” in CAp 2007: 9è Conférence francophone sur l’apprentissage automatique, pp. 237–250, 2007.
    [13] D. V. Hinkley, “Inference about the change-point from cumulative sum tests,” Biometrika, vol. 58, no. 3, pp. 509–523, 1971.
    [14] J. Mellor and J. Shapiro, “Thompson sampling in switching environments with Bayesian online change detection,” in Artificial Intelligence and Statistics, pp. 442–450, 2013.
    [15] R. Allesiardo and R. Féraud, “Exp3 with drift detection for the switching bandit problem,” in 2015 IEEE International Conference on Data Science and Advanced Analytics, pp. 1–7, IEEE, 2015.
    [16] T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal value function approximators,” in International Conference on Machine Learning, pp. 1312–1320, PMLR, 2015.
    [17] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” Advances in Neural Information Processing Systems, vol. 30, 2017.
    [18] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
