Graduate Student: 洪章瑋 Hong, Zhang-Wei
Thesis Title: 深度強化學習中的多元化驅動探索策略 / Diversity-driven Exploration Strategy for Deep Reinforcement Learning
Advisor: 李濬屹 Lee, Chun-Yi
Committee Members: 黃稚存 Huang, Chih-Tsun; 周志遠 Chou, Jerry
Degree: Master (碩士)
Department:
Year of Publication: 2019
Graduation Academic Year: 107
Language: English
Number of Pages: 35
Keywords (Chinese): 強化學習, 深度學習, 機器學習, 探索
Keywords (English): Reinforcement learning, Deep learning, Machine learning, Exploration
Abstract (Chinese):
Efficient exploration strategies remain a highly challenging research problem in reinforcement learning, especially when the environment has a very large state space or a deceptive or sparse reward structure. To address this problem, we propose an exploration method driven by policy diversity, which can be easily combined with both off-policy and on-policy reinforcement learning algorithms. We show that simply adding a regularization term, composed of a distance measure between policies, to the loss function significantly improves a reinforcement learning agent's exploration performance. The improved exploration strategy keeps policy learning from getting trapped in local optima. In addition, we propose an adaptive scaling strategy to further improve performance. We demonstrate superior exploration performance in large 2D gridworlds and a variety of classic benchmark environments, such as Atari 2600 and MuJoCo.
Abstract (English):
Efficient exploration remains a challenging research problem in reinforcement learning, especially when an environment contains large state spaces and deceptive or sparse rewards. To tackle this problem, we present a diversity-driven approach for exploration, which can be easily combined with both off- and on-policy reinforcement learning algorithms. We show that by simply adding a distance-measure regularization to the loss function, the proposed methodology significantly enhances an agent's exploratory behavior and thus prevents the policy from being trapped in local optima. We further propose an adaptive scaling strategy to enhance performance. We demonstrate the effectiveness of our method in huge 2D gridworlds and a variety of benchmark environments, including Atari 2600 and MuJoCo. Experimental results validate that our method outperforms baseline approaches in most tasks in terms of mean scores and exploration efficiency.
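The core idea described in the abstract, augmenting a standard policy loss with a term that rewards divergence from recently stored policies, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the network `PolicyNet`, the KL-based `diversity_term`, the fixed coefficient `alpha`, and the snapshot buffer are all illustrative choices; the actual distance measure and the adaptive scaling rule are specified in the thesis.

```python
# Minimal sketch (not the thesis implementation) of a diversity-driven
# regularizer: the policy loss is modified to L_D = L - alpha * E[D(pi, pi')],
# where D measures the distance between the current policy and recent
# snapshots of it. KL divergence is used here purely for illustration.
import copy
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Tiny categorical policy for a discrete-action task (illustrative)."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return F.log_softmax(self.net(obs), dim=-1)  # log action probabilities


def diversity_term(policy, prior_policies, obs):
    """Mean KL distance between the current policy and stored prior policies."""
    log_p = policy(obs)
    distances = []
    for old in prior_policies:
        with torch.no_grad():
            log_q = old(obs)  # prior snapshots are treated as fixed targets
        # KL(pi_old || pi_current), averaged over the sampled states.
        distances.append(
            F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
        )
    return torch.stack(distances).mean()


# --- toy usage --------------------------------------------------------------
obs_dim, n_actions, batch = 8, 4, 32
policy = PolicyNet(obs_dim, n_actions)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
prior_policies = deque(maxlen=5)   # buffer of recent policy snapshots
alpha = 0.1                        # diversity coefficient; adaptively scaled in the thesis

for step in range(10):
    obs = torch.randn(batch, obs_dim)
    actions = torch.randint(0, n_actions, (batch,))
    returns = torch.randn(batch)   # stand-in for task returns / advantages

    log_probs = policy(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    task_loss = -(log_probs * returns).mean()   # ordinary policy-gradient loss

    if prior_policies:
        # Subtracting the scaled distance term encourages the updated policy
        # to differ from its recent predecessors.
        loss = task_loss - alpha * diversity_term(policy, prior_policies, obs)
    else:
        loss = task_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    prior_policies.append(copy.deepcopy(policy))  # snapshot for later steps
```

The sign convention follows the abstract's description: subtracting the scaled distance term from the task loss pushes the updated policy away from its recent predecessors, which is what discourages premature convergence to a local optimum.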