Graduate Student: 洪章瑋 Hong, Zhang-Wei
Thesis Title: 深度強化學習中的多元化驅動探索策略 / Diversity-driven Exploration Strategy for Deep Reinforcement Learning
Advisor: 李濬屹 Lee, Chun-Yi
Committee Members: 黃稚存 Huang, Chih-Tsun; 周志遠 Chou, Jerry
Degree: Master (碩士)
Department:
Year of Publication: 2019
Graduation Academic Year: 107
Language: English
Number of Pages: 35
Keywords (Chinese): 強化學習, 深度學習, 機器學習, 探索
Keywords (English): Reinforcement learning, Deep learning, Machine learning, Exploration
Abstract (Chinese):
Efficient exploration strategies remain a highly challenging research problem in reinforcement learning, especially when the environment has a very large state space or a deceptive or sparse reward structure. To address this problem, we propose an exploration method driven by policy diversity, which can be easily combined with both off-policy and on-policy reinforcement learning algorithms. We show that simply adding a regularization term, composed of a distance measure between policies, to the loss function significantly improves a reinforcement learning agent's exploration performance. The improved exploration strategy keeps policy learning from getting trapped in local optima. In addition, we propose an adaptive scaling strategy to further improve performance. We demonstrate superior exploration performance in large 2D gridworlds and a variety of classic benchmark environments, such as Atari 2600 and MuJoCo.
Abstract (English):
Efficient exploration remains a challenging research problem in reinforcement learning, especially when an environment contains large state spaces and deceptive or sparse rewards. To tackle this problem, we present a diversity-driven approach for exploration, which can be easily combined with both off- and on-policy reinforcement learning algorithms. We show that by simply adding a distance-measure regularization to the loss function, the proposed methodology significantly enhances an agent's exploratory behavior and thus prevents the policy from being trapped in local optima. We further propose an adaptive scaling strategy to enhance performance. We demonstrate the effectiveness of our method in huge 2D gridworlds and a variety of benchmark environments, including Atari 2600 and MuJoCo. Experimental results validate that our method outperforms baseline approaches in most tasks in terms of mean scores and exploration efficiency.
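The core idea described in the abstract, augmenting a standard policy loss with a term that rewards divergence from recently stored policies, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the network `PolicyNet`, the KL-based `diversity_term`, the fixed coefficient `alpha`, and the snapshot buffer are all illustrative choices; the actual distance measure and the adaptive scaling rule are specified in the thesis.

```python
# Minimal sketch (not the thesis implementation) of a diversity-driven
# regularizer: the policy loss is modified to L_D = L - alpha * E[D(pi, pi')],
# where D measures the distance between the current policy and recent
# snapshots of it. KL divergence is used here purely for illustration.
import copy
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Tiny categorical policy for a discrete-action task (illustrative)."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return F.log_softmax(self.net(obs), dim=-1)  # log action probabilities


def diversity_term(policy, prior_policies, obs):
    """Mean KL distance between the current policy and stored prior policies."""
    log_p = policy(obs)
    distances = []
    for old in prior_policies:
        with torch.no_grad():
            log_q = old(obs)  # prior snapshots are treated as fixed targets
        # KL(pi_old || pi_current), averaged over the sampled states.
        distances.append(
            F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
        )
    return torch.stack(distances).mean()


# --- toy usage --------------------------------------------------------------
obs_dim, n_actions, batch = 8, 4, 32
policy = PolicyNet(obs_dim, n_actions)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
prior_policies = deque(maxlen=5)   # buffer of recent policy snapshots
alpha = 0.1                        # diversity coefficient; adaptively scaled in the thesis

for step in range(10):
    obs = torch.randn(batch, obs_dim)
    actions = torch.randint(0, n_actions, (batch,))
    returns = torch.randn(batch)   # stand-in for task returns / advantages

    log_probs = policy(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    task_loss = -(log_probs * returns).mean()   # ordinary policy-gradient loss

    if prior_policies:
        # Subtracting the scaled distance term encourages the updated policy
        # to differ from its recent predecessors.
        loss = task_loss - alpha * diversity_term(policy, prior_policies, obs)
    else:
        loss = task_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    prior_policies.append(copy.deepcopy(policy))  # snapshot for later steps
```

The sign convention follows the abstract's description: subtracting the scaled distance term from the task loss pushes the updated policy away from its recent predecessors, which is what discourages premature convergence to a local optimum.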