| Graduate Student | 林士軒 (Lin, Shih-Hsuan) |
|---|---|
| Thesis Title | 分析並解決在訓練基於模型的強化式學習架構中遞迴網路的梯度爆炸 (Analyzing and Resolving Gradient Exploding in Training Recurrent Networks for Model-based Reinforcement Learning) |
| Advisor | 李濬屹 (Lee, Chun-Yi) |
| Committee Members | 謝秉均 (Hsieh, Ping-Chun), 周志遠 (Chou, Jerry) |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Year of Publication | 2022 |
| Academic Year of Graduation | 111 |
| Language | English |
| Number of Pages | 27 |
| Keywords (Chinese) | 強化式學習 (Reinforcement Learning) |
| Keywords (English) | Reinforcement Learning |
Model-based reinforcement learning frameworks predict how the environment will evolve. Most of them adopt a recurrent network as the world model for inferring latent dynamics, and they therefore inherit the problems of recurrent networks. In this thesis, we analyze the recurrent-network problems in prior methods and propose a method to resolve them. Experiments on the DeepMind Control suite show that the proposed method achieves better average performance than prior methods.
Model-based Reinforcement Learning (MBRL) consumes high-dimensional pixel observations as inputs, and a latent dynamics model encodes these image inputs into a latent state space. Most MBRL frameworks choose a recurrent neural network (RNN) as the world model for inferring the latent dynamics, and this use of recurrent networks carries the known RNN issues over to MBRL. In this thesis, we analyze the RNN issues in Dreamer and propose a method to resolve them. We evaluate the proposed method on the DeepMind Control suite and achieve better overall performance than Dreamer.
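To make the inherited RNN issue concrete, the following is a minimal PyTorch sketch, not the thesis's proposed method: the `LatentDynamics` module, its dimensions, and the 200-step horizon are illustrative assumptions. It unrolls a GRU-based latent dynamics model over a long imagined horizon, backpropagates a prediction loss through the whole unroll, and reports the resulting gradient norm; gradient clipping is applied here only as a commonly used mitigation, not as the method proposed in the thesis.

```python
# Minimal illustration of gradient behavior when backpropagating through a long
# RNN unroll, as in latent-imagination world models. All names and dimensions
# below are illustrative assumptions, not taken from the thesis.
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, action_dim, horizon = 32, 4, 200

class LatentDynamics(nn.Module):
    """Deterministic GRU core that predicts the next latent state and a reward."""
    def __init__(self):
        super().__init__()
        self.cell = nn.GRUCell(latent_dim + action_dim, latent_dim)
        self.reward_head = nn.Linear(latent_dim, 1)

    def rollout(self, h, actions):
        rewards = []
        for a in actions:  # unroll over the imagined horizon
            h = self.cell(torch.cat([h, a], dim=-1), h)
            rewards.append(self.reward_head(h))
        return torch.stack(rewards)

model = LatentDynamics()
h0 = torch.zeros(1, latent_dim)
actions = [torch.randn(1, action_dim) for _ in range(horizon)]

# Backpropagate a prediction loss through the entire unroll.
loss = (model.rollout(h0, actions) - 1.0).pow(2).mean()
loss.backward()

# clip_grad_norm_ returns the total gradient norm before clipping; over long
# horizons this norm can grow sharply, which is the exploding-gradient symptom.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=100.0)
print(f"gradient norm over {horizon} steps: {total_norm.item():.2f}")
```

Swapping `nn.GRUCell` for a plain `nn.RNNCell` and increasing the horizon is a quick way to compare how the gradient norm scales with the unroll length, though the actual analysis and remedy developed in the thesis are not reproduced here.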