
Student: 馮謙 (Feng, Chien)
Thesis title: 基於能量正規化流的最大熵強化學習 (Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow)
Advisor: 李濬屹 (Lee, Chun-Yi)
Committee members: 周志遠 (Chou, Jerry); 陳奕廷 (Chen, Yi-Ting)
Degree: Master
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 49
Keywords (Chinese): 強化學習 (Reinforcement Learning), 正規化流 (Normalizing Flow)
Keywords (English): Reinforcement Learning, Normalizing Flow
    Abstract (Chinese): Existing maximum-entropy (MaxEnt) reinforcement learning (RL) methods for
    continuous action spaces are typically built on the actor-critic framework and optimized by
    alternating policy evaluation and policy improvement steps. In the policy evaluation step, the
    critic is updated to capture the soft Q-function. In the policy improvement step, the actor is
    adjusted according to the updated soft Q-function. In this paper, we introduce a new MaxEnt RL
    framework modeled with Energy-Based Normalizing Flows (EBFlow). This framework integrates the
    policy evaluation and policy improvement steps into a single-objective training process. Our
    method computes the soft value function used in the policy evaluation target without Monte
    Carlo approximation. In addition, this design supports modeling multi-modal action distributions
    while enabling efficient action sampling. To evaluate the performance of our method, we
    conducted experiments on the MuJoCo benchmark suite and several high-dimensional robotic tasks
    simulated in Omniverse Isaac Gym. The evaluation results show that our method outperforms
    widely adopted representative baselines.
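
    For context, the "soft Q-function" and "soft value function" mentioned above are standard
    objects in the MaxEnt RL literature (e.g., soft Q-learning and soft actor-critic); the equations
    below restate those standard definitions and are not reproduced from the thesis itself. The
    entropy-regularized objective and the soft Bellman quantities are usually written as

    \pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_{\pi}}
      \big[ r(\mathbf{s}_t, \mathbf{a}_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid \mathbf{s}_t)\big) \big],

    Q_{\mathrm{soft}}(\mathbf{s}, \mathbf{a}) = r(\mathbf{s}, \mathbf{a})
      + \gamma \, \mathbb{E}_{\mathbf{s}'} \big[ V_{\mathrm{soft}}(\mathbf{s}') \big],
    \qquad
    V_{\mathrm{soft}}(\mathbf{s}) = \alpha \log \int_{\mathcal{A}}
      \exp\!\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(\mathbf{s}, \mathbf{a}) \Big) \, d\mathbf{a}.

    The log-partition integral in V_soft is the term that methods such as SQL and SAC estimate with
    Monte Carlo samples; the claim above is that parameterizing the policy with an EBFlow makes this
    quantity available in closed form.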


    Abstract: Existing Maximum-Entropy (MaxEnt) Reinforcement Learning (RL) methods
    for continuous action spaces are typically formulated based on actor-critic frameworks and optimized through alternating steps of policy evaluation and policy
    improvement. In the policy evaluation steps, the critic is updated to capture the soft
    Q-function. In the policy improvement steps, the actor is adjusted in accordance
    with the updated soft Q-function. In this paper, we introduce a new MaxEnt
    RL framework modeled using Energy-Based Normalizing Flows (EBFlow). This
    framework integrates the policy evaluation steps and the policy improvement steps,
    resulting in a single-objective training process. Our method enables the calculation
    of the soft value function used in the policy evaluation target without Monte
    Carlo approximation. Moreover, this design supports the modeling of multi-modal
    action distributions while facilitating efficient action sampling. To evaluate the
    performance of our method, we conducted experiments on the MuJoCo benchmark
    suite and a number of high-dimensional robotic tasks simulated by Omniverse
    Isaac Gym. The evaluation results demonstrate that our method achieves superior
    performance compared to widely adopted representative baselines.
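
    To make the last point concrete, below is a minimal, hypothetical PyTorch sketch of a
    state-conditioned normalizing-flow policy built from affine coupling layers. It illustrates only
    the generic flow mechanism the abstract relies on: actions are drawn with a single forward pass,
    and the exact log-density log pi(a|s) comes from the change-of-variables formula, so no Monte
    Carlo estimate of the density is needed. It is not the MEow/EBFlow architecture from the thesis,
    and every class, function, and hyperparameter name here is illustrative.

# Minimal sketch of a state-conditioned affine-coupling flow policy (illustrative only;
# not the thesis's MEow/EBFlow model). Real flows also permute dimensions between
# layers; that detail is omitted here for brevity.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """Transforms the second half of the vector, conditioned on the first half and the state."""

    def __init__(self, action_dim, state_dim, hidden=64):
        super().__init__()
        self.half = action_dim // 2
        out = action_dim - self.half
        self.net = nn.Sequential(
            nn.Linear(self.half + state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * out),
        )

    def forward(self, z, state):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, state], dim=-1)).chunk(2, dim=-1)
        scale = torch.tanh(scale)                      # keeps the Jacobian well conditioned
        a2 = z2 * torch.exp(scale) + shift
        return torch.cat([z1, a2], dim=-1), scale.sum(dim=-1)   # output, log|det J|

    def inverse(self, a, state):
        a1, a2 = a[:, :self.half], a[:, self.half:]
        scale, shift = self.net(torch.cat([a1, state], dim=-1)).chunk(2, dim=-1)
        scale = torch.tanh(scale)
        z2 = (a2 - shift) * torch.exp(-scale)
        return torch.cat([a1, z2], dim=-1), -scale.sum(dim=-1)


class FlowPolicy(nn.Module):
    """pi(a|s): a standard normal base distribution pushed through coupling layers."""

    def __init__(self, action_dim, state_dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(action_dim, state_dim) for _ in range(n_layers)]
        )
        self.base = torch.distributions.Normal(0.0, 1.0)
        self.action_dim = action_dim

    def sample(self, state):
        # Efficient sampling: one pass through the layers, no iterative procedure.
        z = self.base.sample((state.shape[0], self.action_dim))
        log_prob = self.base.log_prob(z).sum(dim=-1)
        for layer in self.layers:
            z, log_det = layer(z, state)
            log_prob = log_prob - log_det              # change-of-variables correction
        return z, log_prob                             # action and its exact log pi(a|s)

    def log_prob(self, action, state):
        # Exact density of an arbitrary action via the inverse pass (no sampling involved).
        log_det_total = torch.zeros(action.shape[0])
        for layer in reversed(self.layers):
            action, log_det = layer.inverse(action, state)
            log_det_total = log_det_total + log_det
        return self.base.log_prob(action).sum(dim=-1) + log_det_total


if __name__ == "__main__":
    policy = FlowPolicy(action_dim=4, state_dim=8)
    s = torch.randn(16, 8)
    a, logp = policy.sample(s)
    # The two density evaluations agree (up to float error): the log-density is exact.
    print(float((logp - policy.log_prob(a, s)).abs().max()))

    With enough coupling layers, such a conditional flow can also represent multi-modal action
    distributions, which is the modeling property the abstract highlights.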

    Contents
    Acknowledgements (Chinese)
    Abstract (Chinese)
    Abstract
    Contents
    1 Introduction
    2 Background and Related Works
      2.1 Maximum Entropy Reinforcement Learning
      2.2 Actor-Critic Frameworks and Soft Value Estimation in MaxEnt RL
      2.3 Energy-Based Normalizing Flows
    3 Methodology
      3.1 MaxEnt RL via EBFlow
      3.2 Techniques for Improving the Training Process of MEow
      3.3 Algorithm Summary
    4 Experiments
      4.1 Evaluation on a Multi-Goal Environment
      4.2 Performance Comparison on the MuJoCo Environments
      4.3 Performance Comparison on the Omniverse Isaac Gym Environments
      4.4 Ablation Analysis
    5 Conclusion
    References
    A Appendix
      A.1 The Soft Value Estimation Methods in SAC and SQL
      A.2 Properties of MEow and Learnable Reward Shifting
      A.3 Inference using a Deterministic Policy in MEow
      A.4 The Issue of Numerical Instability
      A.5 Supplementary Experiments
        A.5.1 Comparison between Stochastic and Deterministic Policies
        A.5.2 Comparison of Additive and Affine Transformations
        A.5.3 Influences of Parameterization in MaxEnt RL Actor-Critic Frameworks
        A.5.4 Modeling Multi-Modal Distributions using Flow-based Models
        A.5.5 Applying Learnable Reward Shifting to SAC
        A.5.6 MEow's Variant with an Actor-Critic Design
      A.6 Experimental Setups
        A.6.1 Model Architecture
        A.6.2 Experiments on the Multi-Goal Environment
        A.6.3 Experiments on the MuJoCo Environments
        A.6.4 Experiments on the Omniverse Isaac Gym Environments

