
Student: 馮謙 (Feng, Chien)
Thesis title: 基於能量正規化流的最大熵強化學習 (Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow)
Advisor: 李濬屹 (Lee, Chun-Yi)
Committee members: 周志遠 (Chou, Jerry); 陳奕廷 (Chen, Yi-Ting)
Degree: Master
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 49
Keywords (Chinese): 強化學習 (Reinforcement Learning), 正規化流 (Normalizing Flow)
Keywords (English): Reinforcement Learning, Normalizing Flow
    Abstract (Chinese): Existing maximum-entropy (MaxEnt) reinforcement learning (RL) methods for
    continuous action spaces are typically built on the actor-critic framework and optimized by
    alternating policy evaluation and policy improvement steps. In the policy evaluation step, the
    critic is updated to capture the soft Q-function. In the policy improvement step, the actor is
    adjusted according to the updated soft Q-function. In this paper, we introduce a new MaxEnt RL
    framework modeled with Energy-Based Normalizing Flows (EBFlow). This framework integrates the
    policy evaluation and policy improvement steps into a single-objective training process. Our
    method computes the soft value function used in the policy evaluation target without Monte
    Carlo approximation. In addition, this design supports modeling multi-modal action distributions
    while enabling efficient action sampling. To evaluate the performance of our method, we
    conducted experiments on the MuJoCo benchmark suite and several high-dimensional robotic tasks
    simulated in Omniverse Isaac Gym. The evaluation results show that our method outperforms
    widely adopted representative baselines.
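
    For context, the "soft Q-function" and "soft value function" mentioned above are standard
    objects in the MaxEnt RL literature (e.g., soft Q-learning and soft actor-critic); the equations
    below restate those standard definitions and are not reproduced from the thesis itself. The
    entropy-regularized objective and the soft Bellman quantities are usually written as

    \pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_{\pi}}
      \big[ r(\mathbf{s}_t, \mathbf{a}_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid \mathbf{s}_t)\big) \big],

    Q_{\mathrm{soft}}(\mathbf{s}, \mathbf{a}) = r(\mathbf{s}, \mathbf{a})
      + \gamma \, \mathbb{E}_{\mathbf{s}'} \big[ V_{\mathrm{soft}}(\mathbf{s}') \big],
    \qquad
    V_{\mathrm{soft}}(\mathbf{s}) = \alpha \log \int_{\mathcal{A}}
      \exp\!\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(\mathbf{s}, \mathbf{a}) \Big) \, d\mathbf{a}.

    The log-partition integral in V_soft is the term that methods such as SQL and SAC estimate with
    Monte Carlo samples; the claim above is that parameterizing the policy with an EBFlow makes this
    quantity available in closed form.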


    Abstract: Existing Maximum-Entropy (MaxEnt) Reinforcement Learning (RL) methods
    for continuous action spaces are typically formulated based on actor-critic frameworks and optimized through alternating steps of policy evaluation and policy
    improvement. In the policy evaluation steps, the critic is updated to capture the soft
    Q-function. In the policy improvement steps, the actor is adjusted in accordance
    with the updated soft Q-function. In this paper, we introduce a new MaxEnt
    RL framework modeled using Energy-Based Normalizing Flows (EBFlow). This
    framework integrates the policy evaluation steps and the policy improvement steps,
    resulting in a single-objective training process. Our method enables the calculation
    of the soft value function used in the policy evaluation target without Monte
    Carlo approximation. Moreover, this design supports the modeling of multi-modal
    action distributions while facilitating efficient action sampling. To evaluate the
    performance of our method, we conducted experiments on the MuJoCo benchmark
    suite and a number of high-dimensional robotic tasks simulated by Omniverse
    Isaac Gym. The evaluation results demonstrate that our method achieves superior
    performance compared to widely adopted representative baselines.
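
    To make the last point concrete, below is a minimal, hypothetical PyTorch sketch of a
    state-conditioned normalizing-flow policy built from affine coupling layers. It illustrates only
    the generic flow mechanism the abstract relies on: actions are drawn with a single forward pass,
    and the exact log-density log pi(a|s) comes from the change-of-variables formula, so no Monte
    Carlo estimate of the density is needed. It is not the MEow/EBFlow architecture from the thesis,
    and every class, function, and hyperparameter name here is illustrative.

# Minimal sketch of a state-conditioned affine-coupling flow policy (illustrative only;
# not the thesis's MEow/EBFlow model). Real flows also permute dimensions between
# layers; that detail is omitted here for brevity.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """Transforms the second half of the vector, conditioned on the first half and the state."""

    def __init__(self, action_dim, state_dim, hidden=64):
        super().__init__()
        self.half = action_dim // 2
        out = action_dim - self.half
        self.net = nn.Sequential(
            nn.Linear(self.half + state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * out),
        )

    def forward(self, z, state):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, state], dim=-1)).chunk(2, dim=-1)
        scale = torch.tanh(scale)                      # keeps the Jacobian well conditioned
        a2 = z2 * torch.exp(scale) + shift
        return torch.cat([z1, a2], dim=-1), scale.sum(dim=-1)   # output, log|det J|

    def inverse(self, a, state):
        a1, a2 = a[:, :self.half], a[:, self.half:]
        scale, shift = self.net(torch.cat([a1, state], dim=-1)).chunk(2, dim=-1)
        scale = torch.tanh(scale)
        z2 = (a2 - shift) * torch.exp(-scale)
        return torch.cat([a1, z2], dim=-1), -scale.sum(dim=-1)


class FlowPolicy(nn.Module):
    """pi(a|s): a standard normal base distribution pushed through coupling layers."""

    def __init__(self, action_dim, state_dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(action_dim, state_dim) for _ in range(n_layers)]
        )
        self.base = torch.distributions.Normal(0.0, 1.0)
        self.action_dim = action_dim

    def sample(self, state):
        # Efficient sampling: one pass through the layers, no iterative procedure.
        z = self.base.sample((state.shape[0], self.action_dim))
        log_prob = self.base.log_prob(z).sum(dim=-1)
        for layer in self.layers:
            z, log_det = layer(z, state)
            log_prob = log_prob - log_det              # change-of-variables correction
        return z, log_prob                             # action and its exact log pi(a|s)

    def log_prob(self, action, state):
        # Exact density of an arbitrary action via the inverse pass (no sampling involved).
        log_det_total = torch.zeros(action.shape[0])
        for layer in reversed(self.layers):
            action, log_det = layer.inverse(action, state)
            log_det_total = log_det_total + log_det
        return self.base.log_prob(action).sum(dim=-1) + log_det_total


if __name__ == "__main__":
    policy = FlowPolicy(action_dim=4, state_dim=8)
    s = torch.randn(16, 8)
    a, logp = policy.sample(s)
    # The two density evaluations agree (up to float error): the log-density is exact.
    print(float((logp - policy.log_prob(a, s)).abs().max()))

    With enough coupling layers, such a conditional flow can also represent multi-modal action
    distributions, which is the modeling property the abstract highlights.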

    Contents
    Acknowledgements (Chinese)
    Abstract (Chinese)
    Abstract
    Contents
    1 Introduction
    2 Background and Related Works
      2.1 Maximum Entropy Reinforcement Learning
      2.2 Actor-Critic Frameworks and Soft Value Estimation in MaxEnt RL
      2.3 Energy-Based Normalizing Flows
    3 Methodology
      3.1 MaxEnt RL via EBFlow
      3.2 Techniques for Improving the Training Process of MEow
      3.3 Algorithm Summary
    4 Experiments
      4.1 Evaluation on a Multi-Goal Environment
      4.2 Performance Comparison on the MuJoCo Environments
      4.3 Performance Comparison on the Omniverse Isaac Gym Environments
      4.4 Ablation Analysis
    5 Conclusion
    References
    A Appendix
      A.1 The Soft Value Estimation Methods in SAC and SQL
      A.2 Properties of MEow and Learnable Reward Shifting
      A.3 Inference using a Deterministic Policy in MEow
      A.4 The Issue of Numerical Instability
      A.5 Supplementary Experiments
        A.5.1 Comparison between Stochastic and Deterministic Policies
        A.5.2 Comparison of Additive and Affine Transformations
        A.5.3 Influences of Parameterization in MaxEnt RL Actor-Critic Frameworks
        A.5.4 Modeling Multi-Modal Distributions using Flow-based Models
        A.5.5 Applying Learnable Reward Shifting to SAC
        A.5.6 MEow's Variant with an Actor-Critic Design
      A.6 Experimental Setups
        A.6.1 Model Architecture
        A.6.2 Experiments on the Multi-Goal Environment
        A.6.3 Experiments on the MuJoCo Environments
        A.6.4 Experiments on the Omniverse Isaac Gym Environments

