
Graduate Student: SAKKARIN KRARAT (薩克凌)
Thesis Title: Enhancing Autonomous Driving Using Double Critic and MCTS for Inverse Reinforcement Learning (使用雙評論和蒙特卡羅樹搜索於逆強化學習以增進自動駕駛)
Advisor: Soo, Von-Wun (蘇豐文)
Committee Members: Baghban, Hojjat; Shen, Chih-Ya (沈之涯)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Information Systems and Applications
Year of Publication: 2024
Graduation Academic Year: 112
Language: English
Number of Pages: 64
Keywords: Double Critic, Monte Carlo Tree Search, Inverse Reinforcement Learning, Autonomous Driving
Abstract (Chinese):
In real-world settings with continuous actions and states, inverse reinforcement learning (IRL) has become a pivotal method for training autonomous driving agents on complex problems. By drawing on human demonstrations, IRL helps the agent explore its environment and provides sample efficiency and adaptability in dynamic, uncertain contexts. Recent advances in IRL have demonstrated its potential to approach human-level reward performance. However, we recognize that realizing this potential in the autonomous driving domain remains challenging. Key among these challenges are the generation of suitable reward functions and the handling of intricate, expansive state and action spaces.

This master's thesis investigates the application of IRL to autonomous driving, aiming to contribute to the refinement of control algorithms for intricate driving scenarios. We focus on training a reinforcement learning agent that performs well in virtual environments, emphasizing the transfer of driving skills learned in simulation to real-world applications. We propose an AMIRL framework that integrates the concepts of maximum entropy, a double critic, and Monte Carlo tree search into IRL. The results show that the double critic reduces reward overestimation bias, Monte Carlo tree search enhances exploration of the action space, and the maximum entropy principle resolves the ambiguity in selecting a good policy in IRL. We use the AirSim environment to simulate autonomous driving scenarios, in which two driving test objectives are designed and simulated: 1. urban navigation and 2. obstacle avoidance. We examine the details of IRL and offer insights and solutions that bridge the gap between theoretical frameworks for autonomous vehicle control and practical, real-world performance.
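
For reference, the maximum entropy principle invoked above is usually formalized, following Ziebart et al.'s maximum entropy IRL, by assuming that expert trajectories become exponentially more likely as their cumulative reward increases; the exact parameterization used in this thesis may differ:

    P(\tau \mid \theta) = \frac{1}{Z(\theta)} \exp\Big( \sum_{t} r_\theta(s_t, a_t) \Big),
    \qquad
    Z(\theta) = \sum_{\tau'} \exp\Big( \sum_{t} r_\theta(s'_t, a'_t) \Big)

The reward parameters \theta are then fit by maximizing the log-likelihood of the expert demonstrations \mathcal{D}, i.e. \max_\theta \sum_{\tau \in \mathcal{D}} \log P(\tau \mid \theta), which commits to the maximum-entropy distribution among all behaviors consistent with the demonstrations and thereby removes the ambiguity noted above.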


Abstract:
Inverse Reinforcement Learning (IRL) emerges as a pivotal approach for training autonomous driving agents in complex tasks mirroring the continuous actions and states encountered in real-world scenarios. By harnessing human demonstrations, IRL incentivizes agents to thoroughly explore their environment, offering advantages in sample efficiency and adaptability to dynamic, uncertain contexts. Recent advancements in IRL have demonstrated its potential in approximating human-level reward achievements. Yet, realizing its full potential in the autonomous driving domain is not without challenges. Critical among these are the generation of suitable reward functions and the management of expansive state and action spaces.

This master thesis delves into the application of IRL within autonomous driving, aiming to contribute significantly to the advancement of control algorithms for intricate driving scenarios. We focus on training Reinforcement Learning (RL) agents that excel in virtual environments, emphasizing the transfer of learned autonomous driving skills from simulations to real-world applications. We propose an AMIRL framework by integrating the concepts of maximum entropy and double critics, as well as Monte Carlo Tree Search (MCTS), into inverse reinforcement learning. It turns out that the double critic method reduces reward overestimation bias, MCTS enhances exploration of the action space, and the maximum entropy principle resolves distributional ambiguities in choosing a good policy in IRL. We conduct simulation experiments on autonomous driving scenarios in the AirSim environment, where two kinds of driving test objectives are simulated: 1. urban navigation and 2. obstacle-avoidance navigation. Our study navigates through the nuances of IRL, offering insights and solutions that bridge the gap between theoretical frameworks and practical, real-world efficacy in autonomous vehicular control.
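
As an illustration of two of the mechanisms named above, the sketch below shows (i) a TD3-style clipped double-Q target, the standard way a pair of critics curbs overestimation bias, and (ii) the UCT score that drives exploration during the MCTS selection phase. It is a minimal sketch under the assumption that the thesis follows these standard formulations; all function and variable names are illustrative rather than taken from the thesis.

    import math
    import torch

    def double_critic_target(critic1_t, critic2_t, actor_t, reward, next_state, done,
                             gamma=0.99, noise_std=0.2, noise_clip=0.5):
        """TD3-style clipped double-Q target (all networks are illustrative placeholders)."""
        with torch.no_grad():
            # Target policy smoothing: perturb the target action with clipped noise.
            next_action = actor_t(next_state)
            noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
            next_action = (next_action + noise).clamp(-1.0, 1.0)

            # Clipped double Q-learning: bootstrapping from the smaller of the two
            # target-critic estimates counteracts reward overestimation bias.
            q1 = critic1_t(next_state, next_action)
            q2 = critic2_t(next_state, next_action)
            return reward + gamma * (1.0 - done) * torch.min(q1, q2)

    def uct_score(value_sum, visits, parent_visits, c_explore=1.4):
        """UCT selection score used in the MCTS selection phase."""
        if visits == 0:
            return float("inf")  # always try unvisited actions first
        # Exploitation (average value) plus an exploration bonus that favors
        # rarely visited actions, broadening exploration of the action space.
        return value_sum / visits + c_explore * math.sqrt(math.log(parent_visits) / visits)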

Table of Contents:
Abstract (Chinese)----------I
Abstract----------II
Acknowledgements----------IV
Contents----------V
List of Figures----------VIII
List of Tables----------XI
List of Algorithms----------XII
1 Introduction----------1
2 Background and Related Work----------8
2.1 Inverse Reinforcement Learning----------8
2.2 Actor-Critic----------9
2.3 Deep-Q Learning----------10
2.4 Policy Optimization----------11
2.5 Markov Decision Process----------13
2.6 Monte Carlo Tree Search----------13
2.7 Maximum Entropy Inverse Reinforcement Learning----------16
2.8 Actor-Critic with Experience Replay----------18
2.9 Twin Delayed DDPG----------20
3 Methodology----------22
3.1 Expert's Demonstration----------23
3.2 Monte Carlo Tree Search----------24
3.2.1 Selection Phase----------25
3.2.2 Expansion Phase----------25
3.2.3 Simulation Phase----------26
3.2.4 Backpropagation Phase----------26
3.3 Double Critic----------27
3.4 Maximum Entropy IRL----------28
4 Experiments----------32
4.1 Driving Scenarios----------32
4.1.1 Urban Navigation----------34
4.1.2 Obstacle Avoiding Navigation----------34
4.2 Evaluation----------36
4.2.1 The Evaluation Metrics----------36
4.2.2 The Baseline Model----------38
4.3 Main Result----------40
4.3.1 Double Critic Integrated MCTS versus TD3 Experiment----------40
4.3.2 Maximum Entropy IRL Experiment----------45
4.4 Comparison Experiments----------53
4.4.1 Reinforcement Learning vs. Inverse Reinforcement Learning----------54
4.4.2 Actor Critic vs. Double Critic----------56
5 Discussion and Conclusion----------59
Bibliography----------62

