Graduate Student: Sakkarin Krarat (薩克凌)
Thesis Title: Enhancing Autonomous Driving Using Double Critic and MCTS for Inverse Reinforcement Learning (使用雙評論和蒙特卡羅樹搜索於逆強化學習以增進自動駕駛)
Advisor: Soo, Von-Wun (蘇豐文)
Committee Members: Baghban, Hojjat; Shen, Chih-Ya (沈之涯)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Information Systems and Applications
Year of Publication: 2024
Academic Year: 112
Language: English
Pages: 64
Keywords (Chinese): 雙評論、蒙特卡羅樹搜索、逆強化學習、自動駕駛
Keywords (English): Double Critic, Monte Carlo Tree Search, Inverse Reinforcement Learning, Autonomous Driving
In real-world settings with continuous actions and states, inverse reinforcement learning (IRL) has become a key approach to the complex problem of training autonomous driving agents. By drawing on human demonstrations, IRL helps the agent explore its environment and offers sample efficiency and adaptability in dynamic, uncertain contexts. Recent advances in IRL have demonstrated its potential to approach human-level reward. Nevertheless, we recognize that realizing this potential in the autonomous driving domain remains challenging; chief among the challenges are generating an appropriate reward function and handling intricate, expansive state and action spaces.

This master's thesis investigates the application of IRL to autonomous driving, aiming to advance control algorithms for complex driving scenarios. We focus on training an agent that excels at reinforcement learning in virtual environments, emphasizing the transfer of autonomous driving skills learned in simulation to real-world applications. We propose an AMIRL framework that integrates maximum entropy, double critics, and Monte Carlo tree search (MCTS) into inverse reinforcement learning. The results show that the double critic reduces reward overestimation bias, MCTS strengthens exploration of the action space, and the maximum-entropy principle removes the ambiguity in selecting a good policy in IRL. We use the AirSim environment to simulate autonomous driving scenarios, in which two test objectives are designed: (1) urban navigation and (2) obstacle avoidance. We examine the details of IRL and offer insights and solutions for understanding and closing the gap between the theoretical framework of autonomous vehicle control and practical real-world performance.
Inverse Reinforcement Learning (IRL) emerges as a pivotal approach for training autonomous driving agents in complex tasks mirroring the continuous actions and states encountered in real-world scenarios. By harnessing human demonstrations, IRL incentivizes agents to thoroughly explore their environment, offering advantages in sample efficiency and adaptability to dynamic, uncertain contexts. Recent advancements in IRL have demonstrated its potential in approximating human-level reward achievements. Yet, realizing its full potential in the autonomous driving domain is not without challenges. Critical among these are the generation of suitable reward functions and the management of the intricacies of expansive state and action spaces.
This master's thesis delves into the application of IRL within autonomous driving, aiming to contribute significantly to the advancement of control algorithms for intricate driving scenarios. We focus on training Reinforcement Learning (RL) agents that excel in virtual environments, emphasizing the transition of learned autonomous driving skills from simulations to real-world applications. We propose an AMIRL framework by integrating the concepts of Maximum Entropy and Double Critics, as well as Monte Carlo Tree Search (MCTS), into inverse reinforcement learning. It turns out that the Double Critic method reduces reward overestimation bias, the MCTS method enhances exploration of the action space, while the Maximum Entropy principle resolves ambiguities in the distribution when choosing a good policy in Inverse Reinforcement Learning. We conduct simulation experiments on autonomous driving scenarios in the AirSim environment, where driving scenarios are simulated with two kinds of driving test objectives: 1. urban navigation and 2. obstacle-avoidance navigation. Our study navigates through the nuances of IRL, offering insights and solutions that bridge the gap between theoretical frameworks and practical, real-world efficacy in autonomous vehicular control.
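The reduction of overestimation bias attributed to the Double Critic can be illustrated with a clipped double-Q target in the style of TD3/SAC, where the Bellman backup uses the smaller of two critic estimates. The sketch below is a minimal, hypothetical PyTorch example under assumed names (`critic_1`, `critic_2`, `double_critic_target`); it is not the thesis's AMIRL implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch: two critics score the same (state, action) pair and the
# Bellman target uses the smaller estimate, which dampens the upward bias
# that a single bootstrapped critic tends to accumulate.

def build_critic(state_dim: int, action_dim: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
        nn.Linear(256, 1),
    )

def double_critic_target(critic_1, critic_2, reward, next_state, next_action,
                         done, gamma=0.99):
    """Compute r + gamma * min(Q1', Q2') without tracking gradients."""
    with torch.no_grad():
        sa = torch.cat([next_state, next_action], dim=-1)
        q1, q2 = critic_1(sa), critic_2(sa)
        target_q = torch.min(q1, q2)  # pessimistic of the two estimates
        return reward + gamma * (1.0 - done) * target_q
```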
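The exploration benefit of MCTS comes from selection rules such as UCT, which trade off a child node's average value against how rarely it has been visited. The following self-contained Python sketch of a UCT selection step uses an illustrative `Node` structure and an exploration constant of our own choosing, not the search used in the thesis.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """Illustrative MCTS node: cumulative value and visit count per child."""
    visits: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)  # action -> Node

def uct_select(node: Node, c: float = 1.4):
    """Pick the child maximizing UCT = mean value + c * sqrt(ln N / n).

    Assumes the node has already been expanded (children is non-empty).
    """
    log_parent = math.log(max(node.visits, 1))

    def uct(child: Node) -> float:
        if child.visits == 0:
            return float("inf")  # force unvisited actions to be tried first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(log_parent / child.visits)
        return exploit + explore

    return max(node.children.items(), key=lambda kv: uct(kv[1]))
```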
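The role attributed to the Maximum Entropy principle can be made concrete with the standard maximum-entropy IRL model (Ziebart et al., 2008), in which a trajectory τ is observed with probability proportional to the exponential of its cumulative reward R_θ(τ); this is background notation, not the thesis's exact formulation.

```latex
% Maximum-entropy IRL: trajectories are weighted by exponentiated reward,
% so among the many reward functions consistent with the demonstrations,
% the least-committal (highest-entropy) trajectory distribution is preferred.
P(\tau \mid \theta) \;=\; \frac{\exp\!\big(R_\theta(\tau)\big)}{Z(\theta)},
\qquad
Z(\theta) \;=\; \int \exp\!\big(R_\theta(\tau)\big)\, \mathrm{d}\tau
```

Fitting θ by maximizing the likelihood of the expert demonstrations matches their expected behavior while committing to no preference beyond what the data support, which is why many otherwise equally consistent policies collapse to a single well-defined choice.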