Student: 江柏翰 (Chiang, Po-Han)
Thesis Title: Mixture of Step Returns in Bootstrapped DQN (運用混和式步數返還值在自助型深度Q網路)
Advisor: 李濬屹 (Lee, Chun-Yi)
Committee Members: 周志遠, 胡敏君
Degree: Master (碩士)
Department: Department of Computer Science, College of Electrical Engineering and Computer Science (電機資訊學院 - 資訊工程學系)
Year of Publication: 2020
Graduation Academic Year: 108
Language: English
Number of Pages: 39
Keywords (Chinese): 增強式學習
Keywords (English): Reinforcement Learning
Multi-step returns of various lengths have been widely used in deep reinforcement learning. Different backup lengths offer different benefits for value function updates, including their influence on the bias and variance of value estimates, the convergence speed, and the exploration behavior of the agent. Conventional methods such as TD($\lambda$) combine step returns of different lengths through an exponential average and use the averaged result as the target for value updates. However, merging them into a single target sacrifices the individual advantages of each return.
To address this issue, we propose Mixture Bootstrapped DQN (MB-DQN). MB-DQN builds on the architecture of Bootstrapped DQN and assigns a distinct backup length to each bootstrapped head. MB-DQN preserves the heterogeneity among target values, which methods relying on a single target cannot achieve. We first illustrate our motivation in a maze environment, then train MB-DQN on Atari games to demonstrate its improvement over the original approach, and further analyze different configurations of MB-DQN.
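For reference, the n-step return and the exponentially averaged target used by TD($\lambda$) mentioned above can be written as follows. This is the standard textbook formulation (following Sutton and Barto), not notation copied from the thesis; $G_t^{(n)}$ denotes the n-step return, $R$ the rewards, $\gamma$ the discount factor, and $V$ the value estimate:

$$
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n} V(S_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)}.
$$

TD($\lambda$) collapses all $G_t^{(n)}$ into the single target $G_t^{\lambda}$; avoiding this loss of heterogeneity by keeping one backup length per head is the core idea behind MB-DQN.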
The concept of utilizing multi-step returns for updating value functions has been adopted in deep reinforcement learning (DRL) for a number of years. Updating value functions with different backup lengths provides advantages in different aspects, including the bias and variance of value estimates, convergence speed, and the exploration behavior of the agent. Conventional methods such as TD($\lambda$) leverage these advantages by using a target value equivalent to an exponential average of different step returns. Nevertheless, integrating step returns into a single target sacrifices the diversity of the advantages offered by the individual step return targets. To address this issue, we propose Mixture Bootstrapped DQN (MB-DQN), which is built on top of bootstrapped DQN and uses different backup lengths for different bootstrapped heads. MB-DQN enables a heterogeneity of target values that is unavailable in approaches relying on only a single target value, and it is therefore able to maintain the advantages offered by different backup lengths. In this paper, we first discuss the motivational insights through a simple maze environment. To validate the effectiveness of MB-DQN, we perform experiments on the Atari 2600 benchmark and demonstrate the performance improvement of MB-DQN over several baseline methods. We further provide ablation studies to examine the impacts of different design configurations of MB-DQN.
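The following minimal Python sketch illustrates the idea of assigning a distinct backup length to each bootstrapped head. It assumes the bootstrap value for each head has already been evaluated at the appropriate state; all function and variable names (`n_step_target`, `mixture_targets`, `backup_lengths`) are illustrative and not taken from the thesis implementation.

```python
def n_step_target(rewards, bootstrap_value, gamma, n):
    """n-step return target:
    G = r_0 + gamma*r_1 + ... + gamma^(n-1)*r_{n-1} + gamma^n * V(s_n)."""
    target = bootstrap_value
    for r in reversed(rewards[:n]):
        target = r + gamma * target
    return target

def mixture_targets(rewards, head_bootstrap_values, gamma, backup_lengths):
    """Per-head targets in the spirit of MB-DQN: head k is updated toward
    its own n_k-step return instead of a single shared target.

    rewards:               rewards r_t, ..., r_{t+N-1} along the sampled trajectory
    head_bootstrap_values: head_bootstrap_values[k] is head k's value estimate
                           (e.g., from its target network) at state s_{t+n_k}
    backup_lengths:        backup length n_k for each head, e.g. [1, 3, 5]
    """
    return [
        n_step_target(rewards, head_bootstrap_values[k], gamma, n_k)
        for k, n_k in enumerate(backup_lengths)
    ]

# Example: three heads with backup lengths 1, 3, and 5 on a 5-step trajectory.
rewards = [0.0, 0.0, 1.0, 0.0, 0.5]
head_values = [0.2, 0.4, 0.1]  # bootstrap values evaluated at s_{t+1}, s_{t+3}, s_{t+5}
print(mixture_targets(rewards, head_values, gamma=0.99, backup_lengths=[1, 3, 5]))
```

Each head's TD error is then computed against its own target, so the short-backup heads retain low-variance updates while the long-backup heads propagate rewards faster, rather than blending both properties into one averaged target.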