| Author: | 張以詳 Chang, Yi-Hsiang |
|---|---|
| Thesis title: | 巨集行動策略之建構方法 Construction of Macro Actions for Deep Reinforcement Learning |
| Advisor: | 李濬屹 Lee, Chun-Yi |
| Committee members: | 周志遠 Chou, Jerry; 黃稚存 Huang, Chih-Tsun |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
| Year of publication: | 2019 |
| Academic year of graduation: | 108 (ROC calendar) |
| Language: | English |
| Number of pages: | 29 |
| Keywords (Chinese): | 巨集行動, 遺傳演算法, 深度強化學習 |
| Keywords (English): | Macro Action, Genetic Algorithm, Deep Reinforcement Learning |
Conventional deep reinforcement learning usually requires deciding on a primitive action at every timestep, which costs a great deal of time and effort to learn an effective policy, and this becomes even more pronounced in large, complex environments. We address this problem with macro actions: a macro action is a sequence of primitive actions, and combining macro actions with the primitive action space yields an augmented action space. The question then becomes how to find suitable macro actions with which to augment the original primitive action space. An agent using an augmented action space can jump to a more distant state in a single decision, which speeds up both exploration and learning.

In previous studies, macro actions are usually derived from the action sequences that an earlier agent used most frequently, or from repeating a primitive action several times. Because the most frequent action sequences are produced by an old agent, they may merely reinforce that agent's existing behavior; repeating a sequence of primitive actions, on the other hand, may limit the diversity of the agent's behavior.

We instead propose constructing macro actions with a genetic algorithm, which avoids relying on old policies. Our method appends one macro action at a time to the primitive action space and evaluates whether that macro action improves performance. We conduct extensive experiments showing that the macro actions we construct can accelerate several deep reinforcement learning processes. The experiments also show that these macro actions perform well across different reinforcement learning methods and in similar environments. Finally, we provide a detailed ablation study to verify that the proposed method is indeed effective.
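The abstract treats a macro action as a fixed sequence of primitive actions added to the primitive action space. As an informal illustration of how such an augmented action space can be executed (a minimal sketch under assumed names, not the thesis' actual code), a wrapper for the classic four-tuple OpenAI Gym step API might look like this:

```python
import gym


class MacroActionWrapper(gym.Wrapper):
    """Augment a discrete action space with macro actions.

    A macro action is a fixed sequence of primitive actions; choosing its
    index executes the whole sequence and returns the accumulated reward.
    Illustrative sketch only; assumes a Discrete action space and the
    classic (obs, reward, done, info) gym step API.
    """

    def __init__(self, env, macros):
        super().__init__(env)
        self.macros = macros                       # e.g. [[2, 2, 3], [0, 1]]
        self.n_primitive = env.action_space.n
        # Augmented action space: primitives first, then one slot per macro.
        self.action_space = gym.spaces.Discrete(self.n_primitive + len(macros))

    def step(self, action):
        if action < self.n_primitive:              # plain primitive action
            return self.env.step(action)
        sequence = self.macros[action - self.n_primitive]
        total_reward, done, info = 0.0, False, {}
        for a in sequence:                         # roll out the macro
            obs, reward, done, info = self.env.step(a)
            total_reward += reward
            if done:                               # stop early if the episode ends
                break
        return obs, total_reward, done, info
```

From the agent's point of view nothing changes except that the discrete action space is larger; each extra index commits the agent to several environment steps at once.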
Conventional deep reinforcement learning typically determines an appropriate primitive action at each timestep, which requires an enormous amount of time and effort to learn an effective policy, especially in large and complex environments. To deal with this issue fundamentally, we incorporate macro actions, defined as sequences of primitive actions, into the primitive action space to form an augmented action space. The problem then lies in how to find appropriate macro actions with which to augment the primitive action space. An agent using a properly augmented action space is able to jump to farther states and thus speed up exploration as well as facilitate learning. In previous research, macro actions are developed by mining the most frequently used action sequences or by repeating previous actions. However, the most frequently used action sequences are extracted from a past policy and may therefore only reinforce the original behavior of that policy. Repeating actions, on the other hand, may limit the diversity of the agent's behavior. Instead, we propose to construct macro actions with a genetic algorithm, which removes the dependence of the macro action derivation procedure on the agent's past policies. Our approach appends one macro action at a time to the primitive action space and evaluates whether the augmented action space leads to promising performance. We perform extensive experiments and show that the constructed macro actions are able to speed up the learning process for a variety of deep reinforcement learning methods. Our experimental results also demonstrate that the macro actions suggested by our approach are transferable among deep reinforcement learning methods and similar environments. We further provide a comprehensive set of ablation analyses to validate our methodology.
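The construction procedure described in the abstract, appending one candidate macro action to the primitive action space and checking whether the augmented space yields promising performance, can be pictured as a genetic algorithm over action sequences. The sketch below is an assumption-laden illustration rather than the thesis' exact algorithm: `evaluate(macro)` is a hypothetical fitness function assumed to train (or partially train) an agent on the macro-augmented environment and return a score such as its average episode return.

```python
import random


def evolve_macro(evaluate, n_primitive, population_size=8, generations=10,
                 init_len=3, max_len=8):
    """Search for a single macro action with a simple genetic algorithm.

    Illustrative sketch: candidates are lists of primitive-action indices,
    fitness comes from the user-supplied `evaluate` function, and evolution
    uses mutation plus truncation selection.
    """
    def random_macro(length):
        return [random.randrange(n_primitive) for _ in range(length)]

    def mutate(macro):
        child = list(macro)
        op = random.choice(("replace", "insert", "delete"))
        if op == "replace":
            child[random.randrange(len(child))] = random.randrange(n_primitive)
        elif op == "insert" and len(child) < max_len:
            child.insert(random.randrange(len(child) + 1),
                         random.randrange(n_primitive))
        elif op == "delete" and len(child) > 1:
            del child[random.randrange(len(child))]
        return child

    population = [random_macro(init_len) for _ in range(population_size)]
    for _ in range(generations):
        # Rank candidates by fitness; in practice scores should be cached and
        # averaged over several seeds, since RL returns are noisy.
        ranked = sorted(population, key=evaluate, reverse=True)
        parents = ranked[: population_size // 2]        # truncation selection
        children = [mutate(random.choice(parents)) for _ in parents]
        population = parents + children
    return max(population, key=evaluate)
```

In practice, `evaluate` could wrap the environment with something like the `MacroActionWrapper` sketched above, run an off-the-shelf agent (for example PPO or A2C) for a short training budget, and return the mean episode reward; the best surviving sequence is then the macro action appended to the primitive action space.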