
Graduate Student: 江家丞 (Chiang, Chia-Cheng)
Thesis Title: Expert Proximity as Surrogate Rewards for Single Demonstration Imitation Learning
(Chinese title: 基於專家鄰近性獎勵函數的單一示範模仿學習)
Advisor: 李濬屹 (Lee, Chun-Yi)
Oral Defense Committee: 吳廸融 (Wu, Ti-Rong), 謝秉均 (Hsieh, Ping-Chun)
Degree: Master
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Academic Year of Graduation: 112 (ROC calendar)
Language: English
Number of Pages: 55
Chinese Keywords: 模仿學習、深度增強式學習 (Imitation Learning, Deep Reinforcement Learning)
English Keywords: Imitation Learning, Deep Reinforcement Learning
Abstract (Chinese, translated): This study investigates the challenging single-demonstration imitation learning (IL) setting, in which a reinforcement learning agent learns solely from a single expert demonstration and operates in an environment that provides no external reward signal, human feedback, or prior analogous knowledge, since obtaining multiple demonstrations or engineering complex reward functions is often infeasible. Under these constraints, the study introduces Transition Discriminator-based IL (TDIL), which aims to increase the density of the available reward signal and improve training performance by incorporating environmental dynamics. The core idea is that the agent should first strive to reach states close to expert behavior before following the limited expert demonstration. To realize this, the study introduces a surrogate reward function approximated by a transition discriminator. TDIL shows promise in addressing IL from sparse demonstrations. A comprehensive set of experiments against multiple baseline algorithms validates the effectiveness of TDIL over existing IL methods.


Abstract (English): This study investigates the challenging single-demonstration imitation learning (IL) setting. In this context, the learning agent relies solely on a single expert demonstration and operates in an environment that lacks external reward signals, human feedback, or prior analogous knowledge, as obtaining multiple demonstrations or engineering complex reward functions is often infeasible. Given these constraints, the study introduces a methodology termed Transition Discriminator-based IL (TDIL). TDIL aims to augment the density of available reward signals and enhance agent performance by incorporating environmental dynamics. It posits that rather than strictly adhering to a limited expert demonstration, the agent should first aim to reach states proximal to expert behavior. The study introduces a surrogate reward function, approximated by a transition discriminator, to facilitate this process. TDIL demonstrates promise in addressing the sparse-reward problem common in single-demonstration IL and in stabilizing the learning process of the agent during training. A comprehensive set of experiments across multiple benchmarks validates the effectiveness of TDIL over existing IL methods.
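The abstract describes TDIL's key mechanism only at a high level: a transition discriminator is trained to recognize expert-like transitions, and its output is reused as a dense surrogate reward for the agent. The sketch below illustrates one plausible instantiation of that idea in PyTorch; the class and function names (TransitionDiscriminator, surrogate_reward, discriminator_loss), the (state, action, next_state) input format, the network sizes, and the GAIL-style log-probability reward are illustrative assumptions, not the exact formulation used in the thesis.

```python
import torch
import torch.nn as nn


class TransitionDiscriminator(nn.Module):
    """Binary classifier over transitions: expert-like (1) vs. agent-generated (0).
    Architecture and input format are illustrative assumptions."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # logit of P(expert-like | s, a, s')
        )

    def forward(self, s, a, s_next):
        return self.net(torch.cat([s, a, s_next], dim=-1))


def discriminator_loss(disc, expert_batch, agent_batch):
    """Binary cross-entropy: expert transitions labeled 1, agent transitions 0.
    Each batch is a (s, a, s_next) tuple of tensors."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(*expert_batch)
    agent_logits = disc(*agent_batch)
    return (bce(expert_logits, torch.ones_like(expert_logits))
            + bce(agent_logits, torch.zeros_like(agent_logits)))


def surrogate_reward(disc, s, a, s_next):
    """Dense reward from the discriminator's confidence that a transition leads
    toward expert-proximal states (log-probability form, one common choice)."""
    with torch.no_grad():
        p = torch.sigmoid(disc(s, a, s_next))
    return torch.log(p + 1e-8)
```

Because a discriminator of this kind scores every environment transition, the agent receives feedback at each step instead of only when it exactly reproduces the single demonstration, which is how such a surrogate reward densifies an otherwise sparse learning signal.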

Table of Contents:
Abstract (Chinese)
Acknowledgements (Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Preliminary
3 Methodology
  3.1 Motivational Example: IL Under Single Demonstration Constraint
  3.2 Expert Proximity and Surrogate Rewards
  3.3 TDIL Training Algorithm
4 Related Work
5 Experimental Results
  5.1 Experimental Setup
  5.2 MuJoCo Results
  5.3 Ablation Study
  5.4 Blind Model Selection
6 Conclusion
Bibliography
A1 Appendix
A2 Extended Review of Related Work
A3 Algorithm and Training Details
  A3.1 Practical Algorithm
  A3.2 Training Stabilization
  A3.3 Algorithm Details
  A3.4 Color Coding in Fig. 1.1
A4 Experimental Setups
  A4.1 Model Architecture of TDIL
  A4.2 Code Implementation and Hardware Configuration
A5 Additional Experiments
  A5.1 Training Curves
  A5.2 Performance Comparison between TDIL and Baselines with BC Loss
  A5.3 Experiments in the Adroit Hand Environment
  A5.4 Exploring Relative Rewards for Blind Model Selection
  A5.5 Blind Model Selection Experiments
  A5.6 An Analysis of the Accuracy of the Transition Discriminator
  A5.7 Sensitivity Analysis on the Hyper-parameter α
  A5.8 Experimental Results on Different Choices of Hyper-parameter β
A6 Multi-Step Expert Proximity

