
Author: Lai, Chien-Chun (賴建鈞)
Title (Chinese): 增強技能多樣性以實現快速任務推斷:基於後繼特徵的對比學習之無監督技能發現
Title (English): Enhancing skill diversity for fast task inference: A successor-feature-based contrastive learning approach for unsupervised skill discovery
Advisor: King, Chung-Ta (金仲達)
Committee Members: Chu, Hung-Kuo (朱宏國); Chuang, Jen-Hui (莊仁輝)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of Publication: 2023
Graduation Academic Year: 112
Language: English
Number of Pages: 32
Keywords (Chinese): 強化學習、無監督技能發現、後繼特徵、技能多樣性、噪聲對比估計、快速任務推斷
Keywords (English): Reinforcement learning, unsupervised skill discovery, successor features, skill diversity, noise contrastive estimation, fast task inference
    Abstract (Chinese):
    Unsupervised reinforcement learning acquires knowledge of an environment by exploring
    it, and this prior knowledge can then be applied to learn downstream tasks efficiently.
    Among the many approaches, one line of work additionally learns skills that abstract the
    acquired knowledge and thereby help the agent learn tasks. If, during unsupervised
    learning, we can learn a variety of skills that produce diverse behaviors, then when
    learning a downstream task we can select suitable skills to reach the target task quickly.
    How to explore the environment so as to learn diverse behaviors is therefore the key
    factor in unsupervised skill learning. A recent work, APS [1], proposes using the
    successor feature (SF) framework, which lies between model-based and model-free methods,
    and combines state entropy into a lower bound on the mutual information maximized for
    unsupervised skill learning. Without an additional model, it captures the environment's
    state transitions, so that information gained during exploration is not forgotten and
    skills can be quickly applied to downstream tasks. However, its skills are not learned
    with sufficient diversity, so its performance on downstream tasks is poor. We therefore
    propose a maximum-mutual-information unsupervised skill learning method that uses noise
    contrastive estimation as the objective for learning skill and state features. On top of
    this, we find that, ideally, noise contrastive estimation exactly satisfies the linearity
    constraint of SFs, and that even in the presence of small noise it still performs well in
    high-dimensional environments. Experimental results show that on locomotion and robotic
    arm manipulation tasks, our method outperforms the state-of-the-art unsupervised skill
    learning method CIC by 7%, and compared with APS, which also uses the SF framework, the
    performance gain is as high as 83%.
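    To make the stated objective concrete, the following is a rough sketch, in our own
    notation, of how a noise-contrastive bound can serve as the learning objective for the
    mutual information I(s; z) between states s and skills z; the exact formulation in the
    thesis may differ. With a learned state-feature encoder φ and one positive skill z among
    N sampled skills z_1, ..., z_N, the InfoNCE bound of [28] gives

        I(s; z) \;\ge\; \log N \;+\;
        \mathbb{E}\!\left[ \log
          \frac{\exp\!\big(\phi(s)^{\top} z\big)}
               {\sum_{i=1}^{N} \exp\!\big(\phi(s)^{\top} z_i\big)} \right].

    A state-entropy term H(s) can be maximized separately with a particle-based estimator as
    in [2]. Because the resulting intrinsic reward takes the linear form r(s, z) = φ(s)ᵀz,
    it is directly compatible with the successor-feature assumption that rewards are linear
    in the state features.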


    Abstract (English):
    Learning diverse primitive skills to efficiently solve novel tasks is the essence of
    unsupervised skill discovery. One general strategy is to specify a learning objective
    that maximizes the mutual information (MI) between skills and states, so that the learned
    skills are distinct and can collectively explore large parts of the state space,
    facilitating the learning of novel downstream tasks. Unfortunately, since this learning
    objective is intractable, most previous works resort to strong assumptions or
    approximations to make it tractable, which results in poor state coverage by the
    discovered skills due to insufficient exploration of the environment. A recent work,
    Active Pretraining with Successor Features (APS) [1], proposes a novel lower bound of the
    MI by combining state entropy maximization [2] with variational successor features [3] to
    increase state exploration and speed up learning of downstream tasks. Although APS
    increases environmental exploration, its variational approximation falls short in
    learning distinguishable skills, leading to poor diversity and short-sighted behavior. To
    address this limitation, we propose a successor-feature-based contrastive learning
    approach for unsupervised skill discovery that uses noise contrastive estimation as a
    lower bound of the intractable conditional entropy for learning distinctive skills and
    state features. With the learned representation, behavioral diversity is enhanced because
    each skill reaches its own distinct region of the state space. The proposed approach is
    evaluated, along with other baselines, on locomotion and manipulation tasks from the
    Unsupervised Reinforcement Learning Benchmark (URLB) [4]. Experimental results show that
    our method outperforms previous unsupervised skill discovery methods such as CIC by an
    average of 7%. Furthermore, compared with APS, which also uses the SF framework, our
    method achieves an average 83% boost in performance.
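    As a concrete illustration of the contrastive objective described above, below is a
    minimal PyTorch-style sketch of an InfoNCE-style loss between state features φ(s) and
    skill vectors z, in which the other skills of a minibatch act as negatives. The function
    and variable names are ours and are not taken from the thesis' implementation.

        import torch
        import torch.nn.functional as F

        def nce_skill_loss(state_feats, skills, temperature=0.1):
            # state_feats: (B, D) outputs of the state-feature encoder phi(s).
            # skills:      (B, D) skill vectors z, paired row-wise with the states.
            phi = F.normalize(state_feats, dim=-1)
            z = F.normalize(skills, dim=-1)
            logits = phi @ z.t() / temperature          # (B, B) similarity matrix
            labels = torch.arange(phi.size(0), device=phi.device)
            # Row i's positive is its own skill z_i; every other skill in the
            # batch serves as a negative sample for the contrastive estimate.
            return F.cross_entropy(logits, labels)

    Minimizing this cross-entropy over in-batch negatives maximizes a lower bound on the
    mutual information between skills and states; the learned φ(s) can then serve as the
    feature map of a successor-feature agent whose intrinsic reward is linear in φ(s).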

    Contents
    Abstract (Chinese)
    Abstract
    Contents
    1 Introduction
    2 Preliminaries and Related Works
      2.1 Markov Decision Process
      2.2 Unsupervised Skill Discovery
      2.3 Successor Features
    3 Proposed Method
      3.1 Contrastive Variational Approximation
      3.2 State Exploration
      3.3 Policy Learning
    4 Experiment
      4.1 Experimental Setup
        4.1.1 Environment
        4.1.2 Baseline
        4.1.3 Implementation Details
      4.2 Visualization of Learned Skills
      4.3 Discriminability between Skills and States
      4.4 Effects of Using Exploration Skills
      4.5 Performance Comparison on URLB
    5 Conclusion and Future Work
    Bibliography

    Bibliography
    [1] H. Liu and P. Abbeel, “APS: Active pretraining with successor features,” in
    International Conference on Machine Learning, pp. 6736–6747, PMLR, 2021.
    [2] H. Liu and P. Abbeel, “Behavior from the void: Unsupervised active pretraining,” Advances in Neural Information Processing Systems, vol. 34,
    pp. 18459–18473, 2021.
    [3] S. Hansen, W. Dabney, A. Barreto, T. Van de Wiele, D. Warde-Farley, and
    V. Mnih, “Fast task inference with variational intrinsic successor features,”
    arXiv preprint arXiv:1906.05030, 2019.
    [4] M. Laskin, D. Yarats, H. Liu, K. Lee, A. Zhan, K. Lu, C. Cang, L. Pinto, and
    P. Abbeel, “URLB: Unsupervised reinforcement learning benchmark,” arXiv
    preprint arXiv:2110.15191, 2021.
    [5] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” arXiv preprint
    arXiv:1704.02532, 2017.
    [6] X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne, “DeepMimic:
    Example-guided deep reinforcement learning of physics-based character
    skills,” ACM Transactions On Graphics (TOG), vol. 37, no. 4, pp. 1–14,
    2018.
    [7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra,
    and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv
    preprint arXiv:1312.5602, 2013.
    [8] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in International Conference on Machine
    Learning, pp. 2778–2787, PMLR, 2017.
    [9] S. Park, J. Choi, J. Kim, H. Lee, and G. Kim, “Lipschitz-constrained unsupervised skill discovery,” in International Conference on Learning Representations, 2021.
    [10] M. Laskin, H. Liu, X. B. Peng, D. Yarats, A. Rajeswaran, and P. Abbeel,
    “Unsupervised reinforcement learning with contrastive intrinsic control,” Advances in Neural Information Processing Systems, vol. 35, pp. 34478–34491,
    2022.
    [11] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
    [12] D. Barber and F. Agakov, “The IM algorithm: a variational approach to information maximization,” Advances in Neural Information Processing Systems,
    vol. 16, no. 320, p. 201, 2004.
    [13] K. Gregor, D. J. Rezende, and D. Wierstra, “Variational intrinsic control,”
    arXiv preprint arXiv:1611.07507, 2016.
    [14] J. Achiam, H. Edwards, D. Amodei, and P. Abbeel, “Variational option discovery algorithms,” arXiv preprint arXiv:1807.10299, 2018.
    [15] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need:
    Learning skills without a reward function,” arXiv preprint arXiv:1802.06070,
    2018.
    [16] K. Zeng, Q. Zhang, B. Chen, B. Liang, and J. Yang, “APD: Learning diverse behaviors for reinforcement learning through unsupervised active pretraining,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 12251–
    12258, 2022.
    [17] V. Campos, A. Trott, C. Xiong, R. Socher, X. Giró-i-Nieto, and J. Torres,
    “Explore, discover and learn: Unsupervised discovery of state-covering skills,”
    in International Conference on Machine Learning, pp. 1317–1327, PMLR,
    2020.
    [18] R. Yang, C. Bai, H. Guo, S. Li, B. Zhao, Z. Wang, P. Liu, and X. Li, “Behavior contrastive learning for unsupervised skill discovery,” arXiv preprint
    arXiv:2305.04477, 2023.
    [19] Y. Yuan, J. Hao, F. Ni, Y. Mu, Y. Zheng, Y. Hu, J. Liu, Y. Chen, and
    C. Fan, “EUCLID: Towards efficient unsupervised reinforcement learning with
    multi-choice dynamics model,” arXiv preprint arXiv:2210.00498, 2022.
    [20] P. Dayan, “Improving generalization for temporal difference learning: The
    successor representation,” Neural computation, vol. 5, no. 4, pp. 613–624,
    1993.
    [21] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman, “Deep successor
    reinforcement learning,” arXiv preprint arXiv:1606.02396, 2016.
    [22] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt,
    and D. Silver, “Successor features for transfer in reinforcement learning,”
    Advances in Neural Information Processing Systems, vol. 30, 2017.
    [23] L. Lehnert and M. L. Littman, “Successor features combine elements of model-free and model-based reinforcement learning,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 8030–8082, 2020.
    [24] D. Borsa, A. Barreto, J. Quan, D. Mankowitz, R. Munos, H. Van Hasselt,
    D. Silver, and T. Schaul, “Universal successor features approximators,” arXiv
    preprint arXiv:1812.07626, 2018.
    [25] C. Hoang, S. Sohn, J. Choi, W. Carvalho, and H. Lee, “Successor feature landmarks for long-horizon goal-conditioned reinforcement learning,” Advances in
    Neural Information Processing Systems, vol. 34, pp. 26963–26975, 2021.
    [26] R. Ramesh, M. Tomar, and B. Ravindran, “Successor options: An
    option discovery framework for reinforcement learning,” arXiv preprint
    arXiv:1905.05731, 2019.
    [27] M. Mozifian, D. Fox, D. Meger, F. Ramos, and A. Garg, “Generalizing successor features to continuous domains for multi-task learning,” 2021.
    [28] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive
    predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
    [29] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,”
    arXiv preprint arXiv:1509.02971, 2015.
    [30] J. Kim, S. Park, and G. Kim, “Unsupervised skill discovery with bottleneck
    option learning,” arXiv preprint arXiv:2106.14305, 2021.
    [31] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, “Exploration by random
    network distillation,” arXiv preprint arXiv:1810.12894, 2018.
    [32] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,”
    arXiv preprint arXiv:1506.02438, 2015.
    [33] T. Rajapakshe, R. Rana, S. Latif, S. Khalifa, and B. W. Schuller, “Pre-training in deep reinforcement learning for automatic speech recognition,”
    arXiv preprint arXiv:1910.11256, 2019.
    [34] A. Touati, J. Rapin, and Y. Ollivier, “Does zero-shot reinforcement learning
    exist?,” arXiv preprint arXiv:2209.14935, 2022.
    [35] A. Touati and Y. Ollivier, “Learning one representation to optimize all rewards,” Advances in Neural Information Processing Systems, vol. 34, pp. 13–
    23, 2021.
