
Graduate Student: Sun, Wei-Fang (孫偉芳)
Thesis Title: A Unified Framework for Factorizing Distributional Value Functions for Multi-Agent Reinforcement Learning (多智能強化學習中分解價值分布函數之統一框架)
Advisor: Lee, Chun-Yi (李濬屹)
Committee Members: Lee, Cheng-Kuang (李正匡); Chou, Jerry (周志遠)
Degree: Master
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 111
Language: Chinese
Number of Pages: 55
Chinese Keywords: 強化學習 (Reinforcement Learning)
English Keywords: Reinforcement Learning
    In the fully cooperative multi-agent reinforcement learning setting, each agent observes only partial information and the agents' policies change continuously, which makes the environment highly stochastic. To address this problem, we propose a unified framework, named DFAC, for combining distributional reinforcement learning algorithms with value function factorization methods. The framework generalizes value function factorization methods so that they can factorize value distribution functions. To validate the effectiveness of DFAC, we first demonstrate its ability to factorize the return distribution of a simple matrix game with stochastic rewards. We then conduct experiments on all Super Hard maps of the StarCraft Multi-Agent Challenge and on six self-designed Ultra Hard maps; the results show that DFAC outperforms several baseline methods in terms of score.

    This research paper has been accepted for publication in the Journal of Machine Learning Research (JMLR 2023).


    In fully cooperative multi-agent reinforcement learning (MARL) settings, environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of other agents. To address the above issues, we propose a unified framework, called DFAC, for integrating distributional RL with value function factorization methods. This framework generalizes expected value function factorization methods to enable the factorization of return distributions. To validate DFAC, we first demonstrate its ability to factorize the value functions of a simple matrix game with stochastic rewards. Then, we perform experiments on all Super Hard maps of the StarCraft Multi-Agent Challenge and six self-designed Ultra Hard maps, showing that DFAC is able to outperform a number of baselines.
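
    The abstract describes the factorization idea only at a high level. As a rough, illustrative sketch (assumed here, not taken from the thesis; all class and function names such as AgentQuantileNet and factorize_joint_quantiles are hypothetical), the Python snippet below shows one way a VDN-style additive decomposition can be lifted to distributions: each agent's network outputs quantile values of its own return, and the joint return's quantiles are approximated by summing the per-agent quantile values at matching quantile fractions, in the spirit of the quantile-mixture formulation that DFAC builds on.

    # Minimal sketch of an additive (VDN-style) factorization of return
    # distributions, expressed through per-agent quantile values. This is an
    # illustrative assumption, not the thesis's actual DFAC implementation.
    import torch
    import torch.nn as nn

    class AgentQuantileNet(nn.Module):
        """Per-agent utility network producing quantile values for each action."""
        def __init__(self, obs_dim: int, n_actions: int, n_quantiles: int = 8):
            super().__init__()
            self.n_actions = n_actions
            self.n_quantiles = n_quantiles
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 64), nn.ReLU(),
                nn.Linear(64, n_actions * n_quantiles),
            )

        def forward(self, obs: torch.Tensor) -> torch.Tensor:
            # Shape (batch, n_actions, n_quantiles): quantile values of each
            # action's return distribution for this agent.
            return self.net(obs).view(-1, self.n_actions, self.n_quantiles)

    def factorize_joint_quantiles(per_agent_quantiles):
        # Sum per-agent quantile values at matching quantile fractions, i.e. an
        # additive decomposition of the joint return's quantile function.
        return torch.stack(per_agent_quantiles, dim=0).sum(dim=0)

    # Usage: two agents in a batch of four transitions, each taking action 0.
    obs_dim, n_actions, batch = 16, 5, 4
    agents = [AgentQuantileNet(obs_dim, n_actions) for _ in range(2)]
    obs = torch.randn(batch, obs_dim)
    chosen = torch.zeros(batch, dtype=torch.long)
    per_agent = [a(obs)[torch.arange(batch), chosen] for a in agents]
    joint_quantiles = factorize_joint_quantiles(per_agent)
    print(joint_quantiles.shape)  # torch.Size([4, 8])

    In a QMIX- or DFAC-like system the simple sum would be replaced by a learned monotonic mixing function, but the additive case is enough to convey how individual return distributions can be combined into a joint one.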

    This research paper has been accepted by the Journal of Machine Learning Research (JMLR 2023).

    1 Introduction - 1
    2 Related Works - 4
    3 Background - 7
    4 Methodology - 18
    5 A Stochastic Matrix Game - 33
    6 Experiment Results on SMAC - 38
    7 Discussions and Outlook - 45
    8 Conclusion - 48
    Bibliography - 49

