What is the Solution for State-Adversarial Multi-Agent Reinforcement Learning?

Transactions on Machine Learning Research (TMLR)

1School of Computing, University of Connecticut
2Department of Electrical and Computer Engineering, University of Illinois at Chicago
3Department of Mathematics and Department of Computer Science, University of Maryland, College Park
4Department of Electrical Engineering and Department of Computer Science and Engineering, University at Buffalo, The State University of New York

The agents' goal is to occupy and cover all landmarks, which requires cooperation to decide which landmark each agent covers. Figure a) illustrates the optimal target landmark for each agent without state perturbation. In figure b), however, an adversary perturbs the agents' state observations, causing the agents to head in the wrong directions and leaving landmark 1 uncovered. Our work demonstrates that traditional agent policies can be easily corrupted by adversarial state perturbations. To counter this, we propose a robust agent policy that maximizes average performance under worst-case state perturbations.


Various methods for Multi-Agent Reinforcement Learning (MARL) have been developed with the assumption that agents' policies are based on accurate state information. However, policies learned through Deep Reinforcement Learning (DRL) are susceptible to adversarial state perturbation attacks. In this work, we propose a State-Adversarial Markov Game (SAMG) and make the first attempt to investigate different solution concepts of MARL under state uncertainties. Our analysis shows that the commonly used solution concepts of optimal agent policy and robust Nash equilibrium do not always exist in SAMGs. To circumvent this difficulty, we consider a new solution concept called robust agent policy, where agents aim to maximize the worst-case expected state value. We prove the existence of a robust agent policy for SAMGs with finite state and finite action spaces. Additionally, we propose a Robust Multi-Agent Adversarial Actor-Critic (RMA3C) algorithm to learn robust policies for MARL agents under state uncertainties. Our experiments demonstrate that our algorithm outperforms existing methods when faced with state perturbations and greatly improves the robustness of MARL policies.


  • We formulate a State-Adversarial Markov Game (SAMG) to study the fundamental properties of MARL under adversarial state perturbations. We prove that widely used solution concepts such as the optimal agent policy or the robust Nash equilibrium do not always exist.
  • We consider a new solution concept, robust agent policy, where each agent aims to maximize the worst-case expected state value. We prove the existence of a robust agent policy for SAMGs with finite state and action spaces. We propose a Robust Multi-Agent Adversarial Actor-Critic (RMA3C) algorithm, based on a gradient descent ascent (GDA) algorithm, to address the challenge of training robust policies under adversarial state perturbations.
  • We empirically evaluate our proposed RMA3C algorithm. Our algorithm outperforms baselines with random or adversarial state perturbations and improves agent policies' robustness under state uncertainties.
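The gradient descent ascent scheme behind the training objective can be illustrated on a toy saddle-point problem. The sketch below is purely illustrative (the objective, learning rate, and variable names are assumptions, not the paper's actor-critic updates): the agent parameter ascends the objective while the adversary parameter descends it, converging to the saddle point.

```python
# Minimal gradient descent ascent (GDA) sketch on a toy concave-convex
# objective J(theta, chi) = -theta^2 + theta*chi + chi^2, standing in for
# the agents' worst-case return. The agent ascends in theta; the
# adversary descends in chi. The unique saddle point is (0, 0).
def gda(theta=1.0, chi=-1.0, lr=0.05, steps=500):
    for _ in range(steps):
        grad_theta = -2.0 * theta + chi   # dJ/dtheta (agent's ascent direction)
        grad_chi = theta + 2.0 * chi      # dJ/dchi   (adversary's descent direction)
        theta += lr * grad_theta          # ascent step for the agent
        chi -= lr * grad_chi              # descent step for the adversary
    return theta, chi

theta, chi = gda()  # both approach 0.0
```

In RMA3C the same alternation is applied to neural policy parameters, with the adversary constrained to an admissible perturbation set rather than a free scalar.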

  • State-Adversarial Markov Game (SAMG)

    Multi-agent reinforcement learning under adversarial state perturbations. Each agent is associated with an adversary to perturb its knowledge or observation of the true state. Agents want to find a policy \(\pi \) to maximize their total expected return while adversaries want to find a policy \( \chi \) to minimize agents' total expected return.
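A single SAMG transition can be sketched as follows. This is a hypothetical toy instantiation, not the paper's implementation (all function names and the scalar state are assumptions): each adversary perturbs its agent's observation of the true state, each agent acts on the perturbed observation, but the environment transitions and pays rewards on the true state.

```python
import numpy as np

rng = np.random.default_rng(0)

def samg_step(true_state, agent_policies, adversary_policies, transition):
    # Each adversary i produces a perturbed observation chi_i(s).
    perturbed = [adv(true_state) for adv in adversary_policies]
    # Each agent i acts on its perturbed observation via pi_i.
    actions = [pi(obs) for pi, obs in zip(agent_policies, perturbed)]
    # The environment still transitions on the TRUE state.
    next_state, rewards = transition(true_state, actions)
    return next_state, rewards

# Toy instantiation: scalar state, bounded additive perturbations.
pi = lambda obs: 1.0 if obs > 0 else -1.0          # agent policy on observation
adv = lambda s: s + rng.uniform(-0.5, 0.5)         # bounded state perturbation
trans = lambda s, a: (s + sum(a), [-abs(s)] * 2)   # shared cost on true state

s1, r = samg_step(0.2, [pi, pi], [adv, adv], trans)
```

The key point the sketch captures is the information asymmetry: agents never see the true state, yet their returns depend on it, which is what makes the adversaries' perturbation policy \( \chi \) consequential.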

    Solution Concepts

    Solution concepts for the SAMGs. We first examine the widely used concepts (optimal agent policy and robust Nash Equilibrium) and demonstrate their non-existence under adversarial state perturbations. In response, we consider a new objective, the worst-case expected state value, and a new solution concept, the robust agent policy.
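For intuition, the worst-case expected state value can be computed by brute force in a tiny finite example: fix the agents' policy (folded into the value function here) and let the adversaries jointly pick the perturbed observations that minimize the agents' value. This enumeration is illustrative only (the value function and perturbation sets are assumptions); the paper optimizes this objective with RMA3C rather than exhaustive search.

```python
import itertools

def worst_case_value(value_fn, state, admissible_perturbations):
    # admissible_perturbations[i] lists agent i's possible perturbed
    # observations; the adversaries pick the joint worst case.
    return min(value_fn(state, joint_obs)
               for joint_obs in itertools.product(*admissible_perturbations))

# Toy example: two agents; value drops as observations deviate from the state.
value = lambda s, obs: -sum(abs(s - o) for o in obs)
perturb_sets = [[0.0, 0.5], [0.0, -0.5]]  # admissible observations per agent

print(worst_case_value(value, 0.0, perturb_sets))  # → -1.0
```

A robust agent policy is then one that maximizes this inner minimum, rather than the unperturbed value.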


    Our RMA3C algorithm compared with several baseline algorithms during the training process. The results show that RMA3C outperforms the baselines, achieving higher mean episode rewards and greater robustness to state perturbations. The baselines were trained under either random state perturbations or a well-trained adversary policy \(\chi^* \). Note that the MAPPO algorithm only applies to fully cooperative tasks, so its results are reported only for the cooperative navigation and exchange target scenarios. Overall, RMA3C achieves up to 58.46% higher mean episode rewards than the baselines.


          author = {Han, Songyang and Su, Sanbao and He, Sihong and Han, Shuo and Yang, Haizhao and Miao, Fei},
          title = {What is the Solution for State-Adversarial Multi-Agent Reinforcement Learning?},
          journal = {Transactions on Machine Learning Research},
          year = {2024},