A Multi-Agent Reinforcement Learning Approach For Safe and Efficient Behavior Planning Of Connected Autonomous Vehicles

Submitted to IEEE Transactions on Intelligent Transportation Systems

1Department of Computer Science and Engineering, University of Connecticut
2Department of Electrical and Computer Engineering, University of Connecticut
3Department of Electrical and Computer Engineering, University of Florida


We design a safe actor-critic multi-agent reinforcement learning (MARL) algorithm to learn a behavior-planning policy. The algorithm relies on two new techniques: a truncated Q-function and safe action mapping. We also introduce a controller based on a control barrier function quadratic program (CBF-QP) that generates steering-angle and acceleration inputs with provable safety guarantees.


Recent advancements in wireless technology enable connected autonomous vehicles (CAVs) to gather information about their environment through vehicle-to-vehicle (V2V) communication. In this work, we design an information-sharing-based multi-agent reinforcement learning (MARL) framework for CAVs that takes advantage of this extra information when making decisions, to improve traffic efficiency and safety. The safe actor-critic algorithm we propose introduces two new techniques: the truncated Q-function and safe action mapping. The truncated Q-function utilizes information shared by neighboring CAVs so that the joint state and action spaces of the Q-function do not grow with the size of the CAV system. We prove a bound on the approximation error between the truncated Q-function and the global Q-function. The safe action mapping provides a provable safety guarantee, based on control barrier functions, for both training and execution. In experiments with the CARLA simulator, we show that our approach improves the CAV system's efficiency, in terms of average velocity and comfort, under different CAV ratios and traffic densities. We also show that our approach avoids executing unsafe actions and always maintains a safe distance from other vehicles. Finally, we construct an obstacle-at-corner scenario to show that shared vision helps CAVs observe obstacles earlier and take action to avoid traffic jams.


  • We propose a novel safe and efficient actor-critic algorithm for behavior planning of CAVs, based on two new techniques. 1) Truncated Q-function: each vehicle learns a truncated Q-function as a critic that only needs the states and actions of neighboring vehicles, so the joint state and action spaces of the truncated Q-function do not grow with the size of the CAV system. 2) Safe action mapping: we map any action in the action space to the safe action set, so that both training and execution have provable safety guarantees.
  • To support the learning of the truncated Q-function, we propose a weight-pruned convolutional neural network (CNN) technique that guarantees images from cameras and point clouds from LIDAR are processed quickly enough that vision information is always available for learning the truncated Q-function.
  • We validate our algorithms in the CARLA simulator, which can simulate complicated mixed-traffic environments containing both autonomous and human-driven vehicles. The experiments show that the safe actor-critic algorithm improves traffic efficiency with safety guarantees. We also validate our MARL algorithm in challenging driving scenarios such as obstacle-at-corner, where shared vision combined with our algorithm helps vehicles avoid traffic jams.
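The weight-pruned CNN mentioned above can be illustrated with a minimal sketch of unstructured magnitude pruning: the smallest-magnitude weights are zeroed out so the network runs faster at inference time. The pruning criterion, sparsity level, and schedule here are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    Generic unstructured magnitude-pruning sketch; the paper's actual
    pruning criterion and schedule may differ.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# Example: prune 50% of a 3x3 convolution kernel's weights.
kernel = np.array([[0.9, -0.1, 0.3],
                   [0.05, 0.8, -0.6],
                   [0.2, -0.02, 0.4]])
sparse_kernel = prune_by_magnitude(kernel, 0.5)
```

The five largest-magnitude weights survive unchanged; the rest become zero, which is what makes the pruned convolution cheap enough for real-time vision processing.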

  • Problem Description

    We design a novel safe behavior planning and control framework with decentralized training and decentralized execution to tackle the new challenges for CAVs. A typical workflow for an autonomous vehicle includes perception, prediction, mapping and localization, routing, behavior planning, and control. We focus on the last two modules: the behavior planning module, which determines whether to change or keep lanes, and the control module, which controls the steering angle and the acceleration.
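The interface between these two modules can be made concrete with a small sketch: the behavior planner emits a discrete lane decision, and the controller emits continuous steering and acceleration commands. The names and fields below are illustrative assumptions, not the paper's exact definitions.

```python
from enum import Enum
from dataclasses import dataclass

class Behavior(Enum):
    """Discrete decisions produced by the behavior planning module."""
    KEEP_LANE = 0
    CHANGE_LEFT = 1
    CHANGE_RIGHT = 2

@dataclass
class ControlInput:
    """Continuous commands produced by the control module."""
    steering: float       # steering angle (rad)
    acceleration: float   # longitudinal acceleration (m/s^2)

# The planner chooses a behavior; the controller realizes it with inputs.
decision = Behavior.CHANGE_LEFT
command = ControlInput(steering=0.05, acceleration=1.5)
```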

    Truncated Q-function

    The truncated Q-network for behavior planning consists of an LSTM (long short-term memory) layer and FC (fully connected) layers. We use the truncated Q-function to approximate the centralized critic so that the training process utilizes the information-sharing capability of CAVs instead of relying on global states and actions.
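The key property of the truncated Q-network is that its input is the ego vehicle's neighborhood, not the whole fleet, so the input width stays fixed as the system scales. The sketch below captures this with a toy recurrent cell standing in for the LSTM layer; all dimensions and layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class TruncatedQNet:
    """Minimal numpy sketch of a truncated Q-network.

    The critic only consumes the states and actions of neighboring
    vehicles, so its input size does not grow with the number of CAVs.
    A simple tanh recurrent cell stands in for the LSTM layer.
    """
    def __init__(self, state_dim, action_dim, hidden=16, n_behaviors=3):
        self.Wx = rng.normal(scale=0.1, size=(state_dim + action_dim, hidden))
        self.Wh = rng.normal(scale=0.1, size=(hidden, hidden))
        self.Wq = rng.normal(scale=0.1, size=(hidden, n_behaviors))

    def forward(self, neighbor_seq):
        # neighbor_seq: (num_neighbors, state_dim + action_dim), one row
        # per neighbor, consumed sequentially like an LSTM input sequence.
        h = np.zeros(self.Wh.shape[0])
        for x in neighbor_seq:
            h = np.tanh(x @ self.Wx + h @ self.Wh)  # recurrent update
        return h @ self.Wq                           # one Q-value per behavior

q = TruncatedQNet(state_dim=4, action_dim=1)
values = q.forward(rng.normal(size=(3, 5)))  # 3 neighbors, 3 Q-values out
```

Note that feeding six neighbors instead of three changes only the sequence length, not the network's parameters or output shape, which is what keeps training tractable for large fleets.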

    Safe Action Mapping


    We use safe action mapping to guarantee that any action explored during training or implemented during execution is safe with feasible control inputs. We use a control barrier function based quadratic program (CBF-QP) to evaluate whether an action is safe. If the action is safe, we return it; otherwise, we search the remaining actions in descending order of their action values and return the first safe one. In the worst case, when no action is safe, we apply the emergency stop (ES) process.
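The mapping described above reduces to a simple loop: try actions in descending order of Q-value, accept the first one that passes the safety check, and fall back to the emergency stop if none does. In this sketch the CBF-QP feasibility test is abstracted as a boolean predicate; the sentinel value and function names are illustrative assumptions.

```python
import numpy as np

EMERGENCY_STOP = -1  # sentinel for the ES fallback (illustrative)

def safe_action_mapping(q_values, is_safe):
    """Map the greedy action to a safe one.

    q_values: per-action values from the truncated Q-network.
    is_safe:  callable action -> bool; in the paper this check is the
              feasibility of a CBF-QP, abstracted here as a predicate.
    Actions are tried in descending order of value; if no action is
    safe, the emergency-stop process is triggered.
    """
    for action in np.argsort(q_values)[::-1]:
        if is_safe(int(action)):
            return int(action)
    return EMERGENCY_STOP

# Example: action 2 has the highest value but is unsafe, so the mapping
# falls back to the best safe alternative, action 1.
choice = safe_action_mapping(np.array([0.2, 0.5, 0.9]), lambda a: a != 2)
```

Because the same mapping is applied during exploration, the policy never executes an unsafe action even while it is still being trained.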

    Shared Vision

    The vision information processing pipeline combines lane segmentation and 3D object detection results to extract neighboring vehicles' features. The processed information is included in the state of each CAV.
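The fusion step can be sketched as follows: each 3D detection is assigned a lane index from the segmentation output, and the pair is converted into a per-neighbor feature row that enters the CAV's state. The specific fields (relative lane, position, velocity) are illustrative assumptions about what the extracted features contain.

```python
import numpy as np

def extract_neighbor_features(lane_ids, detections, ego_lane):
    """Fuse lane segmentation and 3D detection into neighbor features.

    lane_ids:   lane index per detected vehicle, from lane segmentation.
    detections: (x, y, v) tuples per vehicle, from the 3D object detector.
    ego_lane:   the ego vehicle's own lane index.
    Returns one feature row per neighbor: [relative lane, x, y, v].
    """
    feats = []
    for lane, (x, y, v) in zip(lane_ids, detections):
        feats.append([lane - ego_lane, x, y, v])
    return np.array(feats)

# Two detected neighbors: one in the ego lane, one a lane to the right.
state = extract_neighbor_features(
    lane_ids=[1, 2],
    detections=[(10.0, 1.5, 8.0), (25.0, 4.5, 6.0)],
    ego_lane=1,
)
```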


    Consider the obstacle-at-corner scenario, where obstacles block a left-turning corner. The vehicles on the road are autonomous vehicles. Oncoming vehicles' views are blocked, so they cannot observe the obstacles. Without any information sharing, a traffic jam forms in the left lane. With vision shared from vehicle A or B, and using our safe MARL policy, vehicle C can change lanes before it enters the left-turning corner.


          author    = {Han, Songyang and Zhou, Shanglin and Wang, Jiangwei and Pepin, Lynn and Ding, Caiwen and Fu, Jie and Miao, Fei},
          title     = {A Multi-Agent Reinforcement Learning Approach For Safe and Efficient Behavior Planning Of Connected Autonomous Vehicles},