Details of the papers in README.md
(personal notes).
Rethinking Closed-loop Training for Autonomous Driving ECCV'22 Paper
Key Insight: There is a lack of understanding of how to build effective training benchmarks for closed-loop training (what types of scenarios do we need to learn to drive safely?). Popular RL algorithms cannot achieve satisfactory performance in the context of AD, as they lack long-term planning and take an extremely long time to train. So the paper proposes TRAVL, an RL-based driving agent that performs planning with multistep look-ahead and exploits cheaply generated imagined data for efficient learning.
Method: Directly outputting control signals may lack long-term reasoning, while explicit model rollouts can be prohibitively expensive.
It combines aspects of model-free and model-based approaches: 1. reason into the future by directly costing trajectories without explicit rollouts; 2. learn from imagined (i.e., counterfactual) experiences from an approximate world model.
An action is defined as a trajectory that navigates the ego-vehicle over a short planning horizon.
Imagined counterfactual rollouts (non-reactive) are used as supervision only for the short-term cost-to-come.
Experiment: an unnamed highway dataset. More variation and targeted scenarios are important for closed-loop agent training.
UMBRELLA: Uncertainty-Aware Model-Based Offline Reinforcement Learning Leveraging Planning NeurIPS'21 Workshop (Best Paper) Paper
Key insight: An extension of MBOP that allows for a simple extension of the reward function and the incorporation of state constraints. Planning with the learned dynamics model also enhances interpretability. However, MBOP uses a simple deterministic dynamics model, which ignores aleatoric uncertainty, and it operates in a fully observable setting.
Method: It uses the n-th order history to represent the state, transforming the POMDP into an MDP. To model different futures, it learns stochastic forward dynamics models.
This work follows the n-th-order history approach to account for states that are not fully observable and are merely estimated from the last n observations (see the sketch below).
The behavior cloning model takes the current state and the
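A minimal sketch of the n-th-order history trick (stacking the last n observations as the state); the class and method names are illustrative, not from the paper:

```python
import numpy as np
from collections import deque

class HistoryStacker:
    """Approximate the POMDP state by the last n observations (n-th order history)."""
    def __init__(self, n: int):
        self.n = n
        self.buffer = deque(maxlen=n)

    def reset(self, first_obs: np.ndarray) -> np.ndarray:
        # Pad the history with the first observation at episode start.
        self.buffer.clear()
        for _ in range(self.n):
            self.buffer.append(first_obs)
        return self.state()

    def push(self, obs: np.ndarray) -> np.ndarray:
        self.buffer.append(obs)
        return self.state()

    def state(self) -> np.ndarray:
        # The "state" fed to the dynamics / BC model is the concatenated history.
        return np.concatenate(list(self.buffer), axis=-1)

# Usage: a 3rd-order history over 4-dimensional observations.
stacker = HistoryStacker(n=3)
s = stacker.reset(np.zeros(4))
s = stacker.push(np.random.randn(4))
print(s.shape)  # (12,)
```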
Experiments: NGSIM and CARLA.
Limitations of model-based offline planning: the reward function does not exactly represent the human driving style, and the performance is limited by the unimodal BC policy (multi-modal BC policies could help).
Offline Reinforcement Learning for Autonomous Driving with Safety and Exploration Enhancement NeurIPS'21 Workshop Paper
A simple extension of batch-constrained Q-learning: injecting parameter noise for exploration and adding a Lyapunov-stability-based safety constraint.
Motion Planning for Autonomous Vehicles in the Presence of Uncertainty Using Reinforcement Learning IROS'21 Paper
Key insight: Previous methods end up with conservative planning and expensive computation. This paper proposes an RL-based solution that manages uncertainty by optimizing for the worst-case outcome (using quantile regression). It is built on top of distributional RL, with its policy optimization maximizing the lower bound of the stochastic outcomes.
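A rough sketch of the worst-case idea on top of distributional RL: act on the mean of the lowest return quantiles instead of the mean return. Tensor shapes, the risk fraction, and function names are illustrative assumptions, not the paper's architecture.

```python
import torch

def lower_bound_value(quantile_values: torch.Tensor, risk_fraction: float = 0.25) -> torch.Tensor:
    """quantile_values: (batch, num_actions, num_quantiles) predicted return quantiles.
    Returns a pessimistic value per action: the mean of the lowest quantiles."""
    num_quantiles = quantile_values.shape[-1]
    k = max(1, int(risk_fraction * num_quantiles))
    worst, _ = torch.topk(quantile_values, k, dim=-1, largest=False)
    return worst.mean(dim=-1)  # (batch, num_actions)

def select_action(quantile_values: torch.Tensor) -> torch.Tensor:
    # Pick the action whose worst-case (lower-bound) return is highest.
    return lower_bound_value(quantile_values).argmax(dim=-1)

q = torch.randn(2, 5, 32)  # 2 states, 5 candidate maneuvers, 32 quantiles
print(select_action(q))
```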
Interpretable End-to-end Urban Autonomous Driving with Latent Deep Reinforcement Learning arXiv'20 Paper | Code
Key insight: A sequential latent environment model is introduced and learned jointly with the reinforcement learning process. With this latent model, a semantic bird's-eye-view mask can be generated, which is enforced to connect with certain intermediate properties in order to explain the behaviors of the learned policy.
Method: Variational inference. Reconstruct the BEV mask and the sensor inputs (given only sensor inputs). MaxEnt RL can be interpreted as learning a probabilistic graphical model with optimality variables and soft probabilities.
Experiments: Carla.
Model-free Deep Reinforcement Learning for Urban Autonomous Driving arXiv'19 Paper | Code
Key insight: Current RL methods generally do not work well in complex urban scenarios. This paper proposes a framework to enable model-free deep RL in challenging urban autonomous driving scenarios. It designs a specific input representation and uses visual encoding to capture the low-dimensional latent states (BEV & VAE). Several tricks: modified exploration strategies, frame skip, network architectures, and reward design.
Experiments: Carla.
Learning to Drive in a Day arXiv'18 Paper
Guided Conditional Diffusion for Controllable Traffic Simulation arXiv'22 Paper
Key insight: The control-realism tradeoff has long been a problem in AD simulators. This paper proposes to use a diffusion model (similar to that in Diffuser) guided by a signal temporal logic (STL) measure to manage the tradeoff. It further extends the denoising step (batched denoising) to the multi-agent setting and enables interaction-based rules like collision avoidance.
Method: It models the trajectory, but only the action trajectory, since the state trajectory is derived by the dynamics model (thereby bypassing the start-state inpainting problem). It then uses the quantitative STL measure to guide the diffusion model during sampling to create realistic, controllable traffic scenes.
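A schematic of rule-guided sampling under stated assumptions: after each (simplified) denoising step, the action trajectory is nudged along the gradient of a differentiable rule score standing in for the quantitative STL measure. The `denoiser`, `dynamics`, and `rule_score` callables are toy stand-ins, not the paper's components.

```python
import torch

def guided_sampling(denoiser, rule_score, dynamics, x_T, n_steps=50, guide_scale=0.1):
    """denoiser(x, t)  -> one (simplified) denoising step on the action trajectory x
    dynamics(actions)  -> differentiable rollout of states from actions
    rule_score(states) -> differentiable scalar (stand-in for the quantitative STL measure)."""
    x = x_T
    for t in reversed(range(n_steps)):
        x = denoiser(x, t)
        x = x.detach().requires_grad_(True)
        score = rule_score(dynamics(x))          # how well the resulting states satisfy the rules
        grad = torch.autograd.grad(score, x)[0]
        x = (x + guide_scale * grad).detach()    # nudge actions toward rule satisfaction
    return x

# Toy usage with stand-in components (all illustrative).
denoiser = lambda x, t: 0.9 * x                  # pretend denoiser
dynamics = lambda a: torch.cumsum(a, dim=1)      # toy "rollout" of states from actions
rule_score = lambda s: -(s ** 2).mean()          # e.g., stay near the lane center
traj = guided_sampling(denoiser, rule_score, dynamics, torch.randn(4, 20, 2))
print(traj.shape)  # torch.Size([4, 20, 2])
```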
Model-Based Imitation Learning for Urban Driving NeurIPS'22 Paper | Code
Key insight: The paper presents MILE, a model-based imitation learning approach that jointly learns a model of the world and a policy for autonomous driving. The model can predict diverse and plausible states and actions, which can be interpretably decoded to BEV semantic segmentation. It can also execute complex driving manoeuvres from plans entirely predicted in imagination. It does not assume access to GT physical states (position, velocity) or to an offline HD map for scene context. MILE is the first camera-only method that models static scenes, dynamic scenes, and ego-behavior in an urban driving environment.
Method: Variational inference. The goal is to infer the latent dynamics that explain the observed driving sequences.
The dynamics combine a deterministic history with a stochastic latent state.
Experiment: Carla challenge.
Hierarchical Model-Based Imitation Learning for Planning in Autonomous Driving arXiv'22 Paper
Key insight: It demonstrates the first large-scale application of model-based generative adversarial imitation learning (MGAIL) to the task of dense urban self-driving. It augments MGAIL with a hierarchical model to enable generalization to arbitrary goal routes, and measures performance using a closed-loop evaluation framework with simulated interactive agents.
Method: A common challenge with IL is covariate shift, also known as the "DAgger problem". The fundamental problem is that open-loop training incurs compounding errors at each time step. Therefore, the paper proposes to use MGAIL (model-based generative adversarial imitation learning) to conduct closed-loop training. Specifically, it uses a delta-action model to enable differentiable policy updates.
Another issue is that the planning problem should be goal-directed. So the paper proposes a hierarchical structure for planning: the high-level module uses bidirectional A* to produce a goal route; the low-level module uses a Transformer with cross-attention conditioned on the goal route, scene context, and traffic lights to output the discriminator score for GAIL as well as the action at the next timestep.
Experiment: Waymo's own dataset from a fleet of their vehicles operating in San Francisco. It proposes to measure the average performance as well as the performance in difficult and uncommon situations (i.e., the long-tail performance).
ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning ECCV'22 Paper | Code
Key insight: To better predict the control signals and enhance user safety, this paper proposes an end-to-end approach that benefits from joint spatial-temporal feature learning.
Method: After obtaining the context feature from the perception stage, ST-P3 adopts dual-pathway probabilistic future modeling to get the context features at future timesteps: one branch models the future uncertainty as diagonal Gaussians, while the other branch takes the historical features into account. Based on the future context features, a cost-volume head is designed to output heatmaps at future timesteps. The total cost is composed of three terms: a safety cost for colliding with environment agents, a learned cost from the output heatmap, and a trajectory cost for comfort and smoothness. A sampler is used to output the final trajectory (a toy version is sketched below). For the planning stage, the high-level command (goal position) and the front-view traffic lights are fed into a GRU to refine the planned trajectory.
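A toy version of the sample-and-score planner, assuming grid-aligned candidate trajectories and made-up cost weights; the real ST-P3 cost terms and sampler are more involved.

```python
import numpy as np

def plan_by_sampling(candidates, occupancy, heatmap, w_safe=10.0, w_learned=1.0, w_comfort=0.1):
    """candidates: (K, T, 2) trajectories in integer grid coordinates.
    occupancy:  (T, H, W) predicted occupancy of other agents.
    heatmap:    (T, H, W) learned cost volume (lower is better here)."""
    K, T, _ = candidates.shape
    costs = np.zeros(K)
    for k in range(K):
        traj = candidates[k]
        xs, ys = traj[:, 0], traj[:, 1]
        t_idx = np.arange(T)
        safety = occupancy[t_idx, ys, xs].sum()          # collision with environment agents
        learned = heatmap[t_idx, ys, xs].sum()           # learned cost from the heatmap head
        jerk = np.abs(np.diff(traj, n=2, axis=0)).sum()  # comfort / smoothness proxy
        costs[k] = w_safe * safety + w_learned * learned + w_comfort * jerk
    return candidates[np.argmin(costs)]

# Toy usage.
K, T, H, W = 8, 6, 32, 32
cands = np.random.randint(0, 32, size=(K, T, 2))
best = plan_by_sampling(cands, np.random.rand(T, H, W), np.random.rand(T, H, W))
print(best.shape)  # (6, 2)
```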
Experiment: Open-loop nuScenes and closed-loop Carla.
PlanT: Explainable Planning Transformers via Object-Level Representations CoRL'22 Paper | Code
Key insight: Existing learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. This paper proposes PlanT, based on imitation learning with a compact object-level input representation. The experimental results indicate that PlanT can focus on the most relevant object in the scene, even when that object is geometrically distant.
Method: The most important part is the tokenization of the transformer input. Specifically, the scene is represented as a set of objects, with vehicles and segments of the route each assigned an oriented bounding box in BEV space. The vehicle set contains all vehicles in the BEV, while the route-segment boxes are obtained by sampling a dense set of segments along the route ahead of the ego vehicle. Each object has six attributes: the relative position, the length/width of the bounding box, the orientation, and the speed (for vehicles) or the order along the route (for route segments).
It introduces a [CLS] token that attentively aggregates information from all the other tokens. The transformer processes the projected tokens, and the [CLS] embedding is fed into a GRU to generate multi-step future waypoints autoregressively. Besides, the embeddings of the environment agents are used to regress their future bounding boxes as auxiliary supervision.
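A minimal sketch of the object-level tokenization described above; the input dictionaries and the exact attribute ordering are assumptions for illustration.

```python
import numpy as np

def tokenize_scene(vehicles, route_segments, ego_pose):
    """Build PlanT-style object tokens: 6 attributes per object
    (relative x, relative y, extent length, extent width, orientation, speed-or-order)."""
    ex, ey, eyaw = ego_pose
    tokens = []
    for v in vehicles:  # v: dict with x, y, length, width, yaw, speed
        tokens.append([v["x"] - ex, v["y"] - ey, v["length"], v["width"],
                       v["yaw"] - eyaw, v["speed"]])
    for order, seg in enumerate(route_segments):  # seg: dict with x, y, length, width, yaw
        tokens.append([seg["x"] - ex, seg["y"] - ey, seg["length"], seg["width"],
                       seg["yaw"] - eyaw, float(order)])
    return np.asarray(tokens, dtype=np.float32)  # (num_objects, 6)

vehicles = [dict(x=5.0, y=1.0, length=4.5, width=2.0, yaw=0.1, speed=8.0)]
route = [dict(x=2.0, y=0.0, length=5.0, width=3.5, yaw=0.0),
         dict(x=7.0, y=0.0, length=5.0, width=3.5, yaw=0.0)]
print(tokenize_scene(vehicles, route, ego_pose=(0.0, 0.0, 0.0)).shape)  # (3, 6)
```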
Experiment: Carla Longest6.
End-to-End Urban Driving by Imitating a Reinforcement Learning Coach CVPR'21 Paper | Code
Key insight: Collecting on-policy demonstrations from humans is non-trivial, and labeling the targets given sensor measurements turns out to be a challenging task for humans. Only sparse events like human interventions are recorded, which are better suited for RL than IL methods. This paper focuses on automated experts, with which the IL problem can be seen as a knowledge-transfer problem. The paper proposes Roach, an RL expert that maps BEV images to continuous actions and can provide action distributions, value estimations, and latent features as supervision.
Method: RL expert: PPO with entropy and exploration losses. IL agent: DA-RB (CILRS + DAgger), taking image and measurement inputs.
Experiment: CARLA leaderboard.
Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations ECCV'20 Paper
DSDNet: Deep Structured self-Driving Network ECCV'20 Paper
Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles IROS'19 Paper
End-to-end Interpretable Neural Motion Planner CVPR'19 (Oral) Paper
ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst RSS'19 Paper
Key insight: It proposes exposing the learner to synthesized data in the form of perturbations to the expert's driving, which creates interesting situations such as collisions and/or going off the road. Rather than purely imitating all data, it augments the imitation loss with additional losses that penalize undesirable events and encourage progress.
Method: The AgentRNN is unrolled at training time for a fixed number of iterations, and the losses are summed over the unrolled iterations.
End-to-end Driving via Conditional Imitation Learning ICRA'18 Paper
LEADER: Learning Attention over Driving Behaviors for Planning under Uncertainty CoRL'22 (Oral) Paper
Key insight: POMDPs offer a principled framework for planning under uncertainty. However, sampling also raises safety concerns by potentially missing critical events. This paper proposes LEADER, which learns to attend to critical human behaviors during planning. LEADER learns a neural network generator (by solving a min-max game) to provide attention over human behaviors, using importance sampling to bias reasoning towards critical events.
Method: Critic network:
Experiment: SUMMIT.
Closing the Planning-Learning Loop with Application to Autonomous Driving T-RO'22 Paper | Code
Key insight: Two challenges for autonomous driving: scalability and uncertainty (a partially observable and complex world model). Current POMDP solvers like DESPOT still struggle with very long horizons or large action spaces, producing highly sub-optimal decisions. Inspired by AlphaGo Zero, this paper proposes LeTS-Drive, which integrates planning and learning in a closed loop, taking advantage of both self-supervised learning (learning from the planner's improvement) and reinforcement learning (learning from the environment feedback).
Method: Driving is formulated as a POMDP.
DESPOT reduces the complexity of belief-tree search by sampling a small set of scenarios and searching a sparse belief tree.
Experiment: SUMMIT. Given any urban map supported by OpenStreetMap, it automatically generates realistic traffic.
KB-Tree: Learnable and Continuous Monte-Carlo Tree Search for Autonomous Driving Planning IROS'21 Paper
Key insight: Using kernel regression and Bayesian optimization to enable MCTS in continuous action spaces.
Driving Maneuvers Prediction Based Autonomous Driving Control by Deep Monte Carlo Tree Search T-VT'20 Paper | Code
Key insight: This paper develops a deep MCTS control method for vision-based autonomous driving.
Method: The MCTS module is composed of the Vehicle State Prediction network
Experiment: Simulator USS.
M2I: From Factored Marginal Trajectory Prediction to Interactive Prediction CVPR'22 Paper | [Code](https://tsinghua-mars-lab.github.io/M2I/)
Key insight: Existing models excel at predicting marginal trajectories for single agents, yet it remains an open problem to jointly predict scene-compliant trajectories over multiple agents. This paper exploits the underlying relations between interacting agents and decouples the joint prediction problem into marginal prediction problems. As causality in driving interaction remains an open problem, it pre-labels the influencer-reactor relation based on a heuristic and proposes a relation predictor to classify interactive relations at inference time.
Method: Focus on two interactive agents:
For each marginal/conditional trajectory, it predicts
InterSim: Interactive Traffic Simulation via Explicit Relation Modeling IROS'22 Paper | Code
Key insight: Existing approaches learn an agent model from large-scale driving data to simulate realistic traffic scenarios, yet it remains an open question to produce consistent and diverse multi-agent interactive behaviors in crowded scenes.
Method: A five-step procedure: conflict detection, relation-aware conflict resolution, goal-driven trajectory prediction, conflict resolution between environment agents, and one-step simulation.
If the bounding boxes of two agents overlap at any time in the future, a conflict is detected and the simulator needs to update the colliding trajectories for better consistency and realism. Specifically, it uses a relation predictor to identify influencer-reactor pairs. If an environment agent is chosen as the reactor, its goal point is reset at the colliding point, and the simulator uses DenseTNT to roll out its new trajectory. The iteration continues until no environment agent or the ego vehicle collides with another (a schematic loop is sketched below).
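A schematic of the conflict-detection/resolution loop, with axis-aligned boxes and placeholder `relation_predictor`/`replan` callables standing in for the learned relation predictor and DenseTNT:

```python
def overlaps(box_a, box_b):
    # Axis-aligned overlap test (real systems use oriented boxes).
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return abs(ax - bx) * 2 < (aw + bw) and abs(ay - by) * 2 < (ah + bh)

def detect_conflicts(trajs):
    """trajs: {agent_id: [box_t for t in horizon]}. Returns (agent_i, agent_j, t) conflicts."""
    conflicts = []
    ids = list(trajs)
    horizon = len(next(iter(trajs.values())))
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            for t in range(horizon):
                if overlaps(trajs[ids[i]][t], trajs[ids[j]][t]):
                    conflicts.append((ids[i], ids[j], t))
                    break
    return conflicts

def resolve(trajs, relation_predictor, replan, max_iters=10):
    # Keep replanning reactors until no pair of agents is predicted to collide.
    for _ in range(max_iters):
        conflicts = detect_conflicts(trajs)
        if not conflicts:
            break
        for a, b, t in conflicts:
            reactor = relation_predictor(a, b)          # which agent yields
            trajs[reactor] = replan(reactor, trajs, t)  # e.g., a goal-conditioned predictor
    return trajs

# Toy usage of the detection step.
trajs = {"ego": [(0, 0, 2, 1)] * 5, "car1": [(0.5, 0, 2, 1)] * 5}
print(detect_conflicts(trajs))
```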
Comprehensive Reactive Safety: No Need for a Trajectory if You Have a Strategy IROS'22 Paper
Autonomous Driving Motion Planning With Constrained Iterative LQR T-IT'19 Paper
Tunable and Stable Real-Time Trajectory Planning for Urban Autonomous Driving IROS'15 Paper
Sampling-based Algorithms for Optimal Motion Planning IJRR'10 Paper
Practical Search Techniques in Path Planning for Autonomous Driving AAAI'08 Paper
Mismatched No More: Joint Model-Policy Optimization for Model-Based RL NeurIPS'22 Paper
Key insight: Models that achieve better training performance are not necessarily better for control: the objective-mismatch problem. This paper proposes a single objective for jointly training the model and the policy (using an augmented reward), such that updates to either component increase a lower bound on the expected return. The resulting algorithm is conceptually similar to a GAN: a classifier distinguishes between real and fake transitions, the model is updated to produce transitions that look realistic, and the policy is updated to avoid states where the model predictions are unrealistic. Specifically, its model objective includes an additional value term and the policy objective includes an additional classifier term (see the sketch below).
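A hedged illustration of the augmented-reward idea: the classifier's log-ratio is added to the reward the policy maximizes, so the policy is discouraged from regions where the model looks unrealistic. The function name and the λ weight are assumptions, not the paper's exact objective.

```python
import torch

def augmented_reward(reward, classifier_logit, lam=1.0):
    """reward: environment/model reward r(s, a).
    classifier_logit: logit of C(s, a, s'), the probability the transition is 'real'.
    Since log C - log(1 - C) equals the logit, the logit is the log-ratio directly."""
    return reward + lam * classifier_logit

print(augmented_reward(torch.tensor(1.0), torch.tensor(-2.0)))
```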
Planning with Diffusion for Flexible Behavior Synthesis ICML'22 Paper
Key insight: In model-based RL, the learned models may not be well-suited to standard trajectory optimization since they have different objectives. This paper proposes to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of the paper is planning by iteratively denoising trajectories. Based on this, reinforcement learning can be interpreted as guided sampling, and goal-conditioned RL can be interpreted as inpainting.
Method: It uses a diffusion model to model trajectories. Since decision-making can be anti-causal (conditioned on the future), Diffuser predicts all timesteps of a plan concurrently. As input to the diffuser, states and actions form a two-dimensional array. The diffusion model is trained similarly to DDPM; return guidance is applied at sampling time.
Goal-conditioned RL, as well as the start-state constraint, can be interpreted as an inpainting problem: the start/goal states are hard-coded (overwritten) at the end of each denoising step, as sketched below.
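A minimal sketch of conditioning by inpainting, assuming the plan is a (horizon × [state, action]) array and `denoise_step` stands in for one reverse-diffusion step:

```python
import torch

def denoise_with_inpainting(denoise_step, x, start_state, goal_state=None, n_steps=50):
    """x: (batch, horizon, state_dim + action_dim) noisy plan; the first `state_dim`
    columns hold states. `denoise_step(x, t)` is a stand-in for one reverse-diffusion step."""
    state_dim = start_state.shape[-1]
    for t in reversed(range(n_steps)):
        x = denoise_step(x, t)
        x[:, 0, :state_dim] = start_state        # clamp the current (start) state
        if goal_state is not None:
            x[:, -1, :state_dim] = goal_state    # clamp the goal state for goal-conditioned RL
    return x

# Toy usage with a fake denoiser.
x = torch.randn(2, 16, 6)                        # 4-dim state + 2-dim action, horizon 16
plan = denoise_with_inpainting(lambda x, t: 0.95 * x, x,
                               start_state=torch.zeros(4), goal_state=torch.ones(4))
print(plan[:, 0, :4])                            # equals the start state
```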
The In-Sample Softmax for Offline Reinforcement Learning ICLR'23 (submitted) Paper
Key insight: The critical challenge of offline RL is insufficient action coverage. A growing number of methods attempt to approximate an in-sample max that only uses actions well-covered by the dataset. This paper highlights a simple fact: in the entropy-regularized setting, it is more straightforward to approximate an in-sample softmax using only actions in the dataset. For instance, batch-constrained Q-learning uses an in-sample max, and the IQL solution depends on the action distribution, not just its support. The in-sample softmax relies primarily on sampling from the dataset, which is naturally in-sample, rather than requiring samples from an estimate of the behavior policy.
Method: The soft Bellman optimality equations for maximum-entropy RL use the softmax in place of the max; as the temperature goes to zero, the softmax recovers the hard max. The in-sample softmax restricts the softmax to actions in the dataset (a simplified estimator is sketched below).
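A simplified estimator of the in-sample soft value, assuming we only have Q-estimates for the dataset actions at a state (the paper's actual estimator is more careful about the sampling distribution):

```python
import torch

def in_sample_softmax_value(q_values: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """q_values: (num_dataset_actions,) Q-estimates for actions that actually appear
    in the dataset at this state. The soft value V(s) = tau * logsumexp(Q(s, a) / tau)
    is computed over in-sample actions only, so no OOD action is ever queried."""
    return temperature * torch.logsumexp(q_values / temperature, dim=-1)

q_in_sample = torch.tensor([1.0, 2.0, 0.5])
for tau in (1.0, 0.1, 0.01):
    print(tau, in_sample_softmax_value(q_in_sample, tau).item())  # approaches max(Q) as tau -> 0
```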
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning arXiv'22 Paper
Key insight: Simply model the policy as a conditional diffusion model, with a Q-learning term added to guide it toward high-value actions.
Know Your Boundaries: The Necessity of Explicit Behavioral Cloning in Offline RL arXiv'22 Paper
Key insight: The paper proves that the soft policy iteration with a penalized value function is equivalent to policy iteration regularized by
Mildly Conservative Q-Learning for Offline Reinforcement Learning NeurIPS'22 Paper | Code
Key insight: Existing offline RL methods that penalize unseen actions or regularize with the behavior policy are too pessimistic, which suppresses the generalization of the value function (Figure 1 of the paper illustrates the problems intuitively). This paper explores conservatism that is mild but sufficient for offline learning while not harming generalization. It proposes Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q-values. MCQ has several superior properties: 1. the MCB operator is guaranteed to behave better than the behavior policy; 2. it owns a tighter lower bound than existing policy-constraint or value-penalization methods; 3. erroneous overestimation will not occur with it.
Method: It first proposes the mildly conservative Bellman (MCB) operator and shows that no erroneous overestimation occurs with it. The basic idea of the operator is that actions lying in the support region of the behavior policy are updated with the standard Bellman backup, while OOD actions are assigned proper pseudo target values.
In practice, it is intractable to obtain the maximum action value over the support set. Thus, it fits an empirical behavior policy with supervised learning on the static dataset. The pseudo values for OOD actions are then computed by sampling actions from this empirical behavior policy and taking the maximum of their Q-values (sketched below).
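A sketch of the pseudo-target construction under assumed interfaces for the Q-network and the fitted behavior policy:

```python
import torch

def ood_pseudo_target(q_net, behavior_policy, state, num_samples=10):
    """Pseudo Q-target for an OOD action at `state`: the max Q over actions sampled
    from the fitted (empirical) behavior policy. `q_net(s, a)` and
    `behavior_policy.sample(s, n)` are assumed interfaces."""
    with torch.no_grad():
        sampled_actions = behavior_policy.sample(state, num_samples)    # (n, action_dim)
        states = state.unsqueeze(0).expand(num_samples, -1)             # (n, state_dim)
        q_vals = q_net(states, sampled_actions).squeeze(-1)             # (n,)
        return q_vals.max()

# Toy usage with stand-in components.
class ToyBehavior:
    def sample(self, state, n):
        return torch.randn(n, 2)

q_net = lambda s, a: s.sum(dim=-1, keepdim=True) - (a ** 2).sum(dim=-1, keepdim=True)
print(ood_pseudo_target(q_net, ToyBehavior(), torch.ones(4)))
```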
Bootstrapped Transformer for Offline Reinforcement Learning NeurIPS'22 Paper | Code
Key insight: The paper follows Trajectory Transformer and deals with the insufficient distribution coverage problem of the offline dataset. It proposes Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost the sequence model learning.
Method: BooT treats each input trajectory as a sequence and adds the reward-to-go to the tokens.
BooT utilizes self-generated trajectories as auxiliary data to further train the model, which is the general idea of bootstrapping. Trajectory generation resamples the last portion of a trajectory with the learned model and appends the regenerated data for training.
A Unified Framework for Alternating Offline Model Training and Policy Learning NeurIPS'22 Paper | Code
Key insight: Offline MBRL algorithms can improve the efficiency and stability of policy learning over model-free algorithms. However, in most existing offline MBRL algorithms, the learning objectives for the dynamics models and the policies are isolated from each other. This paper addresses the objective-mismatch problem by developing an iterative offline MBRL framework, which maximizes a lower bound of the true expected return by alternating between dynamics-model training and policy learning.
A Policy-Guided Imitation Approach for Offline Reinforcement Learning NeurIPS'22 Paper | Code
Key insight: Offline RL methods can be categorized into two types: RL-based and imitation-based. RL-based methods enjoy OOD generalization but suffer from the off-policy evaluation problem. Imitation-based methods avoid off-policy evaluation but are too conservative to surpass the dataset. This paper proposes an alternative approach that inherits the training stability of imitation-style methods while still allowing logical OOD generalization. It decomposes the conventional reward-maximizing policy in offline RL into a guide-policy and an execute-policy. The algorithm allows state-compositionality from the dataset (choose the state with the highest value), rather than the action-compositionality (choose the action with the highest value, which is conservative since OOD actions are needed to improve) used in prior imitation-style methods. It can also adapt to new tasks by changing the guide-policy.
Method: The job of the guide-policy is to learn the optimal next state given the current state, and the job of the execute-policy is to learn how different actions can produce different next states, given current states.
The guide-policy guides the execute-policy about which state it should go to. A state value function is trained to score states, and the guide-policy learns to predict high-value next states.
The job of the execute-policy is to have strong generalization ability. It adopts the RL-via-Supervised-Learning framework by conditioning the execute-policy on the next state proposed by the guide-policy (see the sketch below).
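A minimal sketch of the guide/execute decomposition at inference time; the MLP architectures and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class GuidePolicy(nn.Module):
    """Predicts a desirable next state from the current state (illustrative MLP)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, state_dim))
    def forward(self, s):
        return self.net(s)

class ExecutePolicy(nn.Module):
    """Outputs the action that should move the agent from s toward the target next state."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))
    def forward(self, s, s_target):
        return self.net(torch.cat([s, s_target], dim=-1))

def act(guide, execute, s):
    s_target = guide(s)              # "which state should I go to?"
    return execute(s, s_target)      # "which action gets me there?"

s = torch.randn(1, 8)
print(act(GuidePolicy(8), ExecutePolicy(8, 2), s).shape)  # (1, 2)
```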
MOReL: Model-Based Offline Reinforcement Learning arXiv'21 Paper | Code
Key insight: The paper proposes MOReL, an algorithm for model-based offline RL. The framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP.
Offline Reinforcement Learning with Implicit Q-Learning arXiv'21 Paper | [Code](https://github.com/ikostrikov/implicit_q_learning)
Key insight: It proposes a new offline RL method that never needs to evaluate actions outside the dataset but still enables the learned policy to improve substantially over the best behavior in the data through generalization. It approximates the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action, and then taking a state-conditional upper expectile to estimate the value of the best actions in that state. The algorithm alternates between fitting the upper-expectile value function and backing it up into the Q-function. The policy is then extracted via advantage-weighted behavioral cloning, which avoids querying out-of-sample actions.
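A small sketch of the expectile-regression loss that underlies the value fit (the τ value and variable names are illustrative):

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric L2 loss for expectile regression on diff = Q(s, a) - V(s).
    tau > 0.5 up-weights positive errors, so V(s) approaches an upper expectile
    of Q over the actions seen in the dataset."""
    weight = torch.abs(tau - (diff < 0).float())  # tau if diff > 0, (1 - tau) otherwise
    return (weight * diff ** 2).mean()

q_minus_v = torch.tensor([-1.0, 0.5, 2.0])
print(expectile_loss(q_minus_v, tau=0.9))
```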
Online and Offline Reinforcement Learning by Planning with a Learned Model NeurIPS'21 (Spotlight) Paper
Key insight: Alternate between the tree-search planner, which improves the policy and value, and the networks, which learn from the planner and in turn guide it. This iterative procedure helps the networks continuously learn from the offline dataset (Reanalyze).
Offline Reinforcement Learning as One Big Sequence Modeling Problem NeurIPS'21 Paper | Code
Key insight: RL is typically concerned with estimating stationary policies or single-step models. This paper views RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that lead to a sequence of high rewards. Typically, it uses a Transformer architecture to model distributions over trajectories and repurposes beam search as a planning algorithm.
Method: A trajectory is discretized dimension-wise into a sequence of tokens.
Imitation learning simply conditions on the preceding trajectory and uses beam search to find the continuation with the highest probability. Goal-conditioned RL additionally conditions on the last state as the goal. Offline RL replaces the log-probabilities with the predicted reward signal during beam search.
Model-Based Offline Planning ICLR'21 Paper
Key insight: Learn an ensemble world model and a behavior policy to guide planning. The planning result is the reward-reweighted average of sampled action sequences (sketched below).
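A sketch of the reward-reweighted action aggregation, assuming K candidate action sequences and their model-predicted returns are already available:

```python
import numpy as np

def reward_weighted_plan(action_sequences: np.ndarray, returns: np.ndarray, kappa: float = 1.0):
    """action_sequences: (K, H, action_dim) candidate rollouts from the behavior policy + model.
    returns: (K,) predicted return of each rollout under the learned (ensemble) model.
    The plan is the exponentially reward-weighted average of the candidates."""
    weights = np.exp(kappa * (returns - returns.max()))  # subtract max for numerical stability
    weights /= weights.sum()
    return (weights[:, None, None] * action_sequences).sum(axis=0)  # (H, action_dim)

plans = np.random.randn(16, 10, 2)
rets = np.random.randn(16)
print(reward_weighted_plan(plans, rets).shape)  # (10, 2)
```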
Off-Policy Deep Reinforcement Learning without Exploration ICML'19 Paper | Code
Key insight: Using a CVAE to model the state-action distribution of the offline dataset, then performing Q-learning with an approximate in-sample max over a batch of sampled actions.
Planning for Sample Efficient Imitation Learning NeurIPS'22 Paper | Code
Key insight: Imitation learning is free from many issues of reinforcement learning, such as reward design and exploration hardness. However, current IL struggles to achieve both high performance and high in-environment sample efficiency simultaneously. Behavior cloning does not need in-environment interactions, but it suffers from the covariate-shift problem, which harms its performance. Adversarial imitation learning turns imitation learning into a distribution-matching problem; it can achieve better performance on some tasks but requires a large number of in-environment interactions. This paper proposes EfficientImitate, a planning-based imitation learning method that can achieve high in-environment sample efficiency and performance simultaneously. It first extends AIL into MCTS-based RL, then shows that the two seemingly incompatible classes of imitation learning algorithms (BC and AIL) can be naturally unified under one framework.
Method: On sample efficiency in RL: one line of work finds that the reward signal is not a good data source for representation learning in RL and resorts to SSL or pretrained representations; another line of work focuses on a learned model, which is promising for sample-efficient learning (imagining additional rollouts). EI combines both.
The AIL algorithm trains a discriminator to distinguish expert transitions from the agent's, and its output serves as the reward signal (see the sketch below).
The behavior cloning function is integrated to guide the MCTS search.
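For reference, a minimal sketch of a standard discriminator-based AIL reward (the exact reward form used in the paper may differ):

```python
import torch

def ail_reward(d_logit: torch.Tensor) -> torch.Tensor:
    """d_logit: discriminator logit for a transition (higher = more expert-like).
    A common AIL reward is -log(1 - D(s, a)), which is large when the
    discriminator believes the transition came from the expert."""
    d = torch.sigmoid(d_logit)
    return -torch.log(1.0 - d + 1e-8)

print(ail_reward(torch.tensor([2.0, -2.0])))
```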
Generative Adversarial Imitation Learning arXiv'16 Paper