Reinforcement learning (RL) is tasked with finding a policy that maximizes cumulative reward. Off-policy algorithms form an important part of the RL toolbox, as they are by definition able to leverage data from arbitrary policies in order to optimize the current policy. In practice, however, most off-policy deep RL algorithms suffer from issues like training instability and overestimation.
Experiment: SAC fails when a little expert data is added to replay buffer
We run a simple experiment to verify a phenomenon already known in the literature. We take SAC, a stable, well-known, go-to method for off-policy deep RL, and before the start of training we add a few expert trajectories to the replay buffer. A good off-policy agent should be able to leverage the information provided by the expert trajectories to direct exploration and learn faster. Instead, we see that SAC's learning crashes (even with specialized fixes). This is likely due to the deadly triad of learning off-policy with function approximation and bootstrapping. Sneak peek: Dual-RL algorithms are able to leverage the expert data to actually learn faster.
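For concreteness, here is a minimal sketch of the setup. `SACAgent` and `load_expert_trajectories` are hypothetical stand-ins for a standard SAC implementation and a demonstration loader, not the exact code used for this experiment:

```python
# Sketch: pre-seed an off-policy replay buffer with expert trajectories,
# then train SAC as usual. SACAgent and load_expert_trajectories are
# hypothetical placeholders; the buffer is a minimal uniform replay buffer.
import random
from collections import deque

import gymnasium as gym


class ReplayBuffer:
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):  # transition = (s, a, r, s_next, done)
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)


env = gym.make("HalfCheetah-v4")
agent = SACAgent(env.observation_space, env.action_space)  # hypothetical SAC implementation
buffer = ReplayBuffer(capacity=1_000_000)

# 1) Add a handful of expert trajectories before any training happens.
for traj in load_expert_trajectories("expert_halfcheetah.pkl"):  # hypothetical loader
    for transition in traj:
        buffer.add(transition)

# 2) Standard off-policy SAC training on top of the pre-seeded buffer.
s, _ = env.reset()
for step in range(1_000_000):
    a = agent.act(s)
    s_next, r, terminated, truncated, _ = env.step(a)
    buffer.add((s, a, r, s_next, terminated))
    agent.update(buffer.sample(256))  # off-policy TD updates on mixed data
    s = env.reset()[0] if (terminated or truncated) else s_next
```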
A number of performant off-policy methods have emerged recently (CQL, ATAC, XQL, IQ-Learn, etc.) that attribute their success to various different aspects of the algorithm and have widely differing mathematical derivations. We ask ourselves in this work: Is there a common thread that unifies these performant algorithms?
Our work not only answers this question in the affirmative, but also provides a way to unify reinforcement learning and imitation learning algorithms, and offers new insights that can make these algorithms even better!
In this work, we rely on tools from convex duality, in particular the convex (Fenchel) conjugate of a function $f$, defined as $f^*(y) = \max_x \langle x, y \rangle - f(x)$.
We extend this definition and present a convex conjugate with a positivity constraint:
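$$f^*_p(y) = \max_{x \geq 0}\; \langle x, y \rangle - f(x),$$

i.e., the same maximization restricted to the nonnegative orthant (the subscript $p$ marks the positivity constraint).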
For readability, we also use the two backup operators below later in the blog:
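$$(\mathcal{T}^\pi Q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}\big[Q(s',a')\big]$$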
and
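$$(\mathcal{T} V)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[V(s')\big],$$

where $p$ denotes the transition dynamics, $r$ the reward function, and $\gamma$ the discount factor.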
Essentially, the Q backup is the reward plus the expected next-state Q-value, and the V backup is the reward plus the expected next-state value.
Visitation distributions provide an easy way to think about the RL objective. A visitation distribution is a distribution over the state-action space specifying the average (discounted) probability that a policy will reach a state and take an action when starting from the initial state distribution. Visitation distributions are denoted by $d^\pi(s,a)$.
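Formally, for a policy $\pi$ acting from the initial state distribution $\rho_0$ under dynamics $p$:

$$d^\pi(s,a) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \Pr\big(s_t = s,\, a_t = a \;\big|\; s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)\big).$$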
A general objective for regularized reinforcement learning can be written in terms of maximizing the expected reward under the visitation distribution while being regularized toward some offline visitation distribution, denoted by $d^O$:
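$$\max_{\pi}\; \mathbb{E}_{(s,a) \sim d^\pi}\big[r(s,a)\big] \;-\; \alpha\, D_f\big(d^\pi \,\|\, d^O\big),$$

where $D_f$ is an $f$-divergence and $\alpha > 0$ is a regularization temperature.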
Searching over the space of visitation distributions is hard, because not all visitation distributions are achievable in the environment. A policy acting in the world has to respect the world's dynamics! Well, let's make the problem a little more formal by specifying the exact optimization problem we wish to solve:
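$$\max_{\pi,\; d \geq 0}\; \mathbb{E}_{(s,a) \sim d}\big[r(s,a)\big] - \alpha\, D_f\big(d \,\|\, d^O\big)$$
$$\text{s.t.}\quad d(s,a) = (1-\gamma)\,\rho_0(s)\,\pi(a \mid s) + \gamma\, \pi(a \mid s) \sum_{s',a'} p(s \mid s',a')\, d(s',a') \quad \forall\, s,a.$$

The state-action-level constraint makes this the primal-Q problem referred to below.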
All we have done above is add a constraint, popularly known as the Bellman flow constraint, which specifies the condition a visitation distribution must satisfy to be valid under the environment dynamics and initial state distribution. It is actually possible to make the constraint depend only on state-visitation distributions without changing the optimal solution:
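$$\max_{d \geq 0}\; \mathbb{E}_{(s,a) \sim d}\big[r(s,a)\big] - \alpha\, D_f\big(d \,\|\, d^O\big)$$
$$\text{s.t.}\quad \sum_{a} d(s,a) = (1-\gamma)\,\rho_0(s) + \gamma \sum_{s',a'} p(s \mid s',a')\, d(s',a') \quad \forall\, s.$$

The state-level constraint gives the primal-V problem, with the policy recovered from the optimized visitation.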
Solving the RL problem in primal form is hard, as the constraints have to be accounted for. Convex optimization provides us with tools to construct a dual problem for these objectives that shares the same optimal solution and is unconstrained.
Our primal problems are convex with linear constraints. This means we can compute their Lagrangian duals and check whether these make the optimization easier. Taking the Lagrangian dual and using the convex conjugate definition, we get the following two duals, corresponding to primal-Q and primal-V.
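Suppressing the regularization temperature $\alpha$ for readability, the dual of primal-Q (dual-Q) takes, schematically, the form

$$\max_{\pi}\, \min_{Q}\; (1-\gamma)\, \mathbb{E}_{s \sim \rho_0,\, a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big] \;+\; \mathbb{E}_{(s,a) \sim d^O}\Big[f^*_p\big(\mathcal{T}^\pi Q(s,a) - Q(s,a)\big)\Big],$$

where the Lagrange multipliers of the state-action flow constraints act as a Q-function,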
and the other dual:
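$$\min_{V}\; (1-\gamma)\, \mathbb{E}_{s \sim \rho_0}\big[V(s)\big] \;+\; \mathbb{E}_{(s,a) \sim d^O}\Big[f^*_p\big(\mathcal{T} V(s,a) - V(s)\big)\Big],$$

where the multipliers of the state-level constraints act as a value function (again shown with the temperature suppressed).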
Indeed, the duals look easier to optimize. They are unconstrained and naturally off-policy, as they require only sampling from the offline distribution $d^O$ (and the initial state distribution $\rho_0$).
Dual RL helps us get a clear understanding of the offline RL algorithmic landscape. These approaches have been developed using widely differing derivations, and we show in the paper that they are essentially the same approach under the lens of dual RL. Our hope is that this unification allows researchers to analyze a number of RL algorithms with a common foundation, and practitioners to carry over design choices across methods that are instantiated under the same setup.
The common idea in offline RL is to penalize the policy for straying from the offline dataset distribution. This is enforced by a policy constraint or a state-action visitation constraint. More recently, a number of approaches have emerged that use the idea of a pessimistic value function (CQL, ATAC, etc.) or implicit maximization (IQL, XQL, etc.) to develop more performant offline RL methods. Pessimistic value functions aim to achieve a lower bound on the true value function, and implicit maximization uses only observed samples to compute the maximum. It remains unclear how these two newer classes of methods relate to the popular policy/visitation constraints in offline RL.
Dual RL helps us connect the dots by showing that pessimistic value methods are the dual-Q variant of the policy/visitation-matching primal, and implicit maximization methods are the dual-V variant of policy/visitation-matching methods.
Extreme Q-learning (XQL), an implicit-maximization-based offline RL method, was derived using the hypothesis that Bellman errors are Gumbel distributed. In fact, XQL can simply be seen as a special case of dual-V with a particular choice of $f$-divergence.
Interestingly, from the perspective of dual RL, there is not really any difference between imitation learning and reinforcement learning. Simply setting the reward function to zero gives us an imitation learning objective, and all the previously derived methods follow again with the reward set to zero. We can see that some recent algorithms shown in the table on the right (notably implicit behavior cloning and IQ-Learn) are just dual methods in disguise.
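Concretely, setting $r \equiv 0$ and taking the regularization target to be the expert visitation $d^E$, the regularized objective from before collapses to pure distribution matching:

$$\min_{\pi}\; D_f\big(d^\pi \,\|\, d^E\big).$$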
A limitation of existing imitation methods is their inability to handle off-policy data. We should be able to handle the more practical situation where we have a large amount of possibly suboptimal data alongside a small number of expert demonstrations.
We propose to use the following mixture distribution matching objective.
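Schematically, with $d^E$ the expert visitation, $d^S$ the visitation of the (possibly suboptimal) off-policy data, and $\beta \in (0,1)$ a fixed mixing coefficient, the agent's and the expert's visitations are each mixed with the suboptimal data before being matched:

$$\min_{\pi}\; D_f\Big(\beta\, d^\pi + (1-\beta)\, d^S \;\Big\|\; \beta\, d^E + (1-\beta)\, d^S\Big).$$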
It is a valid imitation learning objective, as it is minimized only when the visitation of the agent matches that of the expert. The mixture matching objective allows us to leverage off-policy data in imitation learning. Taking the dual of the above mixture matching objective, exactly as we did for dual-Q and dual-V, gives us the ReCOIL objective, where the agent mixture $\beta d^\pi + (1-\beta) d^S$ plays the role of the visitation distribution and the expert mixture $\beta d^E + (1-\beta) d^S$ plays the role of the offline distribution.
We don't require any restrictive assumptions on coverage, lower bounds, or even a discriminator to solve the imitation learning objective. Although the objective might appear complex, the intuition becomes clear when the conjugate $f^*_p$ is expanded for a particular choice of $f$-divergence.
The above objective learns a Q-function that ensures expert state-action pairs have a higher score than other state-action pairs, while the Q-function is simultaneously trained to minimize the Bellman error with a reward of zero.
ReCOIL outperforms baselines by a wide margin, as shown in the figure above. This effect is particularly pronounced when the expert coverage is low or when the state is high-dimensional and methods with discriminators easily overfit. The figure below demonstrates how the previous baseline suffers from compounding errors and cannot recover in practice when trying to solve the task.
Our paper unifies many more popular algorithms from the RL literature (as shown in the table below) and creates a solid foundation for understanding and developing new methods in RL. Check out our paper for more information.
If you find the work useful in your research, please consider using the following citation:
@article{sikchi2023dual,
  title={Dual rl: Unification and new methods for reinforcement and imitation learning},
  author={Sikchi, Harshit and Zheng, Qinqing and Zhang, Amy and Niekum, Scott},
  journal={arXiv preprint arXiv:2302.08560},
  year={2023}
}