Note: This blog serves to promote an informal understanding of the ideas in our work. For a more formal read, check out our paper.
We propose rank-game, a new framework that treats imitation learning as a two-player ranking game between a policy agent and a reward agent.
Why do we need a new framework?
Expert data is:
✔ Very informative of the desired behavior
✘ Often difficult to obtain
✘ Unable, even in the limit of infinite data, to tell us whether one suboptimal behavior is better than another
On the other hand, preferences are:
✔ Easier to obtain
✘ Required in large numbers to infer a reward function
The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism to incorporate learning from offline preferences, and vice versa. Our ranking-game framework uses a novel ranking loss, yielding an algorithm that can learn simultaneously from expert demonstrations and preferences and thereby gains the advantages of both modalities.
Imitation learning (IL) involves learning from expert-demonstrated behaviors. Inverse reinforcement learning (IRL) is a method commonly used for IL and often provides the strongest guarantees of matching expert behavior. The demonstrated behaviors can be specified as sequences of observations and actions, termed the Learning from Demonstrations (LfD) setting, or as observation-only sequences, termed the Learning from Observations (LfO) setting. IL lets us bypass reward specification, which is in general a hard task that often results in agents misaligned with human intention. While IL has mostly been limited to learning from expert demonstrations, a parallel line of work has explored using preferences between suboptimal trajectories for reward inference.
The IRL objective misses a key point: incorporating suboptimal data. Why do we care about suboptimal data? Expert data is very informative but usually hard to obtain (e.g., obtaining running behavior from a dog). Moreover, in the LfO setting, where expert actions are missing, IRL faces an exploration difficulty in selecting actions that induce the expert's observations. There are usually multiple reward hypotheses consistent with the expert behaviors (IRL is an underspecified problem), and this can result in reward functions that imitate the expert but do not capture their true intentions in parts of the state space the expert hasn't visited.
Preferences over suboptimal data can ease these burdens:
a. They can resolve reward ambiguity
b. They can guide policy optimization by learning a shaped reward
c. They are easy to obtain!
Our work creates a unified algorithmic framework for IRL that incorporates both expert and suboptimal information for imitation learning.
The policy agent maximizes the reward function learned by the reward agent, while the reward agent learns a reward that satisfies a set of pairwise behavior rankings in a ranking dataset.
The classical IRL objective is a special case of this ranking game and can be written as a nested optimization over a policy and a reward function.
The inner optimization learns a reward function under which the return gap between the expert's behavior and the current policy's behavior is maximized. In this case, the ranking dataset effectively contains a single preference, that the expert is better than the current agent, and the ranking loss used is what we term the supremum loss (it maximizes the performance gap).
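To make this concrete, here is a minimal PyTorch sketch of a reward update under the supremum loss. The `RewardNet` architecture, the observation-only inputs, and the batched trajectory layout are assumptions of this sketch rather than the paper's exact implementation.

```python
# Minimal sketch (assumed architecture and shapes, not the paper's exact code):
# the reward is trained so that expert trajectories obtain higher return than
# the current agent's trajectories, i.e., the return gap is maximized.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):               # obs: (batch, T, obs_dim)
        return self.net(obs).squeeze(-1)  # per-step rewards: (batch, T)

def supremum_loss(reward_net, agent_obs, expert_obs):
    """Negated return gap; minimizing it maximizes J(expert) - J(agent)."""
    agent_return = reward_net(agent_obs).sum(dim=-1).mean()
    expert_return = reward_net(expert_obs).sum(dim=-1).mean()
    return agent_return - expert_return
```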
Now, when the ranking dataset contains only offline preferences between suboptimal trajectories and the ranking loss is set to be the Luce-Shepard loss, we recover the standard preference-based reward learning setting.
Our proposed framework requires a loss function that ensures the rankings are satisfied. A large class of loss functions accomplishes this, including Luce-Shepard, Lovász-Bregman divergences, and the supremum loss mentioned earlier. While we can use any loss function that learns a reward function satisfying the rankings in the dataset, we would like one with nice properties. To that end, we propose a new ranking loss.
Simply put, the less-preferred behavior is regressed to a return of 0 and the more-preferred behavior is regressed to a return equal to a user-defined parameter k.
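For comparison, here is a minimal sketch of both the Luce-Shepard loss and the regression-style ranking loss just described, written against the `RewardNet` interface from the sketch above. The default k = 10 and the per-trajectory return summation are illustrative assumptions.

```python
import torch.nn.functional as F

def luce_shepard_loss(reward_net, lower_obs, higher_obs):
    # Bradley-Terry / Luce-Shepard style loss: cross-entropy that the
    # higher-ranked trajectory wins the pairwise comparison.
    r_lower = reward_net(lower_obs).sum(dim=-1)
    r_higher = reward_net(higher_obs).sum(dim=-1)
    return -F.logsigmoid(r_higher - r_lower).mean()

def ranking_regression_loss(reward_net, lower_obs, higher_obs, k=10.0):
    # Regression-style ranking loss: push the return of the less-preferred
    # behavior toward 0 and the more-preferred one toward k.
    r_lower = reward_net(lower_obs).sum(dim=-1)
    r_higher = reward_net(higher_obs).sum(dim=-1)
    return r_lower.pow(2).mean() + (r_higher - k).pow(2).mean()
```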
As we have seen, this ranking loss can incorporate rankings from arbitrary sources.
It turns out we can significantly improve learning efficiency for imitation by applying a dataset augmentation procedure to the ranking dataset.
This strategy of dataset augmentation is similar to Mixup regularization commonly used in supervised learning to increase generalization and adversarial robustness. Note that here Mixup is performed in trajectory space.
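Below is a minimal sketch of what such trajectory-space Mixup might look like, assuming equal-length observation trajectories: convex combinations of a ranked pair are regressed to correspondingly interpolated return targets between 0 and k. The Beta(alpha, alpha) mixing distribution is an assumption of this sketch, not necessarily the paper's exact recipe.

```python
import torch

def mixup_rankings(lower_obs, higher_obs, k=10.0, alpha=1.0, num_aug=8):
    """lower_obs, higher_obs: (T, obs_dim) tensors for one ranked pair."""
    augmented = []
    beta = torch.distributions.Beta(alpha, alpha)
    for _ in range(num_aug):
        lam = beta.sample()
        # Interpolate the two behaviors in trajectory (observation) space ...
        mixed_obs = lam * higher_obs + (1.0 - lam) * lower_obs
        # ... and interpolate the return target between 0 and k accordingly.
        target_return = lam * k
        augmented.append((mixed_obs, target_return))
    return augmented
```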
In situations where we have access to additional rankings, provided by humans or obtained from offline reward-annotated datasets, we can simply add them to our ranking dataset! This can significantly help imitation by shaping the reward function, guiding policy optimization, and easing exploration. We find, consistent with prior work, that the LfO setting presents exploration challenges that make it hard for any LfO method to solve complex manipulation tasks. We show in our work how a handful of offline rankings can help solve such tasks.
We optimize the ranking game using a Stackelberg formulation, in which one agent is designated the leader and the other the follower. This gives rise to two algorithm variants:
Policy as Leader (PAL): the policy agent acts as the leader and the reward agent as the follower.
Reward as Leader (RAL): the reward agent acts as the leader and the policy agent as the follower.
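A high-level sketch of the resulting alternation is shown below, with `policy_step`, `reward_step`, and `collect_rankings` standing in for a policy optimizer (e.g., SAC), a reward update with the ranking loss, and rollout/ranking collection. Giving the follower more update steps per outer iteration is a common practical approximation to a Stackelberg game and an assumption of this sketch.

```python
def rank_game(policy_step, reward_step, collect_rankings,
              leader="policy", outer_iters=1000, follower_steps=10):
    """Alternating optimization of the ranking game (illustrative only)."""
    for _ in range(outer_iters):
        rankings = collect_rankings()      # agent-vs-expert (+ any offline) rankings
        if leader == "policy":             # PAL: reward agent is the follower
            for _ in range(follower_steps):
                reward_step(rankings)
            policy_step()
        else:                              # RAL: policy agent is the follower
            reward_step(rankings)
            for _ in range(follower_steps):
                policy_step()
```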
We investigate the performance of rank-game on MuJoCo locomotion and manipulation benchmarks in both the LfD and LfO settings.
Our experiments reveal that none of the prior LfO methods are able to solve complex manipulation tasks such as door opening with a parallel-jaw gripper and pen manipulation with a dexterous Adroit hand. This failure can potentially be attributed to the greater exploration requirements of LfO compared to LfD, coupled with the fact that successes in these tasks are rarely observed, which leads to poorly guided policy gradients.
In this setting, we show that using only a handful of offline-annotated preferences in the rank-game framework can allow us to solve these tasks.
PAL learns a reward function that is consistent with the ranking: current-agent ≺ expert, using only the most recent policy's behavior, whereas RAL keeps its reward consistent with this ranking for the behaviors of all policies seen during training.
We observe that PAL adapts faster if the intent of the demonstrator changes, and RAL adapts faster if the dynamics of the environment change.
Equipping agents to learn from the different sources of information present in the world is a promising direction towards truly intelligent agents. Our framework casts imitation as a two-player ranking game, unifying previous approaches to learning from expert demonstrations and from suboptimal behaviors.
Preferences obtained in the real world are usually noisy, and one limitation of our method is that it does not explicitly account for such noise; extending the framework to handle noisy rankings is an interesting direction for future work.
If you find the work useful in your research, please consider using the following citation:
@article{sikchi2022ranking,
  title={A Ranking Game for Imitation Learning},
  author={Sikchi, Harshit and Saran, Akanksha and Goo, Wonjoon and Niekum, Scott},
  journal={arXiv preprint arXiv:2202.03481},
  year={2022}
}