Note: This blog serves to promote an informal understanding of the ideas in our work. For a more formal read, check out our paper.
DILO is a powerful approach to offline imitation learning with action-free expert demonstrations. In particular, DILO presents the following advantages over existing offline imitation learning methods: (a) it learns directly from observation-only expert demonstrations, requiring no expert actions; (b) it avoids two-stage pipelines, needing neither an inverse dynamics model nor a learned discriminative reward; and (c) it only requires sampling states and next-states from the expert and offline datasets, which lets it scale seamlessly to image observations.
Typically, imitation learning (IL) has been studied in the setting where expert demonstrations are available and the agent repeatedly interacts with the environment to learn a policy that imitates the expert. But is this setting practical? Humans don't start learning from scratch; they use their past experiences in the environment as useful priors to bootstrap the learning of new skills. Moreover, expert actions are often unavailable: the agent may be learning from a dataset collected by another agent in a cross-embodiment setting, or by watching a human expert through tutorial videos.
Learning from Observations (LfO) is a well-studied problem in the RL literature. Offline LfO aims to learn a policy that imitates an expert from a given offline dataset, without any interaction with the environment. Broadly, there are two main approaches to this problem: (a) infer the expert's actions and use them for imitation learning, or (b) infer a reward function from the expert demonstrations and optimize it with reinforcement learning.
The first approach involves learning an inverse dynamics model to infer the expert actions from the observed states. The model is trained on the action-labeled offline data to predict the action that produced a transition (s, s'); it is then used to annotate the expert's action-free transitions with inferred actions, which can be imitated directly.
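A minimal sketch of this pipeline, assuming a simple MLP inverse dynamics model over (s, s') pairs (the names and architecture here are illustrative, not the exact design of any particular prior method):

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Predicts the action that produced a transition (s, s')."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, s_next], dim=-1))


def label_expert_transitions(idm: InverseDynamicsModel,
                             expert_s: torch.Tensor,
                             expert_s_next: torch.Tensor) -> torch.Tensor:
    """After fitting the model on action-labeled offline data, annotate the
    action-free expert transitions with predicted actions for behavior cloning."""
    with torch.no_grad():
        return idm(expert_s, expert_s_next)
```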
The second approach involves learning a reward model that captures the expert's preferences over states. The reward model is then used to learn a policy that imitates the expert by solving a reinforcement learning problem. Prior works learn this reward model on states only or on state next-state pairs using discriminative models. This approach is still limited: it introduces a two-stage pipeline in which a reward must first be learned and then optimized with RL, and discriminative reward models can be difficult to train reliably from purely offline data.
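For concreteness, a GAIL-style discriminator objective over (s, s') pairs might look like the sketch below (the function and tensor names are illustrative; prior works differ in the exact divergence and in how the reward is extracted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def discriminator_loss(disc: nn.Module,
                       expert_pairs: torch.Tensor,
                       offline_pairs: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over concatenated (s, s') pairs:
    expert transitions are labeled 1, offline transitions 0."""
    expert_logits = disc(expert_pairs).squeeze(-1)
    offline_logits = disc(offline_pairs).squeeze(-1)
    return (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
            + F.binary_cross_entropy_with_logits(offline_logits, torch.zeros_like(offline_logits)))

# A reward is then extracted from the trained discriminator, e.g.
# r(s, s') = log D(s, s') - log(1 - D(s, s')), and handed to a separate
# offline RL step: the two-stage pipeline described above.
```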
Our method, DILO, is a combination of three ideas. We show how to derive the DILO objective in three steps below:
Formulate the imitation learning problem as a distribution matching problem between the expert’s state next-state visitation distribution and the learned policy’s state next-state visitation distribution.
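Schematically, writing $d^{\pi}(s, s')$ and $d^{E}(s, s')$ for the agent's and expert's state next-state visitation distributions and $D_f$ for an $f$-divergence (notation assumed here for illustration), this step reads:

$$
\min_{\pi} \; D_f\!\left( d^{\pi}(s, s') \,\|\, d^{E}(s, s') \right)
$$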
This distribution matching formulation allows us to use next-states as a replacement for actions in offline imitation learning, but it still (a) requires next-actions to be known, (b) requires computing the learned policy's state next-state visitation distribution, and (c) does not allow the use of arbitrary offline datasets. We get around these issues with the following steps.
Instead of matching the agent's visitation distribution directly against the expert's, we mix both the agent's and the expert's distributions individually with the offline dataset's visitation distribution and match the resulting mixtures instead.
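One way to write this step, with $d^{O}$ denoting the offline dataset's visitation distribution and $\beta \in (0, 1]$ a mixing coefficient (notation assumed here; see the paper for the precise formulation):

$$
\min_{\pi} \; D_f\!\left( \beta\, d^{\pi}(s, s') + (1-\beta)\, d^{O}(s, s') \,\Big\|\, \beta\, d^{E}(s, s') + (1-\beta)\, d^{O}(s, s') \right)
$$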
The final step is to derive an action-free, off-policy objective for offline imitation learning. We show that the above distribution matching problem can be converted into such a dual objective using Lagrangian duality.
This completes the derivation of the DILO objective. The objective only requires sampling states and next-states from the offline dataset and the expert dataset.
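To make this data requirement concrete, here is a minimal sketch of what a single training step consumes; `value_net` and `dual_loss_fn` are placeholders standing in for the learned function and the dual objective above, not the paper's actual implementation:

```python
import torch

def dilo_style_update(value_net: torch.nn.Module,
                      optimizer: torch.optim.Optimizer,
                      expert_batch: dict,
                      offline_batch: dict,
                      dual_loss_fn) -> float:
    """One gradient step that consumes only states and next-states.

    expert_batch / offline_batch: dicts with keys "s" and "s_next" (no actions).
    dual_loss_fn: stand-in for the dual objective derived above (treated as a black box here).
    """
    loss = dual_loss_fn(
        value_net,
        expert_batch["s"], expert_batch["s_next"],
        offline_batch["s"], offline_batch["s_next"],
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```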
DILO achieves state-of-the-art performance on a variety of offline imitation learning benchmarks and scales seamlessly to image-observation settings. Below we show results on a real-world air hockey setup, where DILO is the only method that learns performant behaviors from action-free demonstrations.
If you find the work useful in your research, please consider using the following citation:
@article{sikchi2024dual,
  title={A Dual Approach to Imitation Learning from Observations with Offline Datasets},
  author={Sikchi, Harshit and Chuck, Caleb and Zhang, Amy and Niekum, Scott},
  journal={arXiv preprint arXiv:2406.08805},
  year={2024}
}