A Dual Approach to Imitation Learning from Observations with Offline Datasets

UT Austin, Meta AI, UMass Amherst

Equal Contribution

Conference on Robot Learning (CoRL) 2024




Note: This blog serves to promote an informal understanding of the ideas in our work. For a more formal treatment, check out our paper.

TL;DR

[Figure: DILO teaser]

DILO is a powerful approach to offline imitation learning from action-free expert demonstrations. In particular, DILO presents the following advantages over existing offline imitation learning methods: (a) it learns directly from action-free expert demonstrations, without inferring expert actions through an inverse dynamics model; (b) it avoids learning a separate reward model or discriminator followed by a reinforcement learning step, optimizing a single objective instead; (c) it can leverage arbitrary suboptimal offline datasets; and (d) it scales seamlessly to image observations.

Stepping towards a more practical setting for imitation learning

[Figure: DILO teaser]

Typically, imitation learning (IL) has been studied in a setting where expert demonstrations (with actions) are available and the agent repeatedly interacts with the environment to learn a policy that imitates the expert. But is this setting practical? Humans don't start learning from scratch; they use their past experience in the environment as a prior to bootstrap the learning of new skills. Moreover, expert actions are often unavailable: the agent may be learning from a dataset collected by another agent in a cross-embodiment setting, or by watching a human expert through tutorial videos.

The landscape of offline imitation learning from observations (Offline LfO)

Learning from observations (LfO) is a well-studied problem in the RL literature. Offline LfO aims to learn a policy that imitates an expert from a given offline dataset, without interacting with the environment. There are two main approaches to this problem: (a) infer the expert's actions and use them for imitation learning, or (b) infer a reward function from the expert demonstrations and use it to train a policy with reinforcement learning.

Approach (a): Infer expert actions

The first approach involves learning an inverse dynamics model to infer the expert's actions from the observed states. The method learns $f_\theta: \mathcal{S} \times \mathcal{S} \rightarrow \mathcal{A}$ by regressing the logged action, $a_t = f_\theta(s_t, s_{t+1})$, on the action-labeled offline dataset, and then uses the inferred actions to imitate the expert. This approach is intuitive and easy to implement, but it suffers from the following drawbacks: (a) the inverse dynamics model is only accurate on transitions covered by the offline dataset, so expert transitions outside that coverage can be labeled with arbitrarily wrong actions; and (b) errors in the inferred actions propagate into the downstream behavior cloning step and compound at deployment time.
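For concreteness, here is a minimal sketch of this action-inference pipeline, assuming a hypothetical action-labeled offline dataset and simple MLPs; the network sizes, dataset handles, and dimensions are our own illustration, not the exact setup of any prior method.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # assumed dimensions for illustration

# Inverse dynamics model f_theta: (s_t, s_{t+1}) -> a_t,
# trained on the action-labeled offline dataset.
inv_dyn = nn.Sequential(
    nn.Linear(2 * obs_dim, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(inv_dyn.parameters(), lr=3e-4)

def idm_loss(s, s_next, a):
    """Regress the logged offline action from consecutive states."""
    pred_a = inv_dyn(torch.cat([s, s_next], dim=-1))
    return ((pred_a - a) ** 2).mean()

def pseudo_label_expert(expert_s, expert_s_next):
    """Infer actions for action-free expert transitions; these pseudo-labels
    are then used for ordinary behavior cloning on the expert states."""
    with torch.no_grad():
        return inv_dyn(torch.cat([expert_s, expert_s_next], dim=-1))
```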

Approach (b): Infer a reward function

This approach involves learning a reward model that captures the expert's preferences over states. The reward model is then used to learn a policy that imitates the expert by solving a reinforcement learning problem. Prior works learn this reward model over states or state next-state pairs using discriminative models. This approach is still limited by: (a) the need to train a separate discriminator, which is error-prone when only fixed offline data is available; and (b) a two-stage pipeline in which errors in the learned reward propagate into the downstream reinforcement learning step.
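A sketch of the discriminator-based variant over state next-state pairs, under the same illustrative assumptions; the `-log(1 - D)` reward shaping is one common choice, and the resulting reward would be handed to an offline RL algorithm downstream.

```python
import torch
import torch.nn as nn

obs_dim = 17  # assumed state dimensionality

# Discriminator D(s, s'): expert transitions vs. offline transitions.
disc = nn.Sequential(
    nn.Linear(2 * obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
bce = nn.BCEWithLogitsLoss()

def disc_loss(expert_pairs, offline_pairs):
    """expert_pairs / offline_pairs: batches of concatenated (s, s') tensors."""
    logits_e = disc(expert_pairs)
    logits_o = disc(offline_pairs)
    return bce(logits_e, torch.ones_like(logits_e)) + \
           bce(logits_o, torch.zeros_like(logits_o))

def learned_reward(pairs):
    """Reward derived from the discriminator, used by a downstream RL step."""
    with torch.no_grad():
        d = torch.sigmoid(disc(pairs))
    return -torch.log(1.0 - d + 1e-8)
```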

DILO: Dual Imitation Learning from Observations

Our method, DILO, is a combination of three ideas. We show below how to derive the DILO objective in three steps:

Step 1: Next-states leak information about expert actions

Formulate imitation learning as a distribution matching problem between the learned policy's and the expert's visitation distributions over states, next-states, and next-actions $(s, s', a')$, with the next-state effectively standing in for the unobserved current action:

$$\min_\pi\; D_f\!\left(d^\pi(s, s', a') \,\|\, d^E(s, s', a')\right),$$

This distribution matching problem allows us to use next-states as a replacement for actions in offline imitation learning, but it still has three issues: (a) the next-actions $a'$ must be known, (b) it requires computing the learned policy's state next-state visitation distribution, and (c) it does not allow the use of arbitrary offline datasets. We get around these issues with the following steps.

Step 2: Mixture distribution matching to leverage arbitrary offline datasets

Instead of matching the agent's visitation distribution to the expert's directly, we mix both the agent's and the expert's distributions individually with the offline dataset distribution $\rho$ and match the resulting mixture distributions. The mixture objective is still minimized exactly when the agent's distribution matches the expert's, so it remains a principled imitation learning objective while allowing arbitrary offline data to enter the optimization.

$$\max_{d \geq 0}\; -D_f\!\left(\mathrm{Mix}_\beta(d, \rho)\,\|\,\mathrm{Mix}_\beta(d^E, \rho)\right) \quad \text{s.t.}\quad \sum_{a'} d(s, s', a') = (1-\gamma)\,\tilde d_0(s, s') + \gamma \sum_{(\bar{s}, \bar{a}) \in \mathcal{S}\times\mathcal{A}} d(\bar{s}, s, \bar{a})\, p(s' \mid s, \bar{a}), \quad \forall\, (s, s') \in \mathcal{S}\times\mathcal{S}.$$
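In practice, an expectation under the mixture $\mathrm{Mix}_\beta(d, \rho) = \beta\, d + (1 - \beta)\, \rho$ is just a convex combination of expectations under the two components, so no special sampling machinery is needed; a tiny sketch with assumed per-sample minibatches:

```python
import torch

def mixture_expectation(g_on_d: torch.Tensor, g_on_rho: torch.Tensor, beta: float):
    """Estimate E_{Mix_beta(d, rho)}[g] = beta * E_d[g] + (1 - beta) * E_rho[g].

    g_on_d / g_on_rho: per-sample values of g evaluated on minibatches drawn
    from d (e.g. expert transitions) and from the offline distribution rho.
    """
    return beta * g_on_d.mean() + (1.0 - beta) * g_on_rho.mean()
```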

Step 3: Action-free Off-Policy dual objective for LfO

The final step is to derive an action-free, off-policy objective for offline imitation learning. We show that the above distribution matching problem can be converted into the following dual objective using Lagrangian duality:

$$\text{DILO:}\quad \min_V\; \beta(1-\gamma)\,\mathbb{E}_{\tilde d_0}\!\left[V(s, s')\right] \;+\; \mathbb{E}_{(s, s') \sim \mathrm{Mix}_\beta(\tilde d^E, \rho)}\!\left[f^{\star}_{p}\!\left(\gamma\, \mathbb{E}_{s'' \sim p(\cdot \mid s', a')}\!\left[V(s', s'')\right] - V(s, s')\right)\right] \;-\; (1-\beta)\,\mathbb{E}_{(s, s') \sim \rho}\!\left[\gamma\, \mathbb{E}_{s'' \sim p(\cdot \mid s', a')}\!\left[V(s', s'')\right] - V(s, s')\right]$$

This completes the derivation of the DILO objective. The objective only requires sampling states and next-states from the offline dataset and the expert dataset. $V(s, s')$ represents the long-term divergence of the agent from the expert if it decides to move from state $s$ to state $s'$. Thus $V(s, s')$ can be used to extract the policy using weighted regression.
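To make this concrete, here is a heavily simplified sketch of how the DILO objective could be estimated from minibatches of consecutive state triples $(s, s', s'')$ drawn from the expert and offline datasets. The network size, the choice of $f^{\star}_{p}$ (here the conjugate of the Pearson $\chi^2$ divergence, as one possible instantiation), and the batch handling are our own assumptions; the paper's exact estimator may differ.

```python
import torch
import torch.nn as nn

obs_dim, gamma, beta = 17, 0.99, 0.5  # assumed dimensions and hyperparameters

# V(s, s'): long-term divergence from the expert when moving from s to s'.
V = nn.Sequential(nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))

def f_star(x):
    # Conjugate of the Pearson chi^2 divergence, f(t) = (t - 1)^2;
    # a placeholder for the paper's f*_p.
    return x + 0.25 * x ** 2

def v_of(s, s_next):
    return V(torch.cat([s, s_next], dim=-1)).squeeze(-1)

def dilo_loss(init_s, init_s1, exp_s, exp_s1, exp_s2, off_s, off_s1, off_s2):
    """Minibatch estimate of the DILO objective.

    init_*: initial (s_0, s_1) pairs; exp_* / off_*: consecutive (s, s', s'')
    triples from the expert and offline datasets, respectively.
    """
    initial_term = beta * (1 - gamma) * v_of(init_s, init_s1).mean()

    # Bellman-like residual gamma * V(s', s'') - V(s, s') on both datasets;
    # the sampled s'' replaces the inner expectation over the transition model.
    res_exp = gamma * v_of(exp_s1, exp_s2) - v_of(exp_s, exp_s1)
    res_off = gamma * v_of(off_s1, off_s2) - v_of(off_s, off_s1)

    # Mixture expectation: weight beta on expert data, (1 - beta) on offline data.
    mix_term = beta * f_star(res_exp).mean() + (1 - beta) * f_star(res_off).mean()
    linear_term = (1 - beta) * res_off.mean()
    return initial_term + mix_term - linear_term
```

Policy extraction would then amount to weighted regression on the offline data, with per-sample weights derived from $V(s, s')$; see the paper for the exact weighting.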

DILO in action

DILO achieves state-of-the-art performance on a variety of offline imitation learning benchmarks and scales seamlessly to image-observation settings. Below we show results on a real-world air hockey setup, where DILO is the only method that learns performant behaviors from action-free demonstrations.

Citation

If you find the work useful in your research, please consider using the following citation:

@article{sikchi2024dual,
  title={A Dual Approach to Imitation Learning from Observations with Offline Datasets},
  author={Sikchi, Harshit and Chuck, Caleb and Zhang, Amy and Niekum, Scott},
  journal={arXiv preprint arXiv:2406.08805},
  year={2024}
}