RHyME enables one-shot imitation from human videos despite mismatches in embodiment and execution

Abstract

Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement style and physical capability. Existing methods either depend on paired robot-demonstrator data, which is infeasible to collect at scale, or rely too heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically aligns robot and demonstrator task executions using optimal transport costs. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent demonstrator videos by retrieving and composing short-horizon demonstrator clips. This approach enables effective policy training without the need for paired data. RHyME outperforms a range of baselines across cross-embodiment datasets, achieving a 52% increase in task recall over prior cross-embodiment learning methods.
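To make the alignment idea concrete, below is a minimal sketch (not the paper's exact implementation) of an entropy-regularized optimal transport cost between two frame-embedding sequences. The encoder producing the embeddings, the cosine cost, and the regularization value are all illustrative assumptions.

import numpy as np

def sinkhorn_ot_cost(X, Y, reg=0.05, n_iters=100):
    """Entropy-regularized OT alignment cost between two frame-embedding
    sequences X (T1, d) and Y (T2, d). Illustrative values throughout."""
    # Cosine-distance cost between every robot/demonstrator frame pair.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - Xn @ Yn.T                          # (T1, T2)

    # Uniform marginals: each frame carries equal probability mass.
    a = np.full(X.shape[0], 1.0 / X.shape[0])
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])

    # Standard Sinkhorn iterations on the Gibbs kernel K = exp(-C / reg).
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]              # soft transport plan

    return float((P * C).sum())                  # total alignment cost

A low cost means the two executions can be softly matched frame-to-frame, even when their appearance and timing differ.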

Approach Overview

RHyME Approach Overview

Real World Evaluations

We compare RHyME to XSkill for one-shot robot imitation from human videos. The key difference: XSkill trains policies using only robot data and struggles with execution mismatches, while RHyME synthesizes paired human-robot training data by matching robot videos with similar human segments drawn from unpaired data. Our approach better handles differences in embodiment and execution style, resulting in more accurate task imitation, as shown in the videos below.
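A rough sketch of this pairing step, reusing sinkhorn_ot_cost from the sketch above (the function name, the pre-segmented robot videos, and the encode callable are hypothetical stand-ins, not the released code):

def synthesize_human_video(robot_segments, human_clip_bank, encode):
    """Compose a pseudo-paired demonstrator video for one long-horizon
    robot demonstration. All names here are illustrative stand-ins."""
    composed = []
    for seg in robot_segments:                    # short robot sub-segments
        seg_emb = encode(seg)                     # (T, d) features
        # Retrieve the unpaired human clip whose execution aligns best,
        # i.e. has the lowest optimal transport cost to this segment.
        costs = [sinkhorn_ot_cost(seg_emb, encode(clip))
                 for clip in human_clip_bank]
        composed.append(human_clip_bank[int(np.argmin(costs))])
    # Concatenating the retrieved clips yields a semantically equivalent
    # demonstrator video to pair with the robot trajectory for training.
    return composed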

Real World Evaluation Results

Task Sequence #1: Move pot → Close drawer → Drop cloth

[Videos: Human Demonstration (input) · RHyME Execution (our method) · XSkill Execution (baseline)]

Key Observation: Conditioned on the same human demonstration (input), RHyME's policy completes the demonstrated sequence of tasks more accurately, as the videos above show.

Task Sequence #2: Turn on light → Move pot → Close drawer

[Videos: Human Demonstration (input) · RHyME Execution (our method) · XSkill Execution (baseline)]

Key Observation: XSkill attempts tasks that weren't specified in the human demonstration, showing that it lacks precision in following the intended sequence. RHyME maintains higher fidelity to the exact tasks demonstrated by the human.

Simulation (Franka Kitchen) Evaluations

The Franka Kitchen simulation environment lets us precisely control the level of execution mismatch. We created three sphere-agent demonstrators, each visually distinct from the robot, and defined three levels of mismatch:

  • Sphere-Easy: Different appearance but similar execution style
  • Sphere-Medium: Different appearance and different execution style
  • Sphere-Hard: Different appearance, different execution style, and different physical capabilities (bimanual vs. uni-manual)

RHyME remains robust as the mismatch level increases, outperforming XSkill most clearly in the hardest scenarios, where the sphere agents' execution differs significantly from the robot's capabilities.

Franka Kitchen Simulation Results

Key Observation: RHyME successfully imitates tasks from sphere agents despite their significant visual and execution differences from the robot, enabling more robust generalization across varying levels of embodiment mismatch.

Visualizing Human-Robot Image Embeddings

Real-World Task Embeddings

We use t-SNE to visualize cross-embodiment latent embeddings from the human and robot completing various tasks. Note how RHyME's learned representations group similar tasks together regardless of the embodiment (human or robot), showing the model's ability to recognize task semantics across different agents.
Real-world t-SNE visualization of task embeddings
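A plot of this kind can be reproduced with scikit-learn's t-SNE; in the sketch below, the embedding arrays are random placeholders standing in for real encoder outputs.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Random placeholders standing in for (N, d) encoder outputs from
# human and robot videos of the same set of tasks.
rng = np.random.default_rng(0)
human_emb = rng.normal(size=(200, 128))
robot_emb = rng.normal(size=(200, 128))

# Project the pooled embeddings to 2D; perplexity is a tunable choice.
emb = np.concatenate([human_emb, robot_emb], axis=0)
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)

# Color points by embodiment; with a well-trained encoder the clusters
# organize by task rather than by who performed it.
n_h = len(human_emb)
plt.scatter(xy[:n_h, 0], xy[:n_h, 1], s=8, label="human")
plt.scatter(xy[n_h:, 0], xy[n_h:, 1], s=8, label="robot")
plt.legend()
plt.title("Cross-embodiment task embeddings (t-SNE)")
plt.show()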

Simulation Task Embeddings Across Mismatch Levels

t-SNE visualization of task embeddings across our three simulation datasets with increasing levels of execution mismatch. The plots show how the task embeddings progressively diverge as the execution mismatch increases.
Simulation t-SNE visualization across different mismatch levels

Paper

BibTeX

@misc{kedia2024oneshotimitationmismatchedexecution,
  title={One-Shot Imitation under Mismatched Execution}, 
  author={Kushal Kedia and Prithwish Dan and Angela Chao and Maximus Adrian Pace and Sanjiban Choudhury},
  year={2024},
  eprint={2409.06615},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2409.06615}, 
}