Human demonstrations as prompts are a powerful way to program robots to perform long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods either depend on robot-demonstrator paired data, which is infeasible to collect at scale, or rely too heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically aligns robot and demonstrator task executions using optimal transport costs. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent demonstrator videos by retrieving and composing short-horizon demonstrator clips. This approach enables effective policy training without paired data. We show that RHyME outperforms a range of baselines across cross-embodiment datasets, achieving a 52% increase in task recall over prior cross-embodiment learning methods.
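To make the alignment step concrete, the sketch below scores how well a demonstrator video matches a robot execution with entropy-regularised optimal transport over segment features. This is a minimal illustration, not RHyME's implementation: the encoder, segment granularity, cosine-distance cost, and regularisation value are all placeholder assumptions.

```python
# Minimal sketch (not the paper's implementation): scoring robot-demonstrator
# alignment with entropy-regularised optimal transport over segment features.
# Assumes both videos are already encoded into per-segment feature vectors.
import numpy as np


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularised OT plan between two uniform marginals."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass over robot segments
    b = np.full(m, 1.0 / m)          # uniform mass over demonstrator segments
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):         # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return np.diag(u) @ K @ np.diag(v)


def ot_alignment_cost(robot_feats, demo_feats, reg=0.1):
    """Cosine-distance cost matrix, then the OT objective <P, C>."""
    r = robot_feats / np.linalg.norm(robot_feats, axis=1, keepdims=True)
    d = demo_feats / np.linalg.norm(demo_feats, axis=1, keepdims=True)
    cost = 1.0 - r @ d.T             # pairwise cosine distance between segments
    plan = sinkhorn_plan(cost, reg)
    return float(np.sum(plan * cost))


# Hypothetical usage with random stand-ins for real segment embeddings:
# a lower cost indicates a closer semantic match to the robot trajectory.
robot_feats = np.random.randn(12, 128)   # 12 robot segments, 128-d features
demo_feats = np.random.randn(9, 128)     # 9 demonstrator segments
print(ot_alignment_cost(robot_feats, demo_feats))
```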
We compare our approach against XSkill, a baseline that clusters similar visual features to align human and robot video representations.
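The snippet below is a loose illustration of that clustering idea, not XSkill's actual training procedure: human and robot segment features (placeholders here) are pooled and assigned to a shared set of discrete clusters, and matching cluster assignments serve as the cross-embodiment alignment.

```python
# Loose illustration of feature clustering for alignment (not XSkill itself).
# Feature arrays and the cluster count are placeholder assumptions.
import numpy as np
from sklearn.cluster import KMeans

human_feats = np.random.randn(200, 128)   # human video segment embeddings
robot_feats = np.random.randn(150, 128)   # robot video segment embeddings

# Fit shared clusters over both embodiments, then assign each segment a skill ID.
kmeans = KMeans(n_clusters=16, n_init=10).fit(np.vstack([human_feats, robot_feats]))
human_skills = kmeans.predict(human_feats)   # discrete skill ID per human segment
robot_skills = kmeans.predict(robot_feats)   # discrete skill ID per robot segment
```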
RHyME instead uses retrieval: it matches segments of robot videos to the most similar human clips from unpaired play data and composes them into a synthetic paired dataset. This allows the policy to handle significant differences in how the two embodiments perform the same task.
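A minimal sketch of this retrieval step is shown below, under the assumption that short human play clips and robot demonstration segments are embedded in a shared feature space; the encoder, segmentation, and cosine-similarity retrieval are illustrative assumptions rather than RHyME's exact components.

```python
# Minimal sketch: retrieve the closest human play clip for each robot segment
# and compose the retrieved clips into a synthetic demonstrator video.
import numpy as np


def retrieve_synthetic_demo(robot_segment_feats, human_clip_feats, human_clips):
    """Nearest-neighbour retrieval by cosine similarity, then composition."""
    r = robot_segment_feats / np.linalg.norm(robot_segment_feats, axis=1, keepdims=True)
    h = human_clip_feats / np.linalg.norm(human_clip_feats, axis=1, keepdims=True)
    sims = r @ h.T                              # (num_robot_segments, num_human_clips)
    nearest = sims.argmax(axis=1)               # best-matching clip per robot segment
    return [human_clips[i] for i in nearest]    # composed "paired" demonstrator video


# Hypothetical usage with random stand-ins for real embeddings and clip files.
robot_feats = np.random.randn(4, 128)           # 4 robot sub-task segments
human_feats = np.random.randn(500, 128)         # 500 short human play clips
human_clips = [f"clip_{i:03d}.mp4" for i in range(500)]
synthetic_demo = retrieve_synthetic_demo(robot_feats, human_feats, human_clips)
```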
We present results on three cross-embodiment datasets. As the demonstrator's actions deviate further, visually and physically, from those of the robot, policies trained with RHyME consistently outperform XSkill in simulation.