Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose \( \texttt{X-Sim} \), a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. \( \texttt{X-Sim} \) starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, \( \texttt{X-Sim} \) introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, \( \texttt{X-Sim} \) does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes.
We propose \( \texttt{X-Sim} \), a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies.
(1) Real-to-Sim: \( \texttt{X-Sim} \) first reconstructs a photorealistic simulation from an RGBD human video and tracks object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation that learns to induce the same object motion observed in the human video.
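As a rough illustration, the per-step reward can be thought of as a negative distance between the simulated object pose and the pose tracked from the human video. The minimal sketch below assumes unit quaternions and illustrative weights, not the exact reward formulation used by \( \texttt{X-Sim} \).

```python
import numpy as np

def object_centric_reward(obj_pos, obj_quat, target_pos, target_quat,
                          pos_weight=1.0, rot_weight=0.1):
    """Dense reward that encourages the simulated object to follow the pose
    extracted from the human video at the current timestep (illustrative only).

    obj_pos / obj_quat:       current object pose in simulation
    target_pos / target_quat: object pose tracked from the RGBD human video
    """
    # Position error: Euclidean distance between current and target positions.
    pos_err = np.linalg.norm(obj_pos - target_pos)

    # Orientation error: angle between unit quaternions.
    dot = np.clip(np.abs(np.dot(obj_quat, target_quat)), 0.0, 1.0)
    rot_err = 2.0 * np.arccos(dot)

    # Negative weighted error peaks when the simulated object motion
    # matches the object motion from the human video.
    return -(pos_weight * pos_err + rot_weight * rot_err)
```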
(2) Sim-to-Real: \( \texttt{X-Sim} \) trains an RL policy with privileged state, then rolls out the policy and renders the scene under varied robot poses, object states, viewpoints, and lighting conditions to collect a synthetic dataset of image-action pairs. This visuomotor dataset is used to train a purely image-conditioned diffusion policy that can be deployed in the real world.
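The data-collection step can be pictured roughly as follows; the simulator interface (`env`, `sample_camera_pose`, `sample_lighting`) and the loop structure are hypothetical stand-ins for illustration, not \( \texttt{X-Sim} \)'s actual implementation.

```python
# A minimal sketch of distillation-data collection, assuming a hypothetical
# simulator API and a trained state-based teacher policy.
def collect_image_action_pairs(env, teacher_policy, num_episodes, max_steps,
                               sample_camera_pose, sample_lighting):
    """Roll out the privileged-state RL policy and render each step under a
    randomized camera and lighting to build an image-action dataset."""
    dataset = []
    for _ in range(num_episodes):
        state = env.reset()                    # randomized robot and object initial states
        camera = sample_camera_pose()          # vary viewpoint per episode
        lighting = sample_lighting()           # vary lighting per episode
        for _ in range(max_steps):
            action = teacher_policy(state)     # teacher acts from privileged state
            image = env.render(camera=camera, lighting=lighting)
            dataset.append((image, action))    # supervision for the image-conditioned student
            state = env.step(action)
    return dataset
```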
(3) Auto-Calibration: To improve real-world transfer, \( \texttt{X-Sim} \) introduces an online domain adaptation technique that collects real image observations from closed-loop rollouts and automatically pairs them with simulated views of the same robot trajectories, which are used to minimize the sim-to-real visual gap. Notably, this procedure does not require any robot teleoperation data.
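Conceptually, the pairing step replays each deployed robot trajectory in simulation and renders the matching view. The sketch below assumes a hypothetical simulator handle `sim`, a known `deployment_camera`, and logged joint states; it is an illustration rather than the exact procedure.

```python
# A minimal sketch of pairing real deployment images with simulated renders
# of the same robot trajectory; all names are hypothetical stand-ins.
def pair_real_and_sim_observations(real_rollout, sim, deployment_camera):
    """real_rollout: list of (real_image, robot_joint_state) tuples logged
    during a closed-loop rollout of the deployed policy."""
    pairs = []
    for real_image, joint_state in real_rollout:
        sim.set_robot_state(joint_state)                   # replay the same robot configuration
        sim_image = sim.render(camera=deployment_camera)   # render the matching simulated view
        pairs.append((real_image, sim_image))              # input pair for the calibration loss
    return pairs
```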
Real-World Performance.
We report Avg. Task Progress on 5 tasks across 2 environments and find that \( \texttt{X-Sim} \), both with and without calibration, consistently outperforms hand-tracking baselines that attempt to retarget human hand motion.
We additionally visualize the primary failure modes of the hand-tracking baselines, which share a common cause: natural human execution does not map directly onto feasible robot execution.
(a) Hand Mask: Applies a black mask over the human hand in demonstration videos to train an image-conditioned behavior cloning policy. At inference time, the robot arm is similarly masked.
(b) Object-Aware IK: Extracts hand trajectories relative to nearby objects and replays them by applying inverse kinematics (IK) to move the robot end-effector along the same path.
Sim-to-Real Calibration. \( \texttt{X-Sim} \) \((\texttt{Calibrated})\) aligns real and simulated observations online using closed-loop rollouts, overcoming visual discrepancies that remain due to imperfections in 3D reconstruction and rendering. Compared to uncalibrated \( \texttt{X-Sim} \), it better aligns image embeddings: the calibration loss keeps the policy from overfitting to domain-specific visual attributes, while the action prediction loss preserves task-relevant features.
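One way to picture this combined objective is an embedding-alignment term on paired real/sim images added to the policy's action prediction loss. The PyTorch sketch below uses illustrative loss forms, module names, and weights; an actual diffusion policy would use a denoising objective for the action term.

```python
import torch
import torch.nn.functional as F

def calibrated_loss(encoder, policy_head, sim_image, real_image, sim_action,
                    calib_weight=1.0):
    """Sketch of a joint objective: align paired real/sim embeddings while
    supervising actions on simulated data (illustrative assumptions only)."""
    z_sim = encoder(sim_image)
    z_real = encoder(real_image)

    # Calibration loss: pull paired real/sim embeddings together so the
    # encoder does not latch onto domain-specific visual attributes.
    calibration_loss = F.mse_loss(z_real, z_sim)

    # Action prediction loss: retain task-relevant features by supervising
    # actions from simulated rollouts (a denoising loss in practice).
    action_loss = F.mse_loss(policy_head(z_sim), sim_action)

    return action_loss + calib_weight * calibration_loss
```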
Data Efficiency. By using human videos and RL for robustness, \( \texttt{X-Sim} \) is more data efficient than behavior cloning from robot teleoperation, achieving comparable success on a more challenging variant of Mustard Place with 10x less time. Each human video takes 20 seconds of effort, while each robot demonstration takes 1 minute to collect.
Test-time Robustness. \( \texttt{X-Sim} \) lets us flexibly collect image-action data in simulation from multiple viewpoints (Side and Frontal), training robust policies that improve performance on seen viewpoints and also generalize to novel viewpoints.