X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

Cornell University
X-Diffusion selectively incorporates noised human actions into Diffusion Policy training

Robots and Humans Behave Differently

Adding Noise Makes Human and Robot Actions Indistinguishable

X-Diffusion Pipeline

We convert human videos into robot-aligned state-action pairs using 3D hand-pose estimation and lightweight retargeting, and reduce the visual gap with object masks and keypoint overlays.
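A minimal sketch of the retargeting step in Python, under stated assumptions: the hand-pose estimator returns 21 3D keypoints per frame in a MANO-style layout (thumb tip at joint 4, index fingertip at joint 8), and the 5 cm pinch threshold, the placeholder wrist orientation, and the function name retarget_hand_to_robot are illustrative choices rather than the paper's exact implementation.

import numpy as np

def retarget_hand_to_robot(hand_keypoints_3d: np.ndarray) -> np.ndarray:
    """Map one frame of estimated 3D hand keypoints to a robot end-effector action.

    hand_keypoints_3d: (21, 3) array of hand joints in the camera frame,
    MANO-style layout (thumb tip = joint 4, index fingertip = joint 8).
    Returns a 7-D action: [x, y, z, roll, pitch, yaw, gripper].
    """
    thumb_tip, index_tip = hand_keypoints_3d[4], hand_keypoints_3d[8]

    # End-effector position: midpoint of the pinch grasp.
    ee_pos = 0.5 * (thumb_tip + index_tip)

    # Gripper command from pinch width (open if the fingers are far apart).
    pinch_width = np.linalg.norm(thumb_tip - index_tip)
    gripper = 1.0 if pinch_width > 0.05 else 0.0  # 5 cm threshold (assumption)

    # Orientation is a placeholder here; a full pipeline would fit a wrist frame
    # from the palm keypoints and convert it to the robot's end-effector convention.
    ee_rot = np.zeros(3)

    return np.concatenate([ee_pos, ee_rot, [gripper]])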

Adding noise suppresses embodiment-specific details while preserving task intent. We use a human-vs-robot classifier to find the earliest noise level at which human actions become indistinguishable from robot actions, and treat human actions at or above that noise level as safe supervision.
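A minimal sketch of the noise-level search in Python, assuming a DDPM-style cumulative noise schedule (alphas_cumprod) and a logistic-regression classifier from scikit-learn standing in for the paper's human-vs-robot classifier; the function name smallest_indistinguishable_timestep and the chance_margin tolerance are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def smallest_indistinguishable_timestep(
    human_actions: np.ndarray,   # (N_h, D) retargeted human actions
    robot_actions: np.ndarray,   # (N_r, D) robot teleoperation actions
    alphas_cumprod: np.ndarray,  # (T,) cumulative noise schedule, abar_t
    chance_margin: float = 0.05,
    seed: int = 0,
) -> int:
    """Return the earliest diffusion timestep t* at which a human-vs-robot
    classifier performs at chance on noised actions."""
    rng = np.random.default_rng(seed)
    X0 = np.concatenate([human_actions, robot_actions])
    y = np.concatenate([np.ones(len(human_actions)), np.zeros(len(robot_actions))])

    for t, abar in enumerate(alphas_cumprod):
        # Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
        eps = rng.standard_normal(X0.shape)
        Xt = np.sqrt(abar) * X0 + np.sqrt(1.0 - abar) * eps

        # Held-out accuracy of an embodiment classifier at this noise level.
        acc = cross_val_score(LogisticRegression(max_iter=1000), Xt, y, cv=3).mean()
        if acc <= 0.5 + chance_margin:
            return t  # human actions look like robot actions from here on

    return len(alphas_cumprod) - 1  # fall back to the noisiest level

During diffusion policy training, human transitions would then contribute to the denoising loss only at timesteps at or above t*, while robot transitions supervise all timesteps.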

Key Results

1) X-Diffusion can train on all human data, whereas naive training on all human data leads to infeasible robot motions

We compare X-Diffusion against FILTERED (trained only on robot-verified human demos), NAIVE (trained naively on all human data), and ROBOT (trained on robot data only), and find consistent gains over all three on every task.

Naive vs X-Diffusion

2) X-Diffusion outperforms prior cross-embodiment learning baselines

We demonstrate empirically that X-Diffusion outperforms prior cross-embodiment learning baselines across all tasks; those baselines see either small performance gains or degraded performance when trained on all human data.

Paper

BibTeX

@unpublished{pace2026xdiffusion,
  title        = {X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations},
  author       = {Pace, Maximus A. and Dan, Prithwish and Ning, Chuanruo and Bhardwaj, Atiksh and Du, Audrey and Duan, Edward W. and Ma, Wei-Chiu and Kedia, Kushal},
  note         = {Manuscript under review},
  year         = {2026}
}