
We convert human videos into robot-aligned state-action pairs using 3D hand-pose estimation and lightweight retargeting, and reduce the visual gap with object masks and keypoint overlays.
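As a rough sketch of what this retargeting step could look like (not the paper's exact pipeline), the snippet below maps per-frame 3D hand keypoints to an end-effector pose and a binary gripper command; the keypoint indices, pinch threshold, and output format are illustrative assumptions.

```python
import numpy as np

# Assumed keypoint indices in a 21-point hand model (e.g., MediaPipe-style ordering).
WRIST, THUMB_TIP, INDEX_TIP, MIDDLE_MCP = 0, 4, 8, 9

def hand_to_gripper_action(keypoints_3d: np.ndarray, close_thresh: float = 0.03) -> dict:
    """Map one frame of 3D hand keypoints (21, 3) to a simple end-effector action:
    position, a rough orientation, and a binary gripper command from the pinch distance."""
    wrist = keypoints_3d[WRIST]
    thumb = keypoints_3d[THUMB_TIP]
    index = keypoints_3d[INDEX_TIP]
    middle_mcp = keypoints_3d[MIDDLE_MCP]

    # End-effector position: midpoint of the thumb-index pinch.
    position = 0.5 * (thumb + index)

    # Rough orientation: approach axis from wrist toward the palm, lateral axis
    # from thumb to index, orthonormalized into a right-handed rotation matrix.
    approach = middle_mcp - wrist
    approach /= np.linalg.norm(approach) + 1e-8
    lateral = index - thumb
    lateral -= approach * np.dot(lateral, approach)
    lateral /= np.linalg.norm(lateral) + 1e-8
    normal = np.cross(approach, lateral)
    rotation = np.stack([lateral, normal, approach], axis=1)  # columns are the gripper axes

    # Gripper command: closed if the pinch distance falls below a threshold (meters).
    gripper_closed = float(np.linalg.norm(thumb - index) < close_thresh)
    return {"position": position, "rotation": rotation, "gripper": gripper_closed}

def video_to_state_action_pairs(hand_poses: np.ndarray) -> list:
    """Turn per-frame hand keypoints (T, 21, 3) into (state, next-action) pairs,
    treating the current retargeted pose as the state and the next one as the action."""
    poses = [hand_to_gripper_action(kp) for kp in hand_poses]
    return [(poses[t], poses[t + 1]) for t in range(len(poses) - 1)]
```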
Adding noise suppresses embodiment-specific details while preserving task intent. We train a human-vs-robot classifier to find the earliest noise level at which human actions become indistinguishable from robot actions, and treat human actions at or above that level as safe supervision.
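A minimal sketch of how such a noise-level search could work, assuming a DDPM-style schedule (`alphas_cumprod`) and a pretrained classifier that returns the probability a noised action came from the robot; the function signature, labels, and chance margin are assumptions made for illustration, not the paper's implementation.

```python
import torch

@torch.no_grad()
def earliest_indistinguishable_level(classifier,                   # (noised_actions, t) -> P(robot), shape (N,)
                                     human_actions: torch.Tensor,  # (N, action_dim)
                                     robot_actions: torch.Tensor,  # (M, action_dim)
                                     alphas_cumprod: torch.Tensor, # (T,) cumulative DDPM schedule
                                     chance_margin: float = 0.05) -> int:
    """Return the smallest diffusion timestep t at which the human-vs-robot classifier
    is near chance on noised actions; human data then supervises only timesteps >= t."""
    T = alphas_cumprod.shape[0]
    for t in range(T):
        a_bar = alphas_cumprod[t]
        # Forward-diffuse both sets to level t: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps.
        noised_h = a_bar.sqrt() * human_actions + (1 - a_bar).sqrt() * torch.randn_like(human_actions)
        noised_r = a_bar.sqrt() * robot_actions + (1 - a_bar).sqrt() * torch.randn_like(robot_actions)

        # Classifier accuracy with labels robot = 1, human = 0.
        correct_h = classifier(noised_h, t) < 0.5
        correct_r = classifier(noised_r, t) >= 0.5
        acc = torch.cat([correct_h, correct_r]).float().mean().item()

        if acc <= 0.5 + chance_margin:  # near chance: embodiments no longer distinguishable
            return t
    return T - 1
```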
We compare X-Diffusion against FILTERED (only robot-verified human demos), NAIVE (all human data), and ROBOT (robot data only), and find consistent gains on every task.
@misc{pace2025xdiffusiontrainingdiffusionpolicies,
  title={X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations},
  author={Maximus A. Pace and Prithwish Dan and Chuanruo Ning and Atiksh Bhardwaj and Audrey Du and Edward W. Duan and Wei-Chiu Ma and Kushal Kedia},
  year={2025},
  eprint={2511.04671},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2511.04671},
}