Sparse2Act (original) (raw)
Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation
Paper Videos
Sparse2Act uses task-space actions as geometric supervision for a masked point-cloud encoder, then reuses only the encoder initialization for downstream policies.
Abstract
Sparse 3D manipulation policies represent object geometry, spatial relations, and end-effector motion in a shared metric workspace. This structure suggests a simple pretraining signal: the demonstrated action should be inferable from the current 3D geometry. We introduceSparse2Act, an action-aligned pretraining framework for sparse point-cloud encoders. Sparse2Act groups each point cloud into local sparse 3D tokens, masks a subset of tokens, and trains the encoder to regress the demonstrated action from the remaining visible tokens. After pretraining, the lightweight regression head is discarded, and only the pretrained encoder is used to initialize downstream policies, allowing downstream control to operate in its preferred action space. In simulated manipulation benchmarks, our pretraining enables downstream diffusion policies to achieve 86.9% average success rates on LIBERO-10 after 500 fine-tuning steps, and the same pretrained encoder transfers to Meta-World-5 with 73.4% average success rates. We further demonstrate sim-to-real deployment by pretraining the encoder on simulated workspace actions and fine-tuning with a small set of real-robot joint-space demonstrations. These results show that our method provides a simple and effective pretraining signal for learning control-relevant sparse 3D representations for robot manipulation.
Project Overview Video
Your browser does not support the video element.
Method Overview
Framework overview. Sparse2Act converts raw point clouds into local sparse 3D tokens, encodes the visible tokens with 3D positional structure, and pretrains the encoder to predict task-space alignment actions from masked observations. After pretraining, only the encoder initialization is reused by downstream policies, allowing each policy to keep its own action parameterization.
Simulation Results
In-Domain Adaptation
In-domain adaptation on LIBERO-10 Success rates: DP3 29.1, DP 50.5, SpatialVLA 55.5, pi zero 85.2, and Sparse2Act 86.9 percent. 0 25 50 75 100 Success Rate (%) 29.1 50.5 55.5 85.2 86.9 DP3 DP SpatialVLA π₀ Ours
LIBERO-10. Sparse2Act reaches 86.9% average success after 500 fine-tuning steps.
Cross-Domain Transfer
Cross-domain transfer to Meta-World-5 Success rates: FVP 28.4, DP3 42.4, AFRO 48.8, LIBERO pretraining 73.4, and Meta-World pretraining 85.6 percent. 0 25 50 75 100 Success Rate (%) 28.4 42.4 48.8 73.4 85.6 FVP DP3 AFRO LIBERO pretrain MW pretrain
LIBERO to Meta-World-5. The transferred encoder reaches 73.4% average success.
Action-aligned pretraining provides a strong in-domain initialization while preserving substantial performance under cross-domain transfer.
Sim-to-Real Transfer
Comparison with DP3
Your browser does not support the video element.
Real-Robot Tasks
Simulation and Real Observations
Additional Videos
All videos are shown at 4× speed.