EgoSim: Egocentric World Simulator for Embodiment Interaction Generation
Overview
EgoSim is an egocentric world simulator for embodiment interaction generation. Given an initial 3D state and a sequence of actions, EgoSim generates temporally and spatially consistent egocentric observations with high-quality dexterous interactions. EgoSim also persistently updates a 3D state for continuous simulation.
Key features:
- Controllable egocentric video generation conditioned on 3D scene state and action sequences
- Updatable 3D memory for long-horizon continuous simulation
- Scalable data curation pipeline for scene-interaction pairs
- Few-shot generalization to in-the-wild real scenes and multiple embodiments
TODO
- Inference — run EgoSim-14B on Egodex and EgoVid datasets.
- Continuous simulation — multi-clip incremental generation with an updatable 3D scene state. Coming soon.
- Data preparation — annotate raw egocentric videos to produce inference-ready assets; see data_process/README.md.
- Training — coming soon.
Installation
Requires Python 3.10+ and CUDA 12.1+.
git clone https://github.com/your-org/egosim.git egosim-opensource
cd egosim-opensource
conda create -n egosim python=3.10 -y
conda activate egosim
Install PyTorch
pip install torch torchvision
Install flash attention
pip install flash-attn --no-build-isolation
pip install -r requirements.txt
Model weights
Download EgoSim-14B from HuggingFace:
huggingface-cli download wuzhi-hao/EgoSim --local-dir ./EgoSim-14B
Place the downloaded directory under the project root so the structure looks like:
EgoSim/
├── EgoSim-14B/
│ ├── diffusion_pytorch_model.safetensors
│ ├── Wan2.1_VAE.pth
│ ├── models_t5_umt5-xxl-enc-bf16.pth
│ ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
│ └── google/umt5-xxl/
├── egowm/
├── data_process/
└── ...
The VAE, T5, and CLIP weights are the same as Wan2.1-Fun-14B-InP. If you already have that model, you can symlink or copy those files.
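Before running inference, it can help to confirm that the weights directory matches the layout above. The following is a minimal sketch, not part of the EgoSim codebase: the helper names `missing_weights` and `link_shared_weights` are hypothetical, and it assumes that an existing Wan2.1-Fun-14B-InP directory stores the three shared files under the same filenames.

```python
from pathlib import Path

# File and directory names taken from the directory tree above.
EXPECTED_FILES = [
    "diffusion_pytorch_model.safetensors",
    "Wan2.1_VAE.pth",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
]
EXPECTED_DIRS = ["google/umt5-xxl"]
# Assumption: the VAE, T5, and CLIP files keep these names in Wan2.1-Fun-14B-InP.
SHARED_FILES = EXPECTED_FILES[1:]


def missing_weights(model_root):
    """Return the expected entries that are absent under model_root."""
    root = Path(model_root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


def link_shared_weights(wan_root, model_root):
    """Symlink the shared VAE/T5/CLIP files from an existing Wan2.1 directory."""
    wan_root, model_root = Path(wan_root), Path(model_root)
    model_root.mkdir(parents=True, exist_ok=True)
    linked = []
    for name in SHARED_FILES:
        src, dst = wan_root / name, model_root / name
        # Skip files that are not present in the source or already exist in the target.
        if src.is_file() and not dst.exists() and not dst.is_symlink():
            dst.symlink_to(src.resolve())
            linked.append(name)
    return linked
```

Calling `missing_weights("./EgoSim-14B")` from the project root returns an empty list when every entry from the tree is present.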
Data preparation
Alongside the source video, each sample requires three condition videos (ego prior, ego prior mask, and hand skeleton), a hand-inpainted first frame, and a text prompt:
| Input | Filename | Description |
|---|---|---|
| Ego prior video | rendered_scene.mp4 | Point cloud rendered from the first-frame scene, driven by per-frame camera poses |
| Ego prior mask | pc_mask_video.mp4 | Binary mask version of the point cloud (black points, white background) |
| Hand skeleton video | skeleton_3d.mp4 | 3D hand keypoint skeleton overlaid on the clip |
| First frame | hand_inpaint.png | First frame with hands inpainted (clean background) |
| Prompt | caption.txt → CSV prompt column | Natural-language description generated by Qwen2.5-VL |
All inputs are produced by the annotation pipeline in data_process/, which also generates the metadata.csv required by runner.py. See data_process/README.md for environment setup, model checkpoints, and step-by-step instructions.
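A quick way to catch incomplete annotation output is to check each sample for the files listed in the table. The sketch below assumes a hypothetical layout in which every sample is a subdirectory directly under the dataset root; the actual layout is defined by the pipeline in data_process/ and may differ. The function names are illustrative only.

```python
from pathlib import Path

# Filenames from the table above; caption.txt is the source of the CSV prompt column.
REQUIRED_ASSETS = (
    "rendered_scene.mp4",
    "pc_mask_video.mp4",
    "skeleton_3d.mp4",
    "hand_inpaint.png",
    "caption.txt",
)


def missing_assets(sample_dir):
    """Return the required filenames that are absent from one sample directory."""
    sample_dir = Path(sample_dir)
    return [name for name in REQUIRED_ASSETS if not (sample_dir / name).is_file()]


def incomplete_samples(dataset_root):
    """Map each incomplete sample directory name to its missing assets.

    Assumption: one subdirectory per sample directly under dataset_root.
    Complete samples are omitted from the result.
    """
    report = {}
    for sample_dir in sorted(p for p in Path(dataset_root).iterdir() if p.is_dir()):
        missing = missing_assets(sample_dir)
        if missing:
            report[sample_dir.name] = missing
    return report
```

An empty dictionary from `incomplete_samples(...)` means every sample directory found contains all five files.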
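For quick testing, download the demo data archive from Google Drive, place it in the project root, and extract it. The archive is referred to as both mini_sample.zip and demo_data.zip; if your download uses a different name than the command below, adjust the command accordingly:
unzip demo_data.zip -d tests/samples/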
Inference
All commands below assume you are inside the egosim-opensource/ directory:
Egodex — quick smoke test with bundled mini samples
PYTHONPATH=. python egowm/inference/runner.py \
    --dataset egodex \
    --model_root ./EgoSim-14B \
    --dataset_root tests/samples/demo_data/egodex \
    --metadata_path tests/samples/demo_data/egodex_metadata.csv \
    --output_dir output_egodex \
    --num_inference_steps 50 \
    --gpu_id 0
EgoVid — quick smoke test with bundled mini samples
PYTHONPATH=. python egowm/inference/runner.py \
    --dataset egovid \
    --model_root ./EgoSim-14B \
    --dataset_root tests/samples/demo_data/egovid \
    --metadata_path tests/samples/demo_data/egovid_metadata.csv \
    --output_dir output_egovid \
    --num_inference_steps 50 \
    --gpu_id 0
Each sample produces two files in --output_dir:
- {id}.mp4: generated video
- {id}_cmp.mp4: side-by-side comparison (ego_prior | hand_keypoint | generated)
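After a run, the output directory can be scanned to see which samples have both files. This is a minimal sketch based only on the naming pattern described here; `output_status` is a hypothetical helper, and it assumes that no sample id itself ends in `_cmp`.

```python
from pathlib import Path


def output_status(output_dir):
    """Return {id: has_comparison} for every generated {id}.mp4 in output_dir."""
    out = Path(output_dir)
    status = {}
    for video in sorted(out.glob("*.mp4")):
        # Files ending in _cmp are comparison videos, not generated clips.
        if video.stem.endswith("_cmp"):
            continue
        status[video.stem] = (out / f"{video.stem}_cmp.mp4").is_file()
    return status
```

Ids mapped to `False` have a generated clip but no comparison video, which may indicate an interrupted run.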
For full datasets, replace the --dataset_root and --metadata_path values with your actual paths.
Key options:
| Option | Default | Description |
|---|---|---|
| --num_inference_steps | 50 | Denoising steps |
| --num_frames | 61 | Frames per clip |
| --height / --width | 480 / 832 | Output resolution |
| --fps | 16 | Output FPS |
| --max_samples | — | Limit number of samples (useful for testing) |
| --skip_existing | — | Skip already-generated videos |
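When running several configurations, the command line can be assembled programmatically. The sketch below only reuses the flags shown in this section; `build_runner_command` and `run_inference` are hypothetical helpers, the `/path/to/...` values are placeholders, and passing `--num_frames` and `--fps` as `flag value` pairs is an assumption based on the table defaults.

```python
import os
import subprocess


def build_runner_command(dataset, model_root, dataset_root, metadata_path,
                         output_dir, num_inference_steps=50, gpu_id=0,
                         extra_args=()):
    """Assemble the runner.py argument list used in the smoke-test commands."""
    return [
        "python", "egowm/inference/runner.py",
        "--dataset", dataset,
        "--model_root", str(model_root),
        "--dataset_root", str(dataset_root),
        "--metadata_path", str(metadata_path),
        "--output_dir", str(output_dir),
        "--num_inference_steps", str(num_inference_steps),
        "--gpu_id", str(gpu_id),
        # Further options from the table are passed through verbatim.
        *extra_args,
    ]


def run_inference(command):
    """Run from the repository root with PYTHONPATH=. set, as in the commands above."""
    env = dict(os.environ, PYTHONPATH=".")
    return subprocess.run(command, env=env, check=True)


cmd = build_runner_command(
    dataset="egodex",
    model_root="./EgoSim-14B",
    dataset_root="/path/to/egodex",                # placeholder path
    metadata_path="/path/to/egodex_metadata.csv",  # placeholder path
    output_dir="output_egodex",
    extra_args=["--num_frames", "61", "--fps", "16"],
)
print(" ".join(cmd))
```

Building the argument list separately from `run_inference` makes it easy to inspect or log the exact command before launching a long generation job.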
Acknowledgements
This codebase is built upon several open-source projects. We sincerely thank the authors for their contributions.
Citation
@article{hao2026egosim,
  title={EgoSim: Egocentric World Simulator for Embodied Interaction Generation},
  author={Hao, Jinkun and Jia, Mingda and Wang, Ruiyan and Liu, Xihui and Yi, Ran and Ma, Lizhuang and Pang, Jiangmiao and Xu, Xudong},
  journal={arXiv preprint arXiv:2604.01001},
  year={2026}
}
