GitHub - jinkun-hao/EgoSim: EgoSim: Egocentric World Simulator for Embodiment Interaction Generation

Overview

EgoSim is an egocentric world simulator for embodiment interaction generation. Given an initial 3D state and a sequence of actions, EgoSim generates temporally and spatially consistent egocentric observations with high-quality dexterous interactions. EgoSim also persistently updates a 3D state for continuous simulation.

Key features:

Controllable egocentric video generation conditioned on 3D scene state and action sequences
Updatable 3D memory for long-horizon continuous simulation
Scalable data curation pipeline for scene-interaction pairs
Few-shot generalization to in-the-wild real scenes and multiple embodiments

TODO

Inference — run EgoSim-14B on Egodex and EgoVid datasets.
Continuous simulation — multi-clip incremental generation with an updatable 3D scene state. Coming soon.
Data preparation — annotate raw egocentric videos to produce inference-ready assets; see data_process/README.md.
Training — coming soon.

Installation

Requires Python 3.10+ and CUDA 12.1+.

git clone https://github.com/your-org/egosim.git
cd egosim-opensource

conda create -n egosim python=3.10 -y
conda activate egosim

Install PyTorch

pip install torch torchvision

Install flash attention

pip install flash-attn --no-build-isolation
pip install -r requirements.txt
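After installation, a quick sanity check can confirm that the interpreter and the packages installed above are visible. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and the script only looks up packages without importing them, so it does not verify the CUDA version.

```python
import importlib.util
import sys

# Minimum Python version stated in the requirements above.
MIN_PYTHON = (3, 10)
# Import names of the packages installed by the pip commands above.
REQUIRED_MODULES = ("torch", "torchvision", "flash_attn")


def check_environment() -> dict:
    """Return a small report; packages are located but never imported."""
    report = {"python_ok": sys.version_info >= MIN_PYTHON}
    for name in REQUIRED_MODULES:
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```

Run it inside the activated `egosim` conda environment; any `MISSING` entry points at an install step that did not complete.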

Model weights

Download EgoSim-14B from HuggingFace:

huggingface-cli download wuzhi-hao/EgoSim --local-dir ./EgoSim-14B

Place the downloaded directory under the project root so the structure looks like:

EgoSim/
├── EgoSim-14B/
│   ├── diffusion_pytorch_model.safetensors
│   ├── Wan2.1_VAE.pth
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
│   └── google/umt5-xxl/
├── egowm/
├── data_process/
└── ...
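To confirm that the download matches the layout above before starting a long inference run, a short check along these lines may help. It is a sketch under the assumption that the five entries listed in the tree are all that needs verifying; `missing_model_files` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path

# Entries copied from the directory tree above.
EXPECTED_ENTRIES = (
    "diffusion_pytorch_model.safetensors",
    "Wan2.1_VAE.pth",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
    "google/umt5-xxl",
)


def missing_model_files(model_root: str) -> list[str]:
    """Return the expected entries that do not exist under model_root."""
    root = Path(model_root)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_model_files("./EgoSim-14B")
    print("all files present" if not missing else f"missing: {missing}")
```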

The VAE, T5, and CLIP weights are the same as Wan2.1-Fun-14B-InP. If you already have that model, you can symlink or copy those files.
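The symlink option can be sketched as follows. This assumes the existing Wan2.1-Fun-14B-InP directory stores the three shared files under the same names shown in the tree above; the helper name `link_shared_weights` and the example source path are hypothetical.

```python
import os
from pathlib import Path

# Shared weight files, named as in the EgoSim-14B directory tree.
SHARED_FILES = (
    "Wan2.1_VAE.pth",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
)


def link_shared_weights(wan_dir: str, egosim_dir: str) -> list[str]:
    """Symlink shared weights into egosim_dir; return the names linked."""
    src_root = Path(wan_dir).resolve()
    dst_root = Path(egosim_dir)
    dst_root.mkdir(parents=True, exist_ok=True)
    linked = []
    for name in SHARED_FILES:
        src, dst = src_root / name, dst_root / name
        # Skip files absent in the source or already present in the target.
        if not src.is_file() or dst.exists() or dst.is_symlink():
            continue
        os.symlink(src, dst)
        linked.append(name)
    return linked


if __name__ == "__main__":
    # Hypothetical source location; point this at your own copy.
    print(link_shared_weights("/models/Wan2.1-Fun-14B-InP", "./EgoSim-14B"))
```

Existing files are never overwritten, so the script is safe to re-run. On platforms where creating symlinks needs elevated privileges, copying the files is the simpler route.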

Data preparation

Each sample requires three condition inputs alongside the source video, plus a text prompt:

Input	Filename	Description
Ego prior video	rendered_scene.mp4	Point cloud rendered from the first-frame scene, driven by per-frame camera poses
Ego prior mask	pc_mask_video.mp4	Binary mask version of the point cloud (black points, white background)
Hand skeleton video	skeleton_3d.mp4	3D hand keypoint skeleton overlaid on the clip
First frame	hand_inpaint.png	First frame with hands inpainted (clean background)
Prompt	caption.txt → CSV prompt column	Natural-language description generated by Qwen2.5-VL
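A per-sample completeness check for the file inputs in the table could look like the sketch below. It assumes each sample lives in its own directory containing the four files by the names listed; that layout is an assumption here, and the authoritative description is in data_process/README.md. The helper `missing_inputs` is hypothetical.

```python
from pathlib import Path

# Role -> filename, taken from the table above (the prompt lives in the CSV).
REQUIRED_INPUTS = {
    "ego prior video": "rendered_scene.mp4",
    "ego prior mask": "pc_mask_video.mp4",
    "hand skeleton video": "skeleton_3d.mp4",
    "first frame": "hand_inpaint.png",
}


def missing_inputs(sample_dir: str) -> dict[str, str]:
    """Return {role: filename} for every required file not found."""
    root = Path(sample_dir)
    return {
        role: filename
        for role, filename in REQUIRED_INPUTS.items()
        if not (root / filename).is_file()
    }
```

An empty result means the directory has every file-based input; the prompt still has to be present in the metadata CSV.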

All inputs are produced by the annotation pipeline in data_process/, which also generates the metadata.csv required by runner.py. See data_process/README.md for environment setup, model checkpoints, and step-by-step instructions.
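Before inference, it can be useful to confirm that every row of the generated metadata.csv actually carries a prompt. The sketch below assumes the file has a header row with a column named `prompt` (the table above mentions a CSV prompt column); the remaining columns are not documented here, so the code does not touch them. `rows_without_prompt` is a hypothetical helper.

```python
import csv


def rows_without_prompt(metadata_path: str) -> list[int]:
    """Return 1-based data-row numbers whose `prompt` cell is empty."""
    with open(metadata_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or "prompt" not in reader.fieldnames:
            raise ValueError("metadata file has no 'prompt' column")
        return [
            i
            for i, row in enumerate(reader, start=1)
            # A short row yields None for missing cells, hence the `or ""`.
            if not (row["prompt"] or "").strip()
        ]
```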

For quick testing, download the demo data (mini_sample.zip) from Google Drive, place it in the project root, and extract it:

unzip demo_data.zip -d tests/samples/

Inference

All commands below assume you are inside the egosim-opensource/ directory:

Egodex — quick smoke test with bundled mini samples

PYTHONPATH=. python egowm/inference/runner.py \
    --dataset egodex \
    --model_root ./EgoSim-14B \
    --dataset_root tests/samples/demo_data/egodex \
    --metadata_path tests/samples/demo_data/egodex_metadata.csv \
    --output_dir output_egodex \
    --num_inference_steps 50 \
    --gpu_id 0

EgoVid — quick smoke test with bundled mini samples

PYTHONPATH=. python egowm/inference/runner.py \
    --dataset egovid \
    --model_root ./EgoSim-14B \
    --dataset_root tests/samples/demo_data/egovid \
    --metadata_path tests/samples/demo_data/egovid_metadata.csv \
    --output_dir output_egovid \
    --num_inference_steps 50 \
    --gpu_id 0

Each sample produces two files in --output_dir:

{id}.mp4 — generated video
{id}_cmp.mp4 — side-by-side comparison: ego_prior | hand_keypoint | generated
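Given the two-files-per-sample convention above, a run can be audited for samples that produced only one of the two videos. This is a minimal sketch with a hypothetical helper name; it assumes sample ids do not themselves end in `_cmp`.

```python
from pathlib import Path


def incomplete_outputs(output_dir: str) -> list[str]:
    """Return sample ids lacking either {id}.mp4 or {id}_cmp.mp4."""
    stems = {p.stem for p in Path(output_dir).glob("*.mp4")}
    # Map every file back to its sample id by dropping a trailing "_cmp".
    ids = {s[: -len("_cmp")] if s.endswith("_cmp") else s for s in stems}
    return sorted(i for i in ids if i not in stems or f"{i}_cmp" not in stems)


if __name__ == "__main__":
    print(incomplete_outputs("output_egodex"))
```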

For full datasets, replace the --dataset_root and --metadata_path values with your actual paths.
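When switching between the demo data and full datasets, it may be convenient to assemble the command programmatically. The sketch below only uses flags that appear in the commands above; the function name `runner_command` and the `/data/...` paths are hypothetical placeholders.

```python
import shlex


def runner_command(dataset, dataset_root, metadata_path, output_dir,
                   model_root="./EgoSim-14B", steps=50, gpu_id=0):
    """Assemble the runner.py invocation as an argument list."""
    return [
        "python", "egowm/inference/runner.py",
        "--dataset", dataset,
        "--model_root", model_root,
        "--dataset_root", dataset_root,
        "--metadata_path", metadata_path,
        "--output_dir", output_dir,
        "--num_inference_steps", str(steps),
        "--gpu_id", str(gpu_id),
    ]


if __name__ == "__main__":
    # Hypothetical locations; substitute your own dataset paths.
    cmd = runner_command(
        "egodex", "/data/egodex", "/data/egodex_metadata.csv", "output_egodex"
    )
    print("PYTHONPATH=. " + shlex.join(cmd))
```

The printed line can be pasted into a shell from the project root. To launch it from Python instead, pass the list to `subprocess.run` with `PYTHONPATH=.` added to the environment.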

Key options:

Option	Default	Description
--num_inference_steps	50	Denoising steps
--num_frames	61	Frames per clip
--height / --width	480 / 832	Output resolution
--fps	16	Output FPS
--max_samples	—	Limit number of samples (useful for testing)
--skip_existing	—	Skip already-generated videos
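As a quick worked example of what the defaults imply, the arithmetic below derives clip length and pixel counts from the table values. The helper `clip_stats` is purely illustrative and not part of the codebase; the duration is computed simply as frames divided by FPS.

```python
def clip_stats(num_frames=61, fps=16, height=480, width=832):
    """Derive clip duration and pixel counts from the option values."""
    pixels_per_frame = height * width
    return {
        "duration_s": num_frames / fps,
        "pixels_per_frame": pixels_per_frame,
        "pixels_per_clip": pixels_per_frame * num_frames,
    }


if __name__ == "__main__":
    # Defaults: 61 frames at 16 FPS is about 3.81 s of 480x832 video.
    print(clip_stats())
```

Lowering `--num_frames` or the resolution reduces the number of pixels the model has to generate per clip, which is the usual lever when a quick test matters more than output length.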

Acknowledgements

This codebase is built upon the following open-source projects. We sincerely thank the authors for their contributions.

Citation

@article{hao2026egosim,
  title={EgoSim: Egocentric World Simulator for Embodied Interaction Generation},
  author={Hao, Jinkun and Jia, Mingda and Wang, Ruiyan and Liu, Xihui and Yi, Ran and Ma, Lizhuang and Pang, Jiangmiao and Xu, Xudong},
  journal={arXiv preprint arXiv:2604.01001},
  year={2026}
}