GitHub - jinkun-hao/EgoSim: EgoSim: Egocentric World Simulator for Embodiment Interaction Generation


Overview

EgoSim is an egocentric world simulator for embodiment interaction generation. Given an initial 3D state and a sequence of actions, EgoSim generates temporally and spatially consistent egocentric observations with high-quality dexterous interactions. EgoSim also persistently updates a 3D state for continuous simulation.

Key features:

TODO

Installation

Requires Python 3.10+, CUDA 12.1+.

git clone https://github.com/your-org/egosim.git
cd egosim-opensource

conda create -n egosim python=3.10 -y
conda activate egosim

Install PyTorch

pip install torch torchvision

Install flash attention

pip install flash-attn --no-build-isolation
pip install -r requirements.txt

Model weights

Download EgoSim-14B from HuggingFace:

huggingface-cli download wuzhi-hao/EgoSim --local-dir ./EgoSim-14B

Place the downloaded directory under the project root so the structure looks like:

EgoSim/
├── EgoSim-14B/
│   ├── diffusion_pytorch_model.safetensors
│   ├── Wan2.1_VAE.pth
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
│   └── google/umt5-xxl/
├── egowm/
├── data_process/
└── ...

The VAE, T5, and CLIP weights are the same as Wan2.1-Fun-14B-InP. If you already have that model, you can symlink or copy those files.
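
Before launching inference, it can help to confirm that the weight directory matches the layout above. The following is a minimal sketch and not part of the repository: the helper name `missing_weights` is hypothetical, and the list of entries is copied from the directory tree shown above.

```python
from pathlib import Path

# Expected entries under the model root, copied from the directory tree above.
EXPECTED_ENTRIES = [
    "diffusion_pytorch_model.safetensors",
    "Wan2.1_VAE.pth",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
    "google/umt5-xxl",
]


def missing_weights(model_root):
    """Return the expected entries that do not exist under model_root."""
    root = Path(model_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_weights("./EgoSim-14B")
    if missing:
        print("Missing from ./EgoSim-14B:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected weight entries are present.")
```

Symlinked files count as present as long as the link target exists, so the check also works when the VAE, T5, and CLIP weights are linked from an existing Wan2.1-Fun-14B-InP download.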

Data preparation

Each sample requires three condition inputs alongside the source video, plus a text prompt:

| Input | Filename | Description |
| --- | --- | --- |
| Ego prior video | rendered_scene.mp4 | Point cloud rendered from the first-frame scene, driven by per-frame camera poses |
| Ego prior mask | pc_mask_video.mp4 | Binary mask version of the point cloud (black points, white background) |
| Hand skeleton video | skeleton_3d.mp4 | 3D hand keypoint skeleton overlaid on the clip |
| First frame | hand_inpaint.png | First frame with hands inpainted (clean background) |
| Prompt | caption.txt → CSV prompt column | Natural-language description generated by Qwen2.5-VL |

All inputs are produced by the annotation pipeline in data_process/, which also generates the metadata.csv required by runner.py. See the README in data_process/ for environment setup, model checkpoints, and step-by-step instructions.
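
A quick way to catch incomplete samples before inference is to check for the files listed above. This is a minimal sketch under one loud assumption: that each sample lives in its own directory containing the files under exactly these names. The actual layout is defined by the pipeline in data_process/ and may differ, and the helper name `missing_inputs` is hypothetical.

```python
from pathlib import Path

# File names taken from the table above.
# ASSUMPTION: one directory per sample holding all of these files.
REQUIRED_FILES = {
    "ego prior video": "rendered_scene.mp4",
    "ego prior mask": "pc_mask_video.mp4",
    "hand skeleton video": "skeleton_3d.mp4",
    "first frame": "hand_inpaint.png",
    "prompt": "caption.txt",
}


def missing_inputs(sample_dir):
    """Return the labels of the inputs whose file is absent from sample_dir."""
    sample = Path(sample_dir)
    return [
        label
        for label, filename in REQUIRED_FILES.items()
        if not (sample / filename).is_file()
    ]
```

An empty list means every listed file was found. Since the prompt ends up in the CSV prompt column, the `caption.txt` entry can be dropped from `REQUIRED_FILES` if only the metadata file carries the caption.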

For quick testing, download the demo data archive from Google Drive, place it in the project root, and extract it (the command below assumes the archive is named demo_data.zip):

unzip demo_data.zip -d tests/samples/

Inference

All commands below assume you are inside the egosim-opensource/ directory:

EgoDex: quick smoke test with bundled mini samples

PYTHONPATH=. python egowm/inference/runner.py \
  --dataset egodex \
  --model_root ./EgoSim-14B \
  --dataset_root tests/samples/demo_data/egodex \
  --metadata_path tests/samples/demo_data/egodex_metadata.csv \
  --output_dir output_egodex \
  --num_inference_steps 50 \
  --gpu_id 0

EgoVid: quick smoke test with bundled mini samples

PYTHONPATH=. python egowm/inference/runner.py \
  --dataset egovid \
  --model_root ./EgoSim-14B \
  --dataset_root tests/samples/demo_data/egovid \
  --metadata_path tests/samples/demo_data/egovid_metadata.csv \
  --output_dir output_egovid \
  --num_inference_steps 50 \
  --gpu_id 0

Each sample produces two files in --output_dir:

For full datasets, replace the --dataset_root and --metadata_path values with your actual paths.
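
When running several datasets in a row, it may be convenient to assemble the runner command from Python. The sketch below only uses flags that appear in the commands above. The helper name `build_runner_command` is hypothetical, and the defaults simply mirror the smoke-test values.

```python
import shlex
import sys


def build_runner_command(dataset, dataset_root, metadata_path, output_dir,
                         model_root="./EgoSim-14B", num_inference_steps=50,
                         gpu_id=0):
    """Assemble the argv list for egowm/inference/runner.py."""
    return [
        sys.executable, "egowm/inference/runner.py",
        "--dataset", dataset,
        "--model_root", model_root,
        "--dataset_root", dataset_root,
        "--metadata_path", metadata_path,
        "--output_dir", output_dir,
        "--num_inference_steps", str(num_inference_steps),
        "--gpu_id", str(gpu_id),
    ]


if __name__ == "__main__":
    cmd = build_runner_command(
        dataset="egodex",
        dataset_root="tests/samples/demo_data/egodex",
        metadata_path="tests/samples/demo_data/egodex_metadata.csv",
        output_dir="output_egodex",
    )
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(cmd))
```

To execute the command rather than print it, pass the list to `subprocess.run(cmd, check=True, env={**os.environ, "PYTHONPATH": "."})` from the project directory, which reproduces the `PYTHONPATH=.` prefix used above.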

Key options:

| Option | Default | Description |
| --- | --- | --- |
| --num_inference_steps | 50 | Denoising steps |
| --num_frames | 61 | Frames per clip |
| --height / --width | 480 / 832 | Output resolution |
| --fps | 16 | Output FPS |
| --max_samples | | Limit number of samples (useful for testing) |
| --skip_existing | | Skip already-generated videos |
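
As a quick sanity check on the defaults above, the length of a generated clip follows directly from the frame count and frame rate. The helper name below is hypothetical; the numbers are the defaults from the table.

```python
def clip_duration_seconds(num_frames=61, fps=16):
    """Duration of a clip in seconds, given its frame count and frame rate."""
    return num_frames / fps


if __name__ == "__main__":
    # Defaults from the options table: 61 frames at 16 FPS.
    print(clip_duration_seconds())  # 3.8125
```

With the default settings each clip therefore lasts a little under four seconds.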

Acknowledgements

This codebase is built upon the following open-source projects. We sincerely thank the authors for their contributions.

Citation

@article{hao2026egosim,
  title={EgoSim: Egocentric World Simulator for Embodied Interaction Generation},
  author={Hao, Jinkun and Jia, Mingda and Wang, Ruiyan and Liu, Xihui and Yi, Ran and Ma, Lizhuang and Pang, Jiangmiao and Xu, Xudong},
  journal={arXiv preprint arXiv:2604.01001},
  year={2026}
}