GitHub - 3dlg-hcvc/video2articulation: Code for Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos
iTACO
iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos
Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva
3DV 2026
Environment Setup
Our code is tested on Python 3.10. We recommend using conda to manage the Python environment.
- Create a conda environment
conda create -n video_articulation python=3.10
conda activate video_articulation
- Install pytorch==2.4.0+cu124. You can change the CUDA version according to your hardware setup.
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
- Install pytorch3d.
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
- Install other dependencies.
pip install opencv-python kornia open3d scipy Pillow trimesh yourdfpy wandb
- We use wandb to log optimization statistics in the refinement process. Make sure you have a wandb account and export your wandb API key to the environment.
export WANDB_API_KEY=YOUR_WANDB_API_KEY
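Before moving on, it can help to confirm that the setup steps above took effect. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and the module list assumes the usual import names for the packages installed above (for example, `opencv-python` is imported as `cv2` and `Pillow` as `PIL`).

```python
import importlib.util
import os
import sys

# Import names assumed for the packages installed above.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "pytorch3d",
    "cv2", "kornia", "open3d", "scipy", "PIL", "trimesh", "yourdfpy", "wandb",
]


def check_environment(env=None, modules=REQUIRED_MODULES, version=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) != (3, 10):
        problems.append(f"Python 3.10 expected, found {version[0]}.{version[1]}")
    for name in modules:
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not found: {name}")
    if not env.get("WANDB_API_KEY"):
        problems.append("WANDB_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```

Run it inside the activated `video_articulation` environment; any line it prints other than "environment looks OK" points at a step to revisit.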
Prepare Dataset
Our synthetic dataset is publicly available on huggingface. Please follow the instructions to download the dataset. You also need to download the PartNet-Mobility Dataset for geometry reconstruction evaluation. Please follow their terms of use to download the dataset and place it in the project directory. The final project file structure should look like this:
project_root_directory
|__docs
|__partnet-mobility-v0
    |__148
    |__149
    ......
|__sim_data
    |__partnet_mobility
    |__exp_results
        |__preprocessing
|__real_data
    |__raw_data
    |__exp_results
        |__preprocessing
|__joint_coarse_prediction.py
|__joint_refinement.py
|__launch_joint_refinement.py
|__new_partnet_mobility_dataset_correct_intr_meta.json
|__partnet_mobility_data_split.yaml
......
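The layout above can be checked with a short script. This is a sketch under assumptions: the nesting of `sim_data` and `real_data` is inferred from the paths used in the commands in this README, entries shown only as `......` are not checked, and `missing_entries` is a hypothetical helper rather than something the repository provides.

```python
from pathlib import Path

# Directories and files named in the tree above (nesting inferred from
# the paths used by the commands in this README).
EXPECTED_DIRS = [
    "docs",
    "partnet-mobility-v0",
    "sim_data/partnet_mobility",
    "sim_data/exp_results/preprocessing",
    "real_data/raw_data",
    "real_data/exp_results/preprocessing",
]
EXPECTED_FILES = [
    "joint_coarse_prediction.py",
    "joint_refinement.py",
    "launch_joint_refinement.py",
    "new_partnet_mobility_dataset_correct_intr_meta.json",
    "partnet_mobility_data_split.yaml",
]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    for entry in missing_entries("."):
        print("missing:", entry)
```

Running it from the project root prints nothing when every listed entry is present.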
(Optional) Synthetic Data Generation
We also provide a template for generating the synthetic dataset. Note that not all synthetic data in our dataset were generated with exactly the same script and parameters.
python render_interaction_sim.py
(Optional) Preprocessing
This step computes the video moving map with MonST3R and the video part segmentation with automatic part segmentation. For real data, we also scale up the depth map with PromptDA and mask out hands from the interaction video with Grounded-SAM-2. Processing all the test videos in our synthetic dataset is computationally intensive, so you can download the preprocessed data from huggingface to skip this step. Otherwise, please continue.
- Update submodules
git submodule init
git submodule update
- Compute video moving map with MonST3R
Follow the instructions in monst3r to prepare the environment. Inside the monst3r directory, run
python demo.py \
--input ../sim_data/partnet_mobility/Microwave/7265/joint_0_bg/view_0/sample_rgb/ \
--output_dir ../sim_data/exp_results/preprocessing/Microwave/7265/joint_0_bg/view_0/ \
--seq_name monst3r \
--motion_mask_thresh 0.35
You can change the motion_mask_thresh number to see different video moving map segmentation results. In our paper, we use 0.35.
- Compute video part segmentation with automatic part segmentation
Follow the instructions in AutoSeg-SAM2 to prepare the environment. Inside the AutoSeg-SAM2 directory, run
python auto-mask-batch.py \
--video_path ../sim_data/partnet_mobility/Microwave/7265/joint_0_bg/view_0/rgb_reverse \
--output_dir ../sim_data/exp_results/preprocessing/Microwave/7265/joint_0_bg/view_0/video_segment_reverse \
--batch_size 10 \
--detect_stride 5 \
--level small \
--pred_iou_thresh 0.9 \
--stability_score_thresh 0.95
Results are saved inside {--output_dir}. You can also visualize the results for debugging purposes.
python visulization.py \
--video_path ../sim_data/partnet_mobility/Microwave/7265/joint_0_bg/view_0/rgb_reverse \
--output_dir ../sim_data/exp_results/preprocessing/Microwave/7265/joint_0_bg/view_0/ \
--level small
- Real Data Only. Scale up original depth maps with PromptDA. In our paper, we use an iPhone 12 Pro to capture real data. The original depth map is 192 × 256, which is a very low resolution. Naively scaling up the depth map via bilinear interpolation produces a noisy depth map. Therefore, we leverage PromptDA to scale up the depth map with relatively high quality. Inside the PromptDA/ directory, run
python scale_depth.py \
--image_dir ../real_data/raw_data/book/surface/keyframes/corrected_images/ \
--depth_dir ../real_data/raw_data/book/surface/keyframes/depth/ \
--save_dir ../real_data/exp_results/preprocessing/book/prompt_depth_surface
python scale_depth.py \
--image_dir ../real_data/raw_data/book/rgb/ \
--depth_dir ../real_data/raw_data/book/depth/ \
--save_dir ../real_data/exp_results/preprocessing/book/prompt_depth_video
- Real Data Only. Mask out hands and arms in the interaction video with Grounded-SAM-2. In our paper, we discard hand information from the input video. Inside the Grounded-SAM-2/ directory, run
python mask_hand.py \
--video_frame_dir ../real_data/raw_data/book/rgb/ \
--save_dir ../real_data/exp_results/preprocessing/book/hand_mask/
- Real Data Only. Align camera coordinates of the video to the coordinate frame used for object surface reconstruction. In principle, we could add the initial frame of the video to the set of images used for surface reconstruction, which would unify the two coordinate frames without extra effort. However, in our paper we use Polycam for surface reconstruction and Record3D to record the interaction video, so we need a few more steps to align them. Here we adopt a very simple strategy. Since we have both RGB images and depth maps for surface reconstruction and the interaction video, we compute feature matches between the first video frame and the images used for surface reconstruction. We take the image pair with the most reliable matches and compute the SE(3) transformation between them. This provides the transformation from camera poses in the interaction video to the surface reconstruction coordinate frame.
python align_surface_video.py \
--view_dir real_data/raw_data/book/ \
--preprocess_dir real_data/exp_results/preprocessing/book/
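Once a relative SE(3) transform between the first video frame and its best-matching surface image is known, carrying every video camera pose into the surface reconstruction frame is a chain of matrix products. The sketch below illustrates that step only. It is not the code in `align_surface_video.py`, and it assumes that all poses are 4x4 camera-to-world matrices and that the relative transform maps points from the first video camera frame to the matched surface camera frame. All function and parameter names are hypothetical.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def se3_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def align_video_poses(video_poses, surface_pose, rel_surface_from_video0):
    """Express video camera poses in the surface reconstruction frame.

    video_poses: camera-to-world poses in the video's own world frame.
    surface_pose: camera-to-world pose of the matched surface image.
    rel_surface_from_video0: maps points from the first video camera frame
        to the matched surface camera frame (assumed convention).
    """
    # World-to-world alignment: video world -> first video camera
    # -> matched surface camera -> surface world.
    world_align = matmul4(matmul4(surface_pose, rel_surface_from_video0),
                          se3_inverse(video_poses[0]))
    return [matmul4(world_align, pose) for pose in video_poses]
```

With this convention, the first aligned pose equals `surface_pose` composed with the relative transform, and later frames keep their motion relative to the first frame.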
Coarse Prediction
Our pipeline starts with coarse prediction.
python joint_coarse_prediction.py \
--data_type sim \
--view_dir sim_data/partnet_mobility/Microwave/7265/joint_0_bg/view_0/ \
--preprocess_dir sim_data/exp_results/preprocessing/Microwave/7265/joint_0_bg/view_0/ \
--prediction_dir sim_data/exp_results/prediction/Microwave/7265/joint_0_bg/view_0/ \
--mask_type monst3r
Here view_dir refers to the directory containing the data of a specific test video. preprocess_dir refers to the directory containing the data preprocessed by MonST3R and automatic part segmentation. prediction_dir is the path where you want to save the results. mask_type refers to the video moving map. You can run python joint_coarse_prediction.py -h to see the available options.
After running the coarse prediction module, the results are saved inside the sim_data/exp_results/prediction/ folder.
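The three directory arguments share the same category/object/joint/view suffix, so they can be derived from one sample identifier when running many videos. The helpers below are a hypothetical sketch for synthetic data only. They use only the flags shown in the command above, and the path pattern is inferred from the example paths rather than documented by the repository.

```python
import shlex
from pathlib import PurePosixPath


def sample_dirs(category, object_id, joint, view, root="sim_data"):
    """Derive the three per-video directories from one sample identifier."""
    rel = PurePosixPath(category) / str(object_id) / joint / view
    return {
        "view_dir": f"{root}/partnet_mobility/{rel}/",
        "preprocess_dir": f"{root}/exp_results/preprocessing/{rel}/",
        "prediction_dir": f"{root}/exp_results/prediction/{rel}/",
    }


def coarse_prediction_command(category, object_id, joint, view,
                              mask_type="monst3r"):
    """Build the argument list for joint_coarse_prediction.py."""
    dirs = sample_dirs(category, object_id, joint, view)
    return [
        "python", "joint_coarse_prediction.py",
        "--data_type", "sim",
        "--view_dir", dirs["view_dir"],
        "--preprocess_dir", dirs["preprocess_dir"],
        "--prediction_dir", dirs["prediction_dir"],
        "--mask_type", mask_type,
    ]


if __name__ == "__main__":
    cmd = coarse_prediction_command("Microwave", 7265, "joint_0_bg", "view_0")
    # Print the command for inspection instead of executing it.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the project root to launch the run.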
Refinement
The second stage is refinement. Our refinement module attempts to optimize joint parameters of a single type of joint. Therefore, you need to run this module twice to get final prediction results.
python launch_joint_refinement.py \
--data_type sim \
--exp_name refinement \
--view_dir sim_data/partnet_mobility/Microwave/7265/joint_0_bg/view_0/ \
--preprocess_dir sim_data/exp_results/preprocessing/Microwave/7265/joint_0_bg/view_0/ \
--prediction_dir sim_data/exp_results/prediction/Microwave/7265/joint_0_bg/view_0/ \
--mask_type monst3r \
--loss chamfer
Results are saved inside the sim_data/exp_results/prediction/ folder as well. You can add the --vis option to visualize results in the wandb panel during optimization. Please be aware that this visualization occupies a lot of storage.
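The `--loss chamfer` option presumably selects a Chamfer distance between point sets. The repository's exact formulation is not given here, so the following is only a reference sketch of one common convention: the mean squared nearest-neighbor distance, summed over both directions. Other conventions use unsquared distances or sums instead of means.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Each set is a sequence of equal-length coordinate tuples. This brute-force
    version is O(len(a) * len(b)) and meant for illustration only.
    """
    def one_sided(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        return sum(
            min(sum((s - d) ** 2 for s, d in zip(p, q)) for q in dst)
            for p in src
        ) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)
```

For optimization on a GPU, a batched implementation such as `pytorch3d.loss.chamfer_distance` is the more practical choice; whether the refinement module uses it is not stated here.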
We use NKSR for mesh reconstruction. Please follow their instructions to prepare the environment. NKSR requires different PyTorch and CUDA versions from our method, so you will probably need a separate conda environment.
In the NKSR environment, run
python extract_mesh.py --data_type sim \
--view_dir sim_data/partnet_mobility/Microwave/7265/joint_0_bg/view_0/ \
--refinement_results_dir sim_data/exp_results/prediction/Microwave/7265/joint_0_bg/view_0/refinement/monst3r/chamfer/0/
It reconstructs the whole mesh as well as the meshes of the moving part and the static part of the object. It also samples 10000 points from the mesh surface to evaluate geometric reconstruction accuracy against the ground truth mesh. Results are saved inside sim_data/exp_results/prediction/. Note that this step sometimes needs a large amount of CUDA memory. We recommend a GPU with at least 48 GB of memory, such as an RTX A6000.
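Sampling points from a mesh surface is typically done by picking triangles with probability proportional to their area and then drawing a uniform point inside each picked triangle. The sketch below shows that technique in plain Python. It is an illustration, not the code in `extract_mesh.py`, and the function names are hypothetical.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross product of two edges."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = [u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0]]
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface(vertices, faces, count, seed=0):
    """Draw `count` points uniformly from the surface of a triangle mesh."""
    rng = random.Random(seed)
    tris = [(vertices[i], vertices[j], vertices[k]) for i, j, k in faces]
    areas = [triangle_area(*t) for t in tris]
    # Larger triangles are picked proportionally more often.
    chosen = rng.choices(tris, weights=areas, k=count)
    points = []
    for a, b, c in chosen:
        r1, r2 = rng.random(), rng.random()
        if r1 + r2 > 1.0:
            # Reflect samples from the other half of the parallelogram
            # back into the triangle.
            r1, r2 = 1.0 - r1, 1.0 - r2
        points.append(tuple(a[i] + r1 * (b[i] - a[i]) + r2 * (c[i] - a[i])
                            for i in range(3)))
    return points
```

Since trimesh is already among the installed dependencies, `trimesh.sample.sample_surface(mesh, 10000)` offers the same kind of area-weighted sampling on a loaded mesh.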
Evaluation
Finally, you can run evaluate.py to evaluate all the prediction results. This is only for synthetic data.
python evaluate.py \
--view_dir sim_data/partnet_mobility/Microwave/7265/joint_0_bg/view_0/ \
--refinement_results_dir sim_data/exp_results/prediction/Microwave/7265/joint_0_bg/view_0/refinement/monst3r/chamfer/0/
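In the example paths, the refinement results directory looks like the prediction directory followed by the experiment name, mask type, loss, and a trailing index. The helper below encodes that observed pattern so the evaluation command can be built per video. The pattern is an inference from the example paths only, the meaning of the trailing `0` is not documented here, and the name `run_index` is a hypothetical label for it.

```python
def refinement_results_dir(prediction_dir, exp_name="refinement",
                           mask_type="monst3r", loss="chamfer", run_index=0):
    """Compose the results path following the pattern seen in the examples."""
    base = prediction_dir.rstrip("/")
    return f"{base}/{exp_name}/{mask_type}/{loss}/{run_index}/"


def evaluate_command(view_dir, prediction_dir, **kwargs):
    """Build the argument list for evaluate.py using the flags shown above."""
    return [
        "python", "evaluate.py",
        "--view_dir", view_dir,
        "--refinement_results_dir",
        refinement_results_dir(prediction_dir, **kwargs),
    ]
```

Check the composed path against the actual contents of your prediction folder before relying on it.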
Citation
If you find our work helpful, please consider citing our paper:
@inproceedings{peng2025itaco,
  booktitle = {3DV 2026},
  author = {Weikun Peng and Jun Lv and Cewu Lu and Manolis Savva},
  title = {{iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos}},
  year = {2025}
}
Acknowledgments
This work was funded in part by a Canada Research Chair, NSERC Discovery Grant, and enabled by support from the Digital Research Alliance of Canada. The authors would like to thank Jiayi Liu, Xingguang Yan, Austin T. Wang, Hou In Ivan Tam, Morteza Badali for valuable discussions, and Yi Shi for proofreading.
