GitHub - sv-pp/SceneVersepp: Official implementation of CVPR26 paper "Lifting Unlabeled Internet-level Data for 3D Scene Understanding"
TL;DR
Annotated 3D scene data is scarce. We build an automated data engine that lifts web videos into structured 3D supervision (instance-level point clouds, object layouts, spatial VQA, and vision-language navigation) and show experimentally that this generated data has strong potential to supplement training data across 3D scene understanding tasks.
What's in this repo
This is the public release of the training code and data pipeline from the paper.
| Directory | Purpose |
|---|---|
| PQ3D/ | 3D instance segmentation training |
| SpatialLM/ | 3D object detection training |
| data_processing/ | Video download, frame extraction, camera-pose visualization for the SVPP dataset |
Quick start
1. Get the dataset
huggingface-cli download bigai/SceneVersepp --repo-type dataset --local-dir ./svpp
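If you prefer to fetch the dataset from Python instead of the CLI, the same download can be sketched with `huggingface_hub.snapshot_download`. The helper names `download_kwargs` and `download_svpp` are hypothetical and not part of this repository; the repo ID, repo type, and target directory are taken from the command above.

```python
REPO_ID = "bigai/SceneVersepp"  # dataset repo named in the CLI command above


def download_kwargs(local_dir="./svpp"):
    """Build the arguments mirroring the huggingface-cli call (hypothetical helper)."""
    return {
        "repo_id": REPO_ID,
        "repo_type": "dataset",
        "local_dir": str(local_dir),
    }


def download_svpp(local_dir="./svpp"):
    """Download the dataset snapshot and return the local folder path."""
    # Imported lazily so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir))
```

Calling `download_svpp("./svpp")` should leave the files in the same place as the CLI command, assuming `huggingface_hub` is installed in the active environment.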
2. Set up the data-processing environment
The scripts in data_processing/ (video download, frame extraction, pose visualization) use a lightweight environment defined by requirements.txt:
conda create -n svpp python=3.10 -y
conda activate svpp
pip install -r requirements.txt
The training stacks under PQ3D/ and SpatialLM/ each have their own heavier environments. See their respective READMEs.
3. Process the raw videos
Download YouTube videos referenced by each scene's data_info.json
python data_processing/download_videos.py ./svpp
Extract raw and cropped frames into images/ and crop_images/
python data_processing/extract_images.py ./svpp
(Optional) Visualize camera poses for one scene with Open3D
python data_processing/view_camera_poses.py ./svpp --scene-name bedroom_100_3o5KSzfdOSE
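After running the steps above, it can help to check which scenes have actually been processed. The following is a minimal sketch, not part of the repository: it assumes only that every scene folder contains a data_info.json and that frame extraction writes into images/ and crop_images/ next to it, as described above. The function name `scene_status` is hypothetical.

```python
from pathlib import Path


def _has_files(folder):
    """True if the folder exists and contains at least one entry."""
    return folder.is_dir() and any(folder.iterdir())


def scene_status(dataset_root):
    """Map each scene name to whether its frame folders are populated.

    A scene is taken to be any directory holding a data_info.json
    (an assumption about the layout, searched recursively).
    """
    status = {}
    for info in sorted(Path(dataset_root).rglob("data_info.json")):
        scene = info.parent
        status[scene.name] = {
            "images": _has_files(scene / "images"),
            "crop_images": _has_files(scene / "crop_images"),
        }
    return status
```

For example, `scene_status("./svpp")` returns a dictionary keyed by scene name; scenes whose entries are `False` have not been through frame extraction yet, or their video download failed.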
4. Train
Each training stack is independent and ships with its own README.md:
- PQ3D/README.md — segmentation data generation and two-stage training
- SpatialLM/README.md — layout generation, pretraining, fine-tuning, inference, and evaluation
Citation
@inproceedings{chen2026lifting,
  title     = {Lifting Unlabeled Internet-level Data for 3D Scene Understanding},
  author    = {Chen, Yixin and Zhang, Yaowei and Yu, Huangyue and He, Junchao and Wang, Yan and Huang, Jiangyong and Shen, Hongyu and Ni, Junfeng and Wang, Shaofei and Jia, Baoxiong and Zhu, Song-Chun and Huang, Siyuan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
Acknowledgements
This repository builds on: