NVlabs/FoundationStereo: [CVPR 2025 Best Paper Nomination] FoundationStereo: Zero-Shot Stereo Matching

This is the official implementation of our paper, accepted to CVPR 2025 as an Oral (Best Paper Nomination).

[Website] [Paper] [Video]

Authors: Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, Stan Birchfield

Abstract

Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization — a hallmark of foundation models in other computer vision tasks — remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation.

TLDR: Our method takes as input a pair of stereo images and outputs a dense disparity map, which can be converted to a metric-scale depth map or 3D point cloud.
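The conversion mentioned above follows the standard pinhole stereo relation, depth = focal length × baseline / disparity. Below is a minimal sketch of that conversion, not the repository's own code. It assumes a rectified pair, a focal length in pixels, and a baseline in meters; the function names and parameters (`disparity_to_depth`, `depth_to_points`, `focal_px`, `baseline_m`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth (meters): depth = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > eps  # zero or negative disparity has no finite depth
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth


def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project the finite pixels of a depth map into an (N, 3) camera-frame point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    valid = np.isfinite(depth)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```

For example, with a focal length of 500 px and a baseline of 0.1 m, a disparity of 10 px corresponds to a depth of 5 m. Pixels with zero disparity are treated as infinitely far away and are dropped from the point cloud.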

Changelog

Date Description
2025/12/15 Check out our real-time model, Fast-FoundationStereo
2025/08/05 Our commercial model is now available here!
2025/07/03 Improve ONNX and TRT support. Add support for Jetson

Leaderboards 🏆

We obtained 1st place on the worldwide Middlebury and ETH3D leaderboards.


Comparison with Monocular Depth Estimation

Our method outperforms existing approaches in zero-shot stereo matching tasks across different scenes.

Installation

We've tested on Linux with the following GPUs: 3090, 4090, A100, V100, and Jetson Orin. Other GPUs should also work, but make sure you have enough memory.

conda env create -f environment.yml
conda run -n foundation_stereo pip install flash-attn
conda activate foundation_stereo

Note that flash-attn needs to be installed separately to avoid errors during environment creation.

Model Weights

Model Description
23-51-11 Our best-performing model for general use, based on ViT-large
11-33-40 Slightly lower accuracy but faster inference, based on ViT-small
NVIDIA-TAO For commercial usage (adapted from the ViT-small model)

Run demo

python scripts/run_demo.py --left_file ./assets/left.png --right_file ./assets/right.png --ckpt_dir ./pretrained_models/23-51-11/model_best_bp2.pth --out_dir ./test_outputs/

You should see the output point cloud.
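To process several stereo pairs, the demo command can be wrapped in a small script. The sketch below is an illustration, not part of the repository. It reuses only the flags shown in the command above and assumes a hypothetical layout in which left and right images live in two folders and share file names. The helper names `build_demo_command` and `run_demo_on_folder` are invented for this example.

```python
import subprocess
import sys
from pathlib import Path

DEFAULT_CKPT = "./pretrained_models/23-51-11/model_best_bp2.pth"


def build_demo_command(left, right, out_dir, ckpt=DEFAULT_CKPT):
    """Return the argv list for one run_demo.py call (flags taken from the command above)."""
    return [
        sys.executable, "scripts/run_demo.py",
        "--left_file", str(left),
        "--right_file", str(right),
        "--ckpt_dir", str(ckpt),
        "--out_dir", str(out_dir),
    ]


def run_demo_on_folder(left_dir, right_dir, out_root, ckpt=DEFAULT_CKPT):
    """Run the demo once per pair; pairs are matched by identical file name (assumption)."""
    for left in sorted(Path(left_dir).glob("*.png")):
        right = Path(right_dir) / left.name
        if not right.exists():
            continue  # skip left images without a matching right image
        out_dir = Path(out_root) / left.stem  # one output folder per pair
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(build_demo_command(left, right, out_dir, ckpt), check=True)
```

Calling `run_demo_on_folder("./left", "./right", "./test_outputs")` from the repository root would then launch one demo run per matched pair. Each run reloads the model, so this is convenient rather than fast.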

Tips:

ONNX/TensorRT(TRT) Inference

We only support a Docker setup for the ONNX/TRT version.

export DIR=$(pwd)
cd docker && docker build --network host -t foundation_stereo .
bash run_container.sh
cd /
git clone https://github.com/onnx/onnx-tensorrt.git
cd onnx-tensorrt
python3 setup.py install
apt-get install -y libnvinfer-dispatch10 libnvinfer-bin tensorrt
cd $DIR

XFORMERS_DISABLED=1 python scripts/make_onnx.py --save_path ./pretrained_models/foundation_stereo.onnx --ckpt_dir ./pretrained_models/23-51-11/model_best_bp2.pth --height 448 --width 672 --valid_iters 20
trtexec --onnx=pretrained_models/foundation_stereo.onnx --verbose --saveEngine=pretrained_models/foundation_stereo.plan --fp16
python scripts/run_demo_tensorrt.py \
        --left_img ${PWD}/assets/left.png \
        --right_img ${PWD}/assets/right.png \
        --save_path ${PWD}/output \
        --pretrained pretrained_models/foundation_stereo.plan \
        --height 448 \
        --width 672 \
        --pc \
        --z_far 100.0

We have observed a 6x speedup on the same 3090 GPU with TensorRT FP16. The actual speedup depends on various factors, but we recommend trying it out if you care about faster inference. Also remember to adjust the arguments to suit your needs.
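The ONNX export above fixes the input resolution through `--height` and `--width`. If your images are resized to that resolution before inference, the predicted disparity is expressed in pixels of the resized image. Whether the demo script already compensates for this is not stated here, so the sketch below is an assumption-laden illustration: it maps a disparity map back to the original resolution with nearest-neighbor sampling and scales the values by the width ratio. The function name `rescale_disparity` is hypothetical.

```python
import numpy as np


def rescale_disparity(disparity, orig_height, orig_width):
    """Resize a disparity map to (orig_height, orig_width) and rescale its values.

    Disparity is a horizontal pixel offset, so its values scale with the
    width ratio only, not with the height ratio.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    h, w = disparity.shape
    # Nearest-neighbor lookup: map each output pixel to a source row/column.
    rows = np.arange(orig_height) * h // orig_height
    cols = np.arange(orig_width) * w // orig_width
    resized = disparity[rows[:, None], cols[None, :]]
    return resized * (orig_width / w)
```

For instance, a disparity of 1 px predicted at 448×672 corresponds to 2 px in an original 896×1344 image. Camera intrinsics used for depth conversion should then be those of the original image.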

Running on Jetson

Please refer to readme_jetson.md.

FSD Dataset

You can download the whole dataset here (>1TB). We also provide a small data sample (3GB) for a quick look. The whole dataset contains ~1M data points, each of which consists of:

You can check how to read data by using our example with the sample data:

python scripts/vis_dataset.py --dataset_path ./DATA/sample/manipulation_v5_realistic_kitchen_2500_1/dataset/data/

It will produce:

For the dataset license, please check this.

FAQ

BibTeX

@article{wen2025stereo,
  title={FoundationStereo: Zero-Shot Stereo Matching},
  author={Bowen Wen and Matthew Trepte and Joseph Aribido and Jan Kautz and Orazio Gallo and Stan Birchfield},
  journal={CVPR},
  year={2025}
}

Acknowledgement

We would like to thank Gordon Grigor, Jack Zhang, Karsten Patzwaldt, Hammad Mazhar and other NVIDIA Isaac team members for their tremendous engineering support and valuable discussions. Thanks to the authors of DINOv2, DepthAnything V2, Selective-IGEV and RAFT-Stereo for their code release. Finally, thanks to CVPR reviewers and AC for their appreciation of this work and constructive feedback.

Contact

For commercial inquiries, additional technical support, and other questions, please reach out to Bowen Wen (bowenw@nvidia.com).