Benchmark Results - E3D-Bench

End-to-End 3D Geometric Foundation Models

1The University of Texas at Austin 2Brown University
3University of Central Florida 4NVIDIA Research 5Stanford University

Abstract

Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged, directly predicting dense 3D representations in a single feed-forward pass and eliminating the dependence on precomputed camera parameters, which are often slow to obtain or simply unavailable. Since late 2023, the field has exploded with diverse variants. With the rapid proliferation of 3D GFMs, we ask:

Q1 Can GFMs serve as an effective and robust foundation for diverse 3D tasks and scenarios?

Q2 Can GFMs serve as an efficient foundation, especially for latency-constrained 3D applications?

In this work, we present the first comprehensive benchmark for 3D GFMs, covering five core tasks (sparse-view depth estimation, video depth estimation, 3D reconstruction, multi-view pose estimation, and novel view synthesis) and spanning both standard and challenging out-of-distribution datasets. Our standardized toolkit automates dataset handling, evaluation protocols, and metric computation to ensure fair, reproducible comparisons. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains, and derive key insights to guide future model scaling and optimization. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial AI.

Effectiveness

3D Reconstruction

Accuracy (ACC ↓), Completeness (Comp ↓), and Normal Consistency (NC ↑); each cell reports ACC / Comp / NC. DTU is object-centric; 7-Scenes, NRGBD, ScanNet, and TUM-RGBD are indoor scenes. The first block uses extremely sparse input views, the second dense input views.

Extremely sparse views:

| Method | DTU | 7-Scenes | NRGBD | ScanNet | TUM-RGBD |
|---|---|---|---|---|---|
| DUSt3R/LSM | 1.731 / 1.936 / 0.786 | 0.146 / 0.181 / 0.744 | 0.144 / 0.154 / 0.867 | 0.474 / 0.420 / 0.714 | 1.108 / 0.746 / 0.724 |
| MASt3R | 1.895 / 2.003 / 0.788 | 0.262 / 0.254 / 0.732 | 0.113 / 0.102 / 0.810 | 0.467 / 0.389 / 0.701 | 0.738 / 0.747 / 0.739 |
| Spann3R | 6.275 / 5.460 / 0.705 | 0.255 / 0.188 / 0.653 | 0.262 / 0.262 / 0.628 | 0.487 / 0.408 / 0.617 | 1.561 / 1.002 / 0.621 |
| FLARE | 3.406 / 3.950 / 0.491 | 0.152 / 0.154 / 0.704 | 0.060 / 0.056 / 0.839 | 0.357 / 0.302 / 0.561 | 0.515 / 0.486 / 0.677 |
| CUT3R | 6.885 / 5.022 / 0.727 | 0.118 / 0.142 / 0.717 | 0.104 / 0.078 / 0.828 | 0.260 / 0.238 / 0.692 | 0.587 / 0.553 / 0.683 |
| VGGT | 2.716 / 2.301 / 0.765 | 0.077 / 0.080 / 0.762 | 0.069 / 0.071 / 0.903 | 0.063 / 0.079 / 0.798 | 0.385 / 0.331 / 0.747 |
| Fast3R | 4.493 / 3.681 / 0.735 | 0.149 / 0.116 / 0.692 | 0.361 / 0.201 / 0.782 | 0.546 / 0.306 / 0.621 | 0.955 / 0.630 / 0.627 |
| MonST3R | 20.145 / 10.322 / 0.603 | 0.276 / 0.277 / 0.677 | 0.471 / 0.458 / 0.659 | 0.623 / 0.541 / 0.594 | 1.688 / 1.031 / 0.670 |

Dense views:

| Method | DTU | 7-Scenes | NRGBD | ScanNet | TUM-RGBD |
|---|---|---|---|---|---|
| DUSt3R/LSM | 1.284 / 1.349 / 0.720 | 0.022 / 0.029 / 0.709 | 0.035 / 0.024 / 0.838 | 0.026 / 0.022 / 0.784 | 0.620 / 0.474 / 0.718 |
| MASt3R | 1.374 / 1.409 / 0.723 | 0.025 / 0.028 / 0.697 | 0.043 / 0.042 / 0.809 | 0.035 / 0.020 / 0.757 | 0.209 / 0.211 / 0.708 |
| Spann3R | 6.505 / 3.110 / 0.668 | 0.176 / 0.087 / 0.599 | 0.343 / 0.073 / 0.661 | 0.262 / 0.118 / 0.606 | 0.635 / 0.930 / 0.662 |
| CUT3R | 4.710 / 2.413 / 0.699 | 0.025 / 0.028 / 0.665 | 0.076 / 0.029 / 0.782 | 0.042 / 0.030 / 0.693 | 0.740 / 0.595 / 0.665 |
| VGGT | 2.103 / 1.925 / 0.748 | 0.019 / 0.032 / 0.659 | 0.015 / 0.012 / 0.874 | 0.016 / 0.017 / 0.728 | 0.065 / 0.091 / 0.692 |
| Fast3R | 3.647 / 2.319 / 0.725 | 0.046 / 0.057 / 0.636 | 0.059 / 0.028 / 0.772 | 0.200 / 0.097 / 0.625 | 0.711 / 0.337 / 0.610 |
| MonST3R | 14.455 / 7.508 / 0.636 | 0.100 / 0.091 / 0.648 | 0.336 / 0.246 / 0.665 | 0.346 / 0.293 / 0.599 | 1.138 / 0.948 / 0.591 |
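
For reference, the sketch below shows how these three metrics are commonly computed from a predicted and a ground-truth point cloud: accuracy and completeness are the two directions of a nearest-neighbor (Chamfer-style) distance, and normal consistency compares surface normals at the matched points. The KD-tree matching and the symmetric averaging of normal consistency are illustrative assumptions, not necessarily this benchmark's exact protocol.

```python
# Minimal sketch of ACC / Comp / NC between two oriented point clouds.
import numpy as np
from scipy.spatial import cKDTree

def reconstruction_metrics(pred_pts, pred_nrm, gt_pts, gt_nrm):
    """pred_pts (N,3), gt_pts (M,3): points; pred_nrm, gt_nrm: unit normals."""
    gt_tree, pred_tree = cKDTree(gt_pts), cKDTree(pred_pts)

    # Accuracy: mean distance from each predicted point to its nearest GT point.
    d_p2g, idx_p2g = gt_tree.query(pred_pts)
    acc = d_p2g.mean()

    # Completeness: mean distance from each GT point to its nearest prediction.
    d_g2p, idx_g2p = pred_tree.query(gt_pts)
    comp = d_g2p.mean()

    # Normal consistency: |cos| between normals of matched pairs,
    # averaged over both matching directions (an assumed convention).
    nc_p2g = np.abs((pred_nrm * gt_nrm[idx_p2g]).sum(axis=1)).mean()
    nc_g2p = np.abs((gt_nrm * pred_nrm[idx_g2p]).sum(axis=1)).mean()
    nc = 0.5 * (nc_p2g + nc_g2p)
    return acc, comp, nc
```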

Multi-view Pose Estimation

Absolute Trajectory Error (ATE ↓) and Relative Pose Error (RPE-trans ↓ / RPE-rot ↓); each cell reports ATE / RPE-trans / RPE-rot, except ULTRRA, which reports RPE-trans / RPE-rot only. Dataset groups: CO3Dv2 (in distribution), ScanNet & ADT & TUM-Dyn. (long sequence), KITTI Odometry (street driving), Bonn & Sintel & Rel10k (indoor-outdoor), ACID & Syndrone (drone), and ULTRRA (air-ground).

| Method | CO3Dv2 | ScanNet & ADT & TUM-Dyn. | KITTI Odometry | Bonn & Sintel & Rel10k | ACID & Syndrone | ULTRRA |
|---|---|---|---|---|---|---|
| DUSt3R/LSM | 0.903 / 1.325 / 4.312 | 0.139 / 0.102 / 2.394 | 2.935 / 1.135 / 2.832 | 0.077 / 0.557 / 1.657 | 0.126 / 0.379 / 2.836 | 70.350 / 70.390 |
| MASt3R | 0.987 / 1.407 / 3.999 | 0.131 / 0.098 / 2.889 | 1.492 / 0.399 / 0.407 | 0.058 / 0.559 / 1.305 | 0.130 / 0.376 / 2.601 | 71.519 / 78.036 |
| Spann3R | 0.915 / 1.295 / 6.352 | 0.294 / 0.164 / 3.778 | 15.848 / 5.031 / 4.645 | 0.083 / 0.102 / 1.297 | 0.117 / 0.149 / 1.484 | 40.503 / 38.366 |
| CUT3R | 0.847 / 1.209 / 6.361 | 0.185 / 0.133 / 4.471 | 2.421 / 0.747 / 0.669 | 0.033 / 0.039 / 0.500 | 0.071 / 0.090 / 0.914 | 55.135 / 54.395 |
| VGGT | 0.478 / 0.704 / 2.264 | 0.113 / 0.086 / 1.535 | 0.955 / 0.315 / 0.335 | 0.062 / 0.111 / 0.580 | 0.280 / 0.461 / 0.802 | 63.451 / 77.281 |
| Fast3R | 0.698 / 1.035 / 4.352 | 0.499 / 0.391 / 23.739 | 22.109 / 7.573 / 7.366 | 0.111 / 0.170 / 2.017 | 0.436 / 0.518 / 1.979 | 51.149 / 54.150 |
| MonST3R | 2.456 / 3.327 / 23.458 | 0.448 / 0.286 / 12.817 | 2.426 / 0.782 / 0.949 | 0.098 / 0.152 / 0.830 | 0.335 / 0.504 / 1.514 | 70.388 / 77.325 |
| Align3R | 1.027 / 1.550 / 6.499 | 0.425 / 0.215 / 9.430 | 4.611 / 0.817 / 0.600 | 0.076 / 0.091 / 1.083 | 0.150 / 0.179 / 0.977 | 72.010 / 70.638 |
| Easi3R | 0.857 / 1.271 / 5.052 | 0.174 / 0.103 / 2.872 | 3.625 / 0.919 / 0.615 | 0.075 / 0.094 / 1.361 | 0.119 / 0.138 / 1.733 | 62.061 / 71.060 |
| Geo4D | 0.798 / 1.264 / 5.692 | 0.436 / 0.175 / 10.565 | 1.662 / 0.497 / 0.696 | 0.573 / 0.472 / 3.779 | 0.384 / 0.329 / 1.395 | - |
| Aether | 3.168 / 2.366 / 21.643 | 0.644 / 0.273 / 14.804 | 1.553 / 0.744 / 0.744 | 0.195 / 0.122 / 1.610 | 0.152 / 0.097 / 0.796 | - |
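
The pose metrics above follow the standard trajectory-evaluation recipe: align the predicted trajectory to the ground truth, then measure absolute error over camera positions (ATE) and relative error over frame pairs (RPE). Below is a minimal sketch assuming 4x4 camera-to-world poses and a similarity (Umeyama) alignment before ATE; the benchmark's exact alignment and frame-pairing choices may differ.

```python
# Minimal sketch of ATE (RMSE after similarity alignment) and RPE.
import numpy as np

def umeyama_align(src, dst):
    """Least-squares similarity (s, R, t) mapping src onto dst; both (N, 3)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    U, D, Vt = np.linalg.svd(xd.T @ xs / len(src))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (xs ** 2).sum(axis=1).mean()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ate_rmse(pred_xyz, gt_xyz):
    """Absolute Trajectory Error over camera positions, (N, 3) each."""
    s, R, t = umeyama_align(pred_xyz, gt_xyz)
    aligned = (s * (R @ pred_xyz.T)).T + t
    return np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean())

def rpe(pred_poses, gt_poses, delta=1):
    """Relative Pose Error over frame pairs (i, i+delta), poses as 4x4
    camera-to-world matrices. Returns mean trans. error and rot. error (deg)."""
    t_err, r_err = [], []
    for i in range(len(gt_poses) - delta):
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        rel_pr = np.linalg.inv(pred_poses[i]) @ pred_poses[i + delta]
        err = np.linalg.inv(rel_gt) @ rel_pr
        t_err.append(np.linalg.norm(err[:3, 3]))
        cos = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        r_err.append(np.degrees(np.arccos(cos)))
    return float(np.mean(t_err)), float(np.mean(r_err))
```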

Sparse-View Depth Estimation

Each cell reports AbsRel ↓ / δ<1.03 ↑. Datasets span object-centric, indoor, outdoor, and mixed scenes. The first block evaluates scale-normalized predictions; the second evaluates raw metric-scale predictions.

Scale-normalized evaluation:

| Method | DTU | ScanNet | KITTI | ETH3D | T&T |
|---|---|---|---|---|---|
| Robust MVD | 2.490 / 80.056 | 7.468 / 35.651 | 9.419 / 30.505 | 9.302 / 42.909 | 6.379 / 58.409 |
| DUSt3R/LSM | 2.741 / 75.685 | 4.732 / 61.337 | 9.113 / 39.495 | 3.132 / 74.851 | 3.106 / 77.033 |
| MASt3R | 3.343 / 68.301 | 5.949 / 54.516 | 9.542 / 46.805 | 2.471 / 81.291 | 2.381 / 82.262 |
| Spann3R | 6.431 / 38.339 | 7.779 / 33.713 | 10.195 / 30.858 | 5.121 / 54.708 | 5.580 / 52.812 |
| CUT3R | 6.200 / 47.421 | 8.231 / 39.464 | 23.849 / 12.087 | 5.224 / 59.864 | 4.594 / 56.773 |
| VGGT | 1.085 / 94.305 | 4.386 / 64.968 | 9.436 / 41.309 | 1.782 / 86.337 | 2.075 / 85.174 |
| Fast3R | 3.940 / 62.120 | 6.271 / 50.283 | 13.390 / 26.734 | 4.692 / 62.663 | 4.423 / 64.873 |
| MonST3R | 5.346 / 67.977 | 5.557 / 53.309 | 10.191 / 40.274 | 3.368 / 72.624 | 3.289 / 72.491 |

Metric-scale evaluation:

| Method | DTU | ScanNet | KITTI | ETH3D | T&T |
|---|---|---|---|---|---|
| Robust MVD | 2.242 / 84.574 | 8.016 / 35.924 | 10.846 / 25.534 | 10.944 / 35.526 | 6.982 / 60.643 |
| MASt3R | 84.904 / 0.000 | 93.584 / 0.000 | 99.069 / 0.000 | 97.021 / 0.000 | 98.234 / 0.000 |
| CUT3R | 84.904 / 0.000 | 93.584 / 0.000 | 99.069 / 0.000 | 97.022 / 0.000 | 98.234 / 0.000 |
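
The normalized/metric split above comes down to whether predictions are rescaled to the ground truth before scoring. Below is a minimal sketch of AbsRel and the δ inlier ratio; per-image median-ratio rescaling stands in for the normalized setting (an assumption, the benchmark's alignment protocol may differ), while align=None scores raw metric-scale depth. The same routine applies to the video depth table below with thresh=1.25.

```python
# Minimal sketch of AbsRel and the delta inlier ratio for depth maps.
import numpy as np

def depth_metrics(pred, gt, valid, align="median", thresh=1.03):
    """pred, gt: depth maps; valid: boolean mask of pixels with ground truth."""
    p, g = pred[valid].astype(np.float64), gt[valid].astype(np.float64)
    if align == "median":
        p = p * (np.median(g) / np.median(p))  # per-image scale alignment
    abs_rel = np.mean(np.abs(p - g) / g)
    ratio = np.maximum(p / g, g / p)
    delta = np.mean(ratio < thresh)            # fraction of inlier pixels
    return abs_rel, delta
```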

Video Depth Estimation

Each cell reports AbsRel ↓ / δ<1.25 ↑. Datasets: Bonn and TUM Dyn. (indoor), KITTI (outdoor), PointOdyssey (large dynamic motion), Syndrone (drone), and Sintel (mixed). The first block evaluates scale-normalized predictions; the second evaluates raw metric-scale predictions.

Scale-normalized evaluation:

| Method | Bonn | TUM Dyn. | KITTI | PointOdyssey | Syndrone | Sintel |
|---|---|---|---|---|---|---|
| DepthAnyVideo | 0.515 / 25.3 | 0.184 / 84.6 | 0.074 / 95.3 | 0.417 / 61.7 | 0.299 / 83.1 | 0.455 / 47.9 |
| VideoDepthAnything | 0.268 / 48.3 | 1.101 / 89.0 | 0.060 / 98.2 | 0.283 / 70.3 | 0.138 / 92.5 | 1.691 / 45.4 |
| DepthCrafter | 0.107 / 88.3 | 0.159 / 79.5 | 0.120 / 86.2 | 0.144 / 81.3 | 0.380 / 87.5 | 0.354 / 58.2 |
| Marigold | 0.329 / 52.2 | 0.600 / 32.8 | 0.332 / 43.3 | 0.346 / 47.5 | 1.331 / 16.8 | 0.417 / 45.4 |
| DUSt3R/LSM | 0.174 / 83.5 | 0.187 / 79.2 | 0.124 / 84.9 | 0.168 / 77.8 | 0.063 / 96.9 | 0.475 / 59.1 |
| MASt3R | 0.160 / 81.5 | 0.162 / 83.1 | 0.082 / 93.2 | 0.150 / 79.3 | 0.046 / 97.5 | 0.374 / 63.9 |
| Spann3R | 0.205 / 77.4 | 0.204 / 70.6 | 0.449 / 49.1 | 0.303 / 58.4 | 0.241 / 74.5 | 0.587 / 43.3 |
| CUT3R | 0.068 / 95.0 | 0.108 / 84.7 | 0.104 / 89.9 | 0.095 / 88.4 | 0.111 / 89.5 | 0.466 / 56.0 |
| VGGT | 0.056 / 96.3 | 0.068 / 93.9 | 0.051 / 96.6 | 0.026 / 99.0 | 0.075 / 95.9 | 0.242 / 65.9 |
| Fast3R | 0.232 / 69.4 | 0.221 / 71.1 | 0.308 / 46.8 | 0.271 / 66.2 | 0.368 / 44.8 | 0.565 / 48.7 |
| MonST3R | 0.061 / 95.4 | 0.197 / 72.6 | 0.083 / 93.4 | 0.066 / 92.3 | 0.110 / 89.7 | 0.343 / 59.4 |
| Align3R | 0.062 / 96.8 | 0.107 / 90.1 | 0.105 / 89.2 | 0.077 / 93.3 | 0.097 / 92.9 | 0.237 / 69.0 |
| Easi3R | 0.061 / 95.8 | 0.192 / 76.9 | 0.150 / 76.2 | 0.143 / 82.1 | 0.095 / 94.0 | 0.323 / 53.9 |
| Geo4D | 0.060 / 97.8 | 0.096 / 93.2 | 0.086 / 93.8 | 0.082 / 93.0 | 0.105 / 93.1 | 0.205 / 73.2 |
| Aether | 0.582 / 61.2 | 0.192 / 80.6 | 0.065 / 96.2 | 0.123 / 87.9 | 0.145 / 91.1 | 0.343 / 69.4 |
| GeometryCrafter | 0.061 / 96.8 | 0.115 / 87.7 | 0.410 / 53.8 | 0.124 / 83.6 | 0.123 / 90.8 | 0.280 / 72.4 |

Metric-scale evaluation:

| Method | Bonn | TUM Dyn. | KITTI | PointOdyssey | Syndrone | Sintel |
|---|---|---|---|---|---|---|
| MASt3R | 0.549 / 4.6 | 0.633 / 0.9 | 0.754 / 6.4 | 0.749 / 0.2 | 0.967 / 0 | 0.701 / 2.3 |
| CUT3R | 0.097 / 90.3 | 0.135 / 80.6 | 0.118 / 87.4 | 0.127 / 88.1 | 0.824 / 0 | 1.020 / 23.6 |

Novel View Synthesis

Each cell reports PSNR ↑ / SSIM ↑ / LPIPS ↓. DTU is object-centric; RealEstate10k and ScanNet++ are indoor scenes; ACID contains drone scenes.

| Method | DTU | RealEstate10k | ScanNet++ | ACID |
|---|---|---|---|---|
| LSM | 11.68 / 0.3294 / 0.5218 | 14.04 / 0.4388 / 0.4873 | 12.39 / 0.4596 / 0.5479 | 16.73 / 0.4562 / 0.4567 |
| NoPoSplat | 17.91 / 0.6306 / 0.2810 | 24.53 / 0.8450 / 0.1634 | 22.15 / 0.7988 / 0.2359 | 25.35 / 0.7774 / 0.1875 |
| FLARE | 17.01 / 0.5672 / 0.2901 | 22.15 / 0.7126 / 0.2363 | 23.19 / 0.8117 / 0.2201 | 22.44 / 0.6229 / 0.2818 |
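
All three image-quality metrics can be computed with off-the-shelf libraries; the sketch below uses scikit-image for PSNR/SSIM and the lpips package for LPIPS. The AlexNet LPIPS backbone is that library's default, not necessarily the choice used for this table.

```python
# Minimal sketch of PSNR / SSIM / LPIPS between a rendered and a GT image.
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # library default; backbone is an assumption

def nvs_metrics(pred, gt):
    """pred, gt: (H, W, 3) float arrays in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lp
```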

Inference Efficiency

Columns give the number of input views; each cell reports inference time (mean ± std) / peak GPU memory for a single forward pass. OOM denotes out of memory.

| Method | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|
| DUSt3R | 0.35 ± 0.19 / 2.49 | 6.00 ± 0.30 / 2.6 | 13.96 ± 0.86 / 3.65 | 50.37 ± 2.28 / 8.38 | 196.81 ± 6.38 / 27.52 | OOM | OOM | OOM |
| MASt3R | 9.43 ± 0.28 / 2.61 | 14.63 ± 0.52 / 2.68 | 21.38 ± 2.26 / 2.78 | 42.28 ± 9.06 / 3.35 | 117.77 ± 40.83 / 6.87 | 392.23 ± 184.36 / 28.78 | OOM | OOM |
| Spann3R | 0.16 ± 0.12 / 2.79 | 0.28 ± 0.01 / 2.8 | 0.65 ± 0.00 / 2.81 | 1.38 ± 0.01 / 2.84 | 2.81 ± 0.07 / 2.89 | 5.51 ± 0.03 / 2.99 | 11.25 ± 0.16 / 3.19 | 23.64 ± 0.70 / 3.55 |
| CUT3R | 0.19 ± 0.07 / 3.33 | 0.26 ± 0.04 / 3.38 | 0.42 ± 0.03 / 3.48 | 0.78 ± 0.03 / 3.65 | 1.50 ± 0.03 / 4.28 | 3.12 ± 0.31 / 5.54 | 5.76 ± 0.12 / 11.68 | 11.65 ± 0.16 / 17.36 |
| VGGT | 0.32 ± 0.41 / 7.11 | 0.29 ± 0.40 / 7.72 | 0.24 ± 0.01 / 9.06 | 0.72 ± 0.49 / 10.29 | 2.35 ± 0.04 / 12.75 | 4.23 ± 0.07 / 17.66 | 11.76 ± 0.41 / 28.65 | 34.21 ± 2.51 / 50.92 |
| Fast3R | 0.13 ± 0.14 / 4.05 | 0.11 ± 0.03 / 4.26 | 0.15 ± 0.02 / 4.75 | 0.30 ± 0.01 / 5.8 | 0.69 ± 0.02 / 7.25 | 1.78 ± 0.03 / 8.43 | 5.13 ± 0.06 / 10.91 | 16.55 ± 0.12 / 15.75 |
| MonST3R | 0.32 ± 0.25 / 2.79 | 14.78 ± 0.52 / 4.8 | 18.77 ± 0.20 / 7.84 | 35.76 ± 0.35 / 8.9 | 73.19 ± 0.37 / 16.15 | 148.17 ± 0.99 / 32.99 | 605.83 ± 25.24 / 66.66 | OOM |
| Easi3R | 0.35 ± 0.19 / 2.49 | 17.35 ± 1.10 / 3.41 | 24.18 ± 0.76 / 4.15 | 60.12 ± 2.67 / 7.69 | 137.16 ± 10.86 / 15.96 | 273.78 ± 2.08 / 32.53 | 901.05 ± 5.29 / 65.68 | OOM |
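
Numbers like these are straightforward to reproduce: time the forward pass with the GPU synchronized, then read back peak allocated memory. The sketch below is a hypothetical harness; profile_inference and the model(views) call stand in for each GFM's own inference entry point, and the warmup/repeat counts are arbitrary choices here.

```python
# Minimal sketch of a timing / peak-memory harness in PyTorch.
import statistics
import time
import torch

@torch.no_grad()
def profile_inference(model, views, warmup=3, repeats=5):
    """Returns (mean time s, std time s, peak GPU memory GB) for one forward
    pass; `views` is assumed to be an input batch already on the GPU."""
    for _ in range(warmup):          # warm up kernels and caches
        model(views)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        model(views)
        torch.cuda.synchronize()     # wait for async CUDA work to finish
        times.append(time.perf_counter() - start)
    mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return statistics.mean(times), statistics.stdev(times), mem_gb
```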

Findings and Takeaways

What Is the Impact of Tasks with Different Difficulties?

Takeaway 1: Current GFMs are promising but struggle when trained to learn overly complex tasks end to end. Recommendation: carefully decomposing difficult tasks (e.g., jointly predicting geometry, pose, depth, and tracking) into simpler sub-problems can enable more effective learning, especially under limited 3D data.

Do GFMs Generalize Well on Different Data Domains?

Takeaway 2: Diverse, high-quality data is critical for strong generalization. To improve robustness in underrepresented domains, GFMs must be trained on data that covers broader distributions and includes metric-scale annotations.

Hints for Model Architecture Design: ViT or Diffusion? Strong 2D Feature Extractor?

Takeaway 3: No single backbone (feed-forward ViT or diffusion) dominates; the architecture choice should align with task needs. Moreover, leveraging strong 2D feature extractors (e.g., DINO) substantially boosts 3D performance.

Are Current GFMs Ready for Real-Time Perception Systems?

Takeaway 4: As GFMs scale to handle more views and complex tasks, efficiency becomes as critical as accuracy for enabling real-time 3D perception.

Citation

@article{cong2025e3dbench,
  title={E3D-Bench: An End-to-End Benchmark for 3D Geometric Foundation Models},
  author={Cong, Wenyan and Liang, Yiqing and Zhang, Yancheng and Yang, Ziyi and Wang, Yan and Ivanovic, Boris and Pavone, Marco and Chen, Chen and Wang, Zhangyang and Fan, Zhiwen},
  journal={arXiv preprint arXiv:2506.01933},
  year={2025}
}