Benchmark Results - E3D-Bench
End-to-End 3D Geometric Foundation Models
1The University of Texas at Austin 2Brown University
3University of Central Florida 4NVIDIA Research 5Stanford University
Abstract
Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged that directly predicts dense 3D representations in a single feed-forward pass, removing the dependence on precomputed camera parameters, which can be slow to obtain or simply unavailable. Since late 2023, the field has exploded with diverse variants. With the rapid proliferation of 3D GFMs, we ask:
Q1 Can GFMs serve as an effective and robust foundation for diverse 3D tasks and scenarios?
Q2 Can GFMs serve as an efficient foundation, especially for latency-constrained 3D applications?
In this work, we present the first comprehensive benchmark for 3D GFMs, covering five core tasks (sparse-view depth estimation, video depth estimation, 3D reconstruction, multi-view pose estimation, and novel view synthesis) and spanning both standard and challenging out-of-distribution datasets. Our standardized toolkit automates dataset handling, evaluation protocols, and metric computation to ensure fair, reproducible comparisons. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains, and derive key insights to guide future model scaling and optimization. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial AI.
Effectiveness
3D Reconstruction: extremely sparse views (top block) vs. dense views (bottom block).
Method | DTU | 7-Scenes | NRGBD | ScanNet | TUM-RGBD | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ACC ↓ | Comp ↓ | NC ↑ | ACC ↓ | Comp ↓ | NC ↑ | ACC ↓ | Comp ↓ | NC ↑ | ACC ↓ | Comp ↓ | NC ↑ | ACC ↓ | Comp ↓ | NC ↑ | |
DUSt3R/LSM | 1.731 | 1.936 | 0.786 | 0.146 | 0.181 | 0.744 | 0.144 | 0.154 | 0.867 | 0.474 | 0.420 | 0.714 | 1.108 | 0.746 | 0.724 |
MASt3R | 1.895 | 2.003 | 0.788 | 0.262 | 0.254 | 0.732 | 0.113 | 0.102 | 0.810 | 0.467 | 0.389 | 0.701 | 0.738 | 0.747 | 0.739 |
Spann3R | 6.275 | 5.460 | 0.705 | 0.255 | 0.188 | 0.653 | 0.262 | 0.262 | 0.628 | 0.487 | 0.408 | 0.617 | 1.561 | 1.002 | 0.621 |
FLARE | 3.406 | 3.950 | 0.491 | 0.152 | 0.154 | 0.704 | 0.060 | 0.056 | 0.839 | 0.357 | 0.302 | 0.561 | 0.515 | 0.486 | 0.677 |
CUT3R | 6.885 | 5.022 | 0.727 | 0.118 | 0.142 | 0.717 | 0.104 | 0.078 | 0.828 | 0.260 | 0.238 | 0.692 | 0.587 | 0.553 | 0.683 |
VGGT | 2.716 | 2.301 | 0.765 | 0.077 | 0.080 | 0.762 | 0.069 | 0.071 | 0.903 | 0.063 | 0.079 | 0.798 | 0.385 | 0.331 | 0.747 |
Fast3R | 4.493 | 3.681 | 0.735 | 0.149 | 0.116 | 0.692 | 0.361 | 0.201 | 0.782 | 0.546 | 0.306 | 0.621 | 0.955 | 0.630 | 0.627 |
MonST3R | 20.145 | 10.322 | 0.603 | 0.276 | 0.277 | 0.677 | 0.471 | 0.458 | 0.659 | 0.623 | 0.541 | 0.594 | 1.688 | 1.031 | 0.670 |
DUSt3R/LSM | 1.284 | 1.349 | 0.720 | 0.022 | 0.029 | 0.709 | 0.035 | 0.024 | 0.838 | 0.026 | 0.022 | 0.784 | 0.620 | 0.474 | 0.718 |
MASt3R | 1.374 | 1.409 | 0.723 | 0.025 | 0.028 | 0.697 | 0.043 | 0.042 | 0.809 | 0.035 | 0.020 | 0.757 | 0.209 | 0.211 | 0.708 |
Spann3R | 6.505 | 3.110 | 0.668 | 0.176 | 0.087 | 0.599 | 0.343 | 0.073 | 0.661 | 0.262 | 0.118 | 0.606 | 0.635 | 0.930 | 0.662 |
CUT3R | 4.710 | 2.413 | 0.699 | 0.025 | 0.028 | 0.665 | 0.076 | 0.029 | 0.782 | 0.042 | 0.030 | 0.693 | 0.740 | 0.595 | 0.665 |
VGGT | 2.103 | 1.925 | 0.748 | 0.019 | 0.032 | 0.659 | 0.015 | 0.012 | 0.874 | 0.016 | 0.017 | 0.728 | 0.065 | 0.091 | 0.692 |
Fast3R | 3.647 | 2.319 | 0.725 | 0.046 | 0.057 | 0.636 | 0.059 | 0.028 | 0.772 | 0.200 | 0.097 | 0.625 | 0.711 | 0.337 | 0.610 |
MonST3R | 14.455 | 7.508 | 0.636 | 0.100 | 0.091 | 0.648 | 0.336 | 0.246 | 0.665 | 0.346 | 0.293 | 0.599 | 1.138 | 0.948 | 0.591 |
Dataset domains: DTU is object-centric; 7-Scenes, NRGBD, ScanNet, and TUM-RGBD are indoor scenes.
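For reference, ACC (accuracy) and Comp (completeness) in the table above are the two directions of the Chamfer distance between predicted and ground-truth point clouds, and NC (normal consistency) compares matched surface normals. The following is a minimal NumPy sketch using brute-force nearest neighbors; the benchmark's actual protocol additionally aligns predictions to ground truth and may subsample points, so treat this as illustrative rather than the exact evaluation code.

```python
import numpy as np

def chamfer_acc_comp(pred, gt):
    """ACC: mean distance from each predicted point to its nearest
    ground-truth point; Comp: the same in the opposite direction.
    pred: (N, 3), gt: (M, 3). Brute-force O(N*M) for clarity."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    acc = float(d.min(axis=1).mean())   # pred -> gt
    comp = float(d.min(axis=0).mean())  # gt -> pred
    return acc, comp

def normal_consistency(pred_normals, gt_normals):
    """NC: mean absolute cosine similarity between matched unit normals."""
    return float(np.abs((pred_normals * gt_normals).sum(axis=-1)).mean())
```

In practice, nearest neighbors are found with a k-d tree rather than a dense distance matrix, and a global pose/scale alignment (e.g. ICP) is applied first.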
Method | CO3Dv2 | ScanNet & ADT & TUM-Dyn. | KITTI Odometry | Bonn & Sintel & Rel10k | ACID & Syndrone | ULTRRA | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ATE ↓ | RPEtrans ↓ | RPErot ↓ | ATE ↓ | RPEtrans ↓ | RPErot ↓ | ATE ↓ | RPEtrans ↓ | RPErot ↓ | ATE ↓ | RPEtrans ↓ | RPErot ↓ | ATE ↓ | RPEtrans ↓ | RPErot ↓ | RPEtrans ↓ | RPErot ↓ | |
DUSt3R/LSM | 0.903 | 1.325 | 4.312 | 0.139 | 0.102 | 2.394 | 2.935 | 1.135 | 2.832 | 0.077 | 0.557 | 1.657 | 0.126 | 0.379 | 2.836 | 70.350 | 70.390 |
MASt3R | 0.987 | 1.407 | 3.999 | 0.131 | 0.098 | 2.889 | 1.492 | 0.399 | 0.407 | 0.058 | 0.559 | 1.305 | 0.130 | 0.376 | 2.601 | 71.519 | 78.036 |
Spann3R | 0.915 | 1.295 | 6.352 | 0.294 | 0.164 | 3.778 | 15.848 | 5.031 | 4.645 | 0.083 | 0.102 | 1.297 | 0.117 | 0.149 | 1.484 | 40.503 | 38.366 |
CUT3R | 0.847 | 1.209 | 6.361 | 0.185 | 0.133 | 4.471 | 2.421 | 0.747 | 0.669 | 0.033 | 0.039 | 0.500 | 0.071 | 0.090 | 0.914 | 55.135 | 54.395 |
VGGT | 0.478 | 0.704 | 2.264 | 0.113 | 0.086 | 1.535 | 0.955 | 0.315 | 0.335 | 0.062 | 0.111 | 0.580 | 0.280 | 0.461 | 0.802 | 63.451 | 77.281 |
Fast3R | 0.698 | 1.035 | 4.352 | 0.499 | 0.391 | 23.739 | 22.109 | 7.573 | 7.366 | 0.111 | 0.170 | 2.017 | 0.436 | 0.518 | 1.979 | 51.149 | 54.150 |
MonST3R | 2.456 | 3.327 | 23.458 | 0.448 | 0.286 | 12.817 | 2.426 | 0.782 | 0.949 | 0.098 | 0.152 | 0.830 | 0.335 | 0.504 | 1.514 | 70.388 | 77.325 |
Align3R | 1.027 | 1.550 | 6.499 | 0.425 | 0.215 | 9.430 | 4.611 | 0.817 | 0.600 | 0.076 | 0.091 | 1.083 | 0.150 | 0.179 | 0.977 | 72.010 | 70.638 |
Easi3R | 0.857 | 1.271 | 5.052 | 0.174 | 0.103 | 2.872 | 3.625 | 0.919 | 0.615 | 0.075 | 0.094 | 1.361 | 0.119 | 0.138 | 1.733 | 62.061 | 71.060 |
Geo4D | 0.798 | 1.264 | 5.692 | 0.436 | 0.175 | 10.565 | 1.662 | 0.497 | 0.696 | 0.573 | 0.472 | 3.779 | 0.384 | 0.329 | 1.395 | - | - |
Aether | 3.168 | 2.366 | 21.643 | 0.644 | 0.273 | 14.804 | 1.553 | 0.744 | 0.744 | 0.195 | 0.122 | 1.610 | 0.152 | 0.097 | 0.796 | - | - |
Dataset domains: CO3Dv2 is in-distribution; ScanNet, ADT, and TUM-Dyn. are long sequences; KITTI Odometry is street driving; Bonn, Sintel, and Rel10k are indoor-outdoor; ACID and Syndrone are drone scenes; ULTRRA is air-ground.
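The pose metrics above follow standard trajectory-evaluation practice: ATE is the RMSE of per-frame translation error after globally aligning the predicted trajectory to ground truth, while RPE measures the error of relative motion over a fixed frame gap, which cancels accumulated drift. A simplified sketch, assuming the trajectories are already aligned (standard protocols fit an SE(3)/Sim(3) transform first):

```python
import numpy as np

def ate_rmse(pred_t, gt_t):
    """ATE: RMSE of per-frame camera-translation error.
    pred_t, gt_t: (T, 3) arrays of camera positions, pre-aligned."""
    err = np.linalg.norm(pred_t - gt_t, axis=-1)
    return float(np.sqrt((err ** 2).mean()))

def rpe_trans(pred_t, gt_t, delta=1):
    """RPE (translation part): error of the relative motion between
    frames i and i+delta, insensitive to global drift."""
    rel_pred = pred_t[delta:] - pred_t[:-delta]
    rel_gt = gt_t[delta:] - gt_t[:-delta]
    return float(np.linalg.norm(rel_pred - rel_gt, axis=-1).mean())
```

RPErot is computed analogously from relative rotations via the geodesic angle between rotation matrices; it is omitted here for brevity.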
Sparse-View Depth Estimation: normalized-scale results (top block) vs. metric-scale results (bottom block).
Method | DTU | ScanNet | KITTI | ETH3D | T&T | |||||
---|---|---|---|---|---|---|---|---|---|---|
AbsRel ↓ | δ<1.03 ↑ | AbsRel ↓ | δ<1.03 ↑ | AbsRel ↓ | δ<1.03 ↑ | AbsRel ↓ | δ<1.03 ↑ | AbsRel ↓ | δ<1.03 ↑ | |
Robust MVD | 2.490 | 80.056 | 7.468 | 35.651 | 9.419 | 30.505 | 9.302 | 42.909 | 6.379 | 58.409 |
DUSt3R/LSM | 2.741 | 75.685 | 4.732 | 61.337 | 9.113 | 39.495 | 3.132 | 74.851 | 3.106 | 77.033 |
MASt3R | 3.343 | 68.301 | 5.949 | 54.516 | 9.542 | 46.805 | 2.471 | 81.291 | 2.381 | 82.262 |
Spann3R | 6.431 | 38.339 | 7.779 | 33.713 | 10.195 | 30.858 | 5.121 | 54.708 | 5.580 | 52.812 |
CUT3R | 6.200 | 47.421 | 8.231 | 39.464 | 23.849 | 12.087 | 5.224 | 59.864 | 4.594 | 56.773 |
VGGT | 1.085 | 94.305 | 4.386 | 64.968 | 9.436 | 41.309 | 1.782 | 86.337 | 2.075 | 85.174 |
Fast3R | 3.940 | 62.120 | 6.271 | 50.283 | 13.390 | 26.734 | 4.692 | 62.663 | 4.423 | 64.873 |
MonST3R | 5.346 | 67.977 | 5.557 | 53.309 | 10.191 | 40.274 | 3.368 | 72.624 | 3.289 | 72.491 |
Robust MVD | 2.242 | 84.574 | 8.016 | 35.924 | 10.846 | 25.534 | 10.944 | 35.526 | 6.982 | 60.643 |
MASt3R | 84.904 | 0.000 | 93.584 | 0.000 | 99.069 | 0.000 | 97.021 | 0.000 | 98.234 | 0.000 |
CUT3R | 84.904 | 0.000 | 93.584 | 0.000 | 99.069 | 0.000 | 97.022 | 0.000 | 98.234 | 0.000 |
Dataset domains: DTU is object-centric; ScanNet is an indoor scene; KITTI is an outdoor scene; ETH3D and T&T are mixed scenes.
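AbsRel is the mean absolute relative depth error, and δ<1.03 is the percentage of pixels whose predicted-to-ground-truth depth ratio stays within the threshold. A minimal sketch of both metrics; valid-pixel masking and the scale alignment applied in the normalized setting are omitted:

```python
import numpy as np

def absrel(pred, gt):
    """Mean absolute relative error |pred - gt| / gt over valid depths."""
    return float((np.abs(pred - gt) / gt).mean())

def delta_inlier(pred, gt, thresh=1.03):
    """Percentage of pixels with max(pred/gt, gt/pred) < thresh."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(100.0 * (ratio < thresh).mean())
```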
Video Depth Estimation: normalized-scale results (top block) vs. metric-scale results (bottom block).
Method | Bonn | TUM Dyn | KITTI | PointOdyssey | Syndrone | Sintel | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
AbsRel ↓ | δ<1.25 ↑ | AbsRel ↓ | δ<1.25 ↑ | AbsRel ↓ | δ<1.25 ↑ | AbsRel ↓ | δ<1.25 ↑ | AbsRel ↓ | δ<1.25 ↑ | AbsRel ↓ | δ<1.25 ↑ | |
DepthAnyVideo | 0.515 | 25.3 | 0.184 | 84.6 | 0.074 | 95.3 | 0.417 | 61.7 | 0.299 | 83.1 | 0.455 | 47.9 |
VideoDepthAnything | 0.268 | 48.3 | 1.101 | 89.0 | 0.060 | 98.2 | 0.283 | 70.3 | 0.138 | 92.5 | 1.691 | 45.4 |
DepthCrafter | 0.107 | 88.3 | 0.159 | 79.5 | 0.120 | 86.2 | 0.144 | 81.3 | 0.380 | 87.5 | 0.354 | 58.2 |
Marigold | 0.329 | 52.2 | 0.600 | 32.8 | 0.332 | 43.3 | 0.346 | 47.5 | 1.331 | 16.8 | 0.417 | 45.4 |
DUSt3R/LSM | 0.174 | 83.5 | 0.187 | 79.2 | 0.124 | 84.9 | 0.168 | 77.8 | 0.063 | 96.9 | 0.475 | 59.1 |
MASt3R | 0.160 | 81.5 | 0.162 | 83.1 | 0.082 | 93.2 | 0.150 | 79.3 | 0.046 | 97.5 | 0.374 | 63.9 |
Spann3R | 0.205 | 77.4 | 0.204 | 70.6 | 0.449 | 49.1 | 0.303 | 58.4 | 0.241 | 74.5 | 0.587 | 43.3 |
CUT3R | 0.068 | 95.0 | 0.108 | 84.7 | 0.104 | 89.9 | 0.095 | 88.4 | 0.111 | 89.5 | 0.466 | 56.0 |
VGGT | 0.056 | 96.3 | 0.068 | 93.9 | 0.051 | 96.6 | 0.026 | 99.0 | 0.075 | 95.9 | 0.242 | 65.9 |
Fast3R | 0.232 | 69.4 | 0.221 | 71.1 | 0.308 | 46.8 | 0.271 | 66.2 | 0.368 | 44.8 | 0.565 | 48.7 |
MonST3R | 0.061 | 95.4 | 0.197 | 72.6 | 0.083 | 93.4 | 0.066 | 92.3 | 0.110 | 89.7 | 0.343 | 59.4 |
Align3R | 0.062 | 96.8 | 0.107 | 90.1 | 0.105 | 89.2 | 0.077 | 93.3 | 0.097 | 92.9 | 0.237 | 69.0 |
Easi3R | 0.061 | 95.8 | 0.192 | 76.9 | 0.150 | 76.2 | 0.143 | 82.1 | 0.095 | 94.0 | 0.323 | 53.9 |
Geo4D | 0.060 | 97.8 | 0.096 | 93.2 | 0.086 | 93.8 | 0.082 | 93.0 | 0.105 | 93.1 | 0.205 | 73.2 |
Aether | 0.582 | 61.2 | 0.192 | 80.6 | 0.065 | 96.2 | 0.123 | 87.9 | 0.145 | 91.1 | 0.343 | 69.4 |
GeometryCrafter | 0.061 | 96.8 | 0.115 | 87.7 | 0.410 | 53.8 | 0.124 | 83.6 | 0.123 | 90.8 | 0.280 | 72.4 |
MASt3R | 0.549 | 4.6 | 0.633 | 0.9 | 0.754 | 6.4 | 0.749 | 0.2 | 0.967 | 0 | 0.701 | 2.3 |
CUT3R | 0.097 | 90.3 | 0.135 | 80.6 | 0.118 | 87.4 | 0.127 | 88.1 | 0.824 | 0 | 1.020 | 23.6 |
Dataset domains: Bonn and TUM Dyn. are indoor scenes; KITTI is an outdoor scene; PointOdyssey features large dynamic motion; Syndrone is a drone scene; Sintel is a mixed scene.
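The normalized rows above are typically obtained by scale-aligning predicted depth to the ground truth before computing metrics, while the metric rows skip this step, which is why scale-ambiguous models degrade sharply there. One common alignment choice (an illustrative sketch, not necessarily the benchmark's exact variant) is the ratio of medians:

```python
import numpy as np

def median_align(pred_depth, gt_depth):
    """Rescale predicted depth so its median matches the ground truth's,
    a common choice for scale-invariant ('normalized') evaluation."""
    scale = np.median(gt_depth) / np.median(pred_depth)
    return pred_depth * scale
```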
Method | DTU | RealEstate10k | ScanNet++ | ACID | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | |
LSM | 11.68 | 0.3294 | 0.5218 | 14.04 | 0.4388 | 0.4873 | 12.39 | 0.4596 | 0.5479 | 16.73 | 0.4562 | 0.4567 |
NoPoSplat | 17.91 | 0.6306 | 0.2810 | 24.53 | 0.8450 | 0.1634 | 22.15 | 0.7988 | 0.2359 | 25.35 | 0.7774 | 0.1875 |
FLARE | 17.01 | 0.5672 | 0.2901 | 22.15 | 0.7126 | 0.2363 | 23.19 | 0.8117 | 0.2201 | 22.44 | 0.6229 | 0.2818 |
Dataset domains: DTU is object-centric; RealEstate10k and ScanNet++ are indoor scenes; ACID is a drone scene.
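Of the novel view synthesis metrics above, PSNR is the simplest: a log-scale function of the mean squared error between rendered and ground-truth images. A minimal sketch for images normalized to [0, 1]:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB for images in [0, max_val]."""
    mse = float(((pred - gt) ** 2).mean())
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

SSIM requires windowed luminance/contrast/structure statistics and LPIPS a learned feature network; in practice these come from libraries such as scikit-image (`structural_similarity`) and the `lpips` package.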
Inference Efficiency
Method | 2 views | 4 views | 8 views | 16 views | 32 views | 64 views | 128 views | 256 views | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time (s) ↓ | GPU (GB) ↓ | Time (s) ↓ | GPU (GB) ↓ | Time (s) ↓ | GPU (GB) ↓ | Time (s) ↓ | GPU (GB) ↓ | Time (s) ↓ | GPU (GB) ↓ | Time (s) ↓ | GPU (GB) ↓ | Time (s) ↓ | GPU (GB) ↓ | Time (s) ↓ | GPU (GB) ↓ | |
DUSt3R | 0.35 ± 0.19 | 2.49 | 6.00 ± 0.30 | 2.6 | 13.96 ± 0.86 | 3.65 | 50.37 ± 2.28 | 8.38 | 196.81 ± 6.38 | 27.52 | OOM | OOM | OOM | OOM | OOM | OOM |
MASt3R | 9.43 ± 0.28 | 2.61 | 14.63 ± 0.52 | 2.68 | 21.38 ± 2.26 | 2.78 | 42.28 ± 9.06 | 3.35 | 117.77 ± 40.83 | 6.87 | 392.23 ± 184.36 | 28.78 | OOM | OOM | OOM | OOM |
Spann3R | 0.16 ± 0.12 | 2.79 | 0.28 ± 0.01 | 2.8 | 0.65 ± 0.00 | 2.81 | 1.38 ± 0.01 | 2.84 | 2.81 ± 0.07 | 2.89 | 5.51 ± 0.03 | 2.99 | 11.25 ± 0.16 | 3.19 | 23.64 ± 0.70 | 3.55 |
CUT3R | 0.19 ± 0.07 | 3.33 | 0.26 ± 0.04 | 3.38 | 0.42 ± 0.03 | 3.48 | 0.78 ± 0.03 | 3.65 | 1.50 ± 0.03 | 4.28 | 3.12 ± 0.31 | 5.54 | 5.76 ± 0.12 | 11.68 | 11.65 ± 0.16 | 17.36 |
VGGT | 0.32 ± 0.41 | 7.11 | 0.29 ± 0.40 | 7.72 | 0.24 ± 0.01 | 9.06 | 0.72 ± 0.49 | 10.29 | 2.35 ± 0.04 | 12.75 | 4.23 ± 0.07 | 17.66 | 11.76 ± 0.41 | 28.65 | 34.21 ± 2.51 | 50.92 |
Fast3R | 0.13 ± 0.14 | 4.05 | 0.11 ± 0.03 | 4.26 | 0.15 ± 0.02 | 4.75 | 0.30 ± 0.01 | 5.8 | 0.69 ± 0.02 | 7.25 | 1.78 ± 0.03 | 8.43 | 5.13 ± 0.06 | 10.91 | 16.55 ± 0.12 | 15.75 |
MonST3R | 0.32 ± 0.25 | 2.79 | 14.78 ± 0.52 | 4.8 | 18.77 ± 0.20 | 7.84 | 35.76 ± 0.35 | 8.9 | 73.19 ± 0.37 | 16.15 | 148.17 ± 0.99 | 32.99 | 605.83 ± 25.24 | 66.66 | OOM | OOM |
Easi3R | 0.35 ± 0.19 | 2.49 | 17.35 ± 1.10 | 3.41 | 24.18 ± 0.76 | 4.15 | 60.12 ± 2.67 | 7.69 | 137.16 ± 10.86 | 15.96 | 273.78 ± 2.08 | 32.53 | 901.05 ± 5.29 | 65.68 | OOM | OOM |
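The timing numbers above (mean ± std) come from repeated inference runs. A generic harness for reproducing such measurements might look like the following; the warmup runs and the note about CUDA synchronization are our assumptions about sound GPU timing practice, not details taken from the benchmark's scripts:

```python
import statistics
import time

def benchmark(fn, n_runs=5, warmup=2):
    """Time fn over n_runs repetitions after warmup, returning
    (mean, std) of wall-clock seconds. For GPU inference, call
    torch.cuda.synchronize() inside fn so asynchronous kernel
    launches are not mismeasured as near-zero times."""
    for _ in range(warmup):
        fn()  # warmup: trigger lazy initialization, caching, autotuning
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)
```

Peak GPU memory (the GPU column) would be read separately, e.g. via `torch.cuda.max_memory_allocated()`, assuming the benchmark reports allocator peaks.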
Findings and Takeaways
What Is the Impact of Task Difficulty?
- Multi-view geometry inference is inherently harder than two-view (pairwise) inference.
- Directly predicting dense 3D scene representations is much more challenging than estimating individual 3D attributes like depth and camera poses.
- Metric-scale depth estimation remains a key challenge for GFMs.
- Joint prediction of multiple geometric attributes (e.g., pose, depth, matching) may underlie recent performance gains.
Takeaway 1: Current GFMs are promising but face significant challenges when learning from overly complex tasks. Recommendation: Carefully decomposing difficult tasks (e.g., jointly predicting geometry, pose, depth, and tracking) into simpler sub-problems can facilitate more effective learning, especially under limited 3D data.
Do GFMs Generalize Well on Different Data Domains?
- GFMs struggle to generalize in domains with extreme data scarcity.
Takeaway 2: Diverse, high-quality data is critical for strong generalization. To improve robustness in underrepresented domains, GFMs must be trained on data that covers broader distributions and includes metric-scale annotations.
Hints for Model Architecture Design: ViT or Diffusion? Strong 2D Feature Extractor?
- No single design (feed-forward ViT or diffusion) is universally superior.
- Stronger 2D foundation models can significantly enhance 3D GFMs.
Takeaway 3: No single backbone (feed-forward ViT or diffusion) dominates; architecture choice should align with task needs. Moreover, leveraging strong 2D feature extractors (e.g., DINO) substantially boosts 3D performance.
Are Current GFMs Ready for Real-Time Perception Systems?
- Despite progress, GFMs still lack the efficiency required for real-time 3D applications.
Takeaway 4: As GFMs scale to handle more views and complex tasks, efficiency becomes as critical as accuracy for enabling real-time 3D perception.
Citation
@article{cong2025e3dbench,
title={E3D-Bench: An End-to-End Benchmark for 3D Geometric Foundation Models},
author={Cong, Wenyan and Liang, Yiqing and Zhang, Yancheng and Yang, Ziyi and Wang, Yan and Ivanovic, Boris and Pavone, Marco and Chen, Chen and Wang, Zhangyang and Fan, Zhiwen},
journal={arXiv preprint arXiv:2506.01933},
year={2025}
}