Direct Voxel Grid Optimization
Super-fast Convergence for Radiance Fields Reconstruction
CVPR 2022 (Oral)
Results on custom casual capturing
A short guide to support custom forward-facing capturing and fly-through video rendering.
Results on real-world captured data
Features
- Speed up NeRF by replacing the MLP with a voxel grid.
- Simple scene representation (see the sketch after this list):
  - Volume densities: dense voxel grid (3D).
  - View-dependent colors: dense feature grid (4D) + shallow MLP.
- PyTorch implementation.
- †PyTorch CUDA extension built just-in-time for another 2--3x speedup.
- †O(N) realization for the distortion loss proposed by mip-nerf 360 (see the sketch after this list):
  - The loss improves our training time and quality.
  - We have released a self-contained PyTorch package: torch_efficient_distloss.
  - Consider a batch of 8192 rays X 256 points:
    - GPU memory consumption: 6192MB => 96MB.
    - Run times for 100 iters: 20 sec => 0.2 sec.
- Supported datasets:
  - Bounded inward-facing:
  - †Unbounded inward-facing:
  - †Forward-facing:
† marks new stuff added after publication.
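A minimal sketch of the grid-based representation in the list above: a dense density grid and a dense feature grid queried by trilinear interpolation, with a shallow MLP decoding view-dependent colors. The class and parameter names are illustrative, not the repository's actual API, and the real implementation adds further details (e.g., positional encodings and progressive grid scaling).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelGridScene(nn.Module):
    """Sketch of the scene representation: density grid + feature grid + shallow MLP."""

    def __init__(self, grid_size=(160, 160, 160), feat_dim=12, hidden=128):
        super().__init__()
        D, H, W = grid_size
        # Volume densities: dense voxel grid (3D), one raw density per voxel.
        self.density = nn.Parameter(torch.zeros(1, 1, D, H, W))
        # View-dependent colors: dense feature grid (4D) decoded by a shallow MLP.
        self.feature = nn.Parameter(torch.zeros(1, feat_dim, D, H, W))
        self.rgb_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3),
        )

    @staticmethod
    def _interp(grid, xyz):
        # Trilinear interpolation; xyz is (N, 3) normalized to [-1, 1].
        out = F.grid_sample(grid, xyz.view(1, -1, 1, 1, 3),
                            mode='bilinear', align_corners=True)
        return out.view(grid.shape[1], -1).T  # (N, C)

    def forward(self, xyz, viewdir):
        raw_density = self._interp(self.density, xyz)[:, 0]   # (N,) raw, pre-activation
        feat = self._interp(self.feature, xyz)                 # (N, feat_dim)
        rgb = torch.sigmoid(self.rgb_mlp(torch.cat([feat, viewdir], dim=-1)))
        return raw_density, rgb
```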
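The O(N) distortion-loss realization can be sketched with per-ray prefix sums. This is a hand-written illustration of the idea, not the torch_efficient_distloss API (whose function names and signatures may differ); `w` are volume-rendering weights and `m` the sorted interval midpoints, following the mip-nerf 360 definition.

```python
import torch

def distortion_loss_prefix_sum(w, m, interval):
    """O(N) distortion loss per ray via prefix sums.

    w:        (n_rays, n_pts) volume-rendering weights of the sampled points.
    m:        (n_rays, n_pts) interval midpoints, ascending along each ray.
    interval: scalar (or broadcastable tensor) length of each sampled interval.
    """
    wm = w * m
    w_before = torch.cumsum(w, dim=-1) - w     # sum_{j<i} w_j
    wm_before = torch.cumsum(wm, dim=-1) - wm  # sum_{j<i} w_j * m_j
    # Pairwise term  sum_{i,j} w_i w_j |m_i - m_j|  without the O(N^2) matrix.
    loss_pair = 2 * (wm * w_before - w * wm_before).sum(-1)
    # Self term  (1/3) * sum_i w_i^2 * interval_i.
    loss_self = (w.pow(2) * interval).sum(-1) / 3
    return (loss_pair + loss_self).mean()
```

Rewriting the pairwise term with cumulative sums avoids materializing the O(N²) matrix of |m_i - m_j| values, which is where the memory and run-time savings quoted above come from.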
Post-activation
Observation. To produce sharp surfaces, we have to activate the density into alpha after interpolation.
Proof. Post-activation can approximate a surface beyond linear arbitrarily closely. Details in the paper.
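A small sketch contrasting the two orderings, assuming the softplus-then-alpha mapping from the paper (the constant density bias is omitted here, and the function and variable names are illustrative only):

```python
import torch
import torch.nn.functional as F

def alpha_pre_vs_post(density_grid, xyz, interval):
    """Contrast the two orderings for a (1, 1, D, H, W) raw density grid.

    xyz:      (N, 3) query points normalized to [-1, 1].
    interval: ray-marching step size.
    """
    grid_xyz = xyz.view(1, -1, 1, 1, 3)

    # Pre-activation (blurry): activate each voxel into alpha, then interpolate.
    alpha_voxels = 1 - torch.exp(-F.softplus(density_grid) * interval)
    alpha_pre = F.grid_sample(alpha_voxels, grid_xyz, mode='bilinear',
                              align_corners=True).view(-1)

    # Post-activation (sharp): interpolate the raw densities, then activate into alpha.
    raw = F.grid_sample(density_grid, grid_xyz, mode='bilinear',
                        align_corners=True).view(-1)
    alpha_post = 1 - torch.exp(-F.softplus(raw) * interval)
    return alpha_pre, alpha_post
```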
Toy example 1. Fitting a surface with a single 2D grid cell.
Toy example 2. Fitting a binary (occupancy) image with a 2D grid.
Ablation study. Up to 2.88 PSNR difference for novel-view synthesis.
Low-density initialization
Observation. The initial alpha values (activated from the volume densities) should be close to 0. We introduce a hyperparameter alpha-init to control it.
Ablation study. The alpha-init should be small enough to achieve good quality and avoid floaters.
Caveat. We empirically find that quality and training time are sensitive to alpha-init. We set alpha-init to 3 different values for the bounded, unbounded inward-facing, and forward-facing datasets respectively. You may want to try a few different values for new datasets.
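As an illustration (the repository's actual config keys and step-size normalization may differ), the target alpha-init can be converted into a constant density bias under an alpha = 1 - exp(-softplus(density + bias) * stepsize) activation:

```python
import numpy as np

def density_bias_from_alpha_init(alpha_init, stepsize=1.0):
    """Solve  1 - exp(-softplus(bias) * stepsize) = alpha_init  for the bias.

    With the density grid initialized to zero, adding this constant bias before
    the softplus makes every initial (activated) alpha equal to alpha_init.
    """
    return float(np.log(np.exp(-np.log(1 - alpha_init) / stepsize) - 1))

# e.g. alpha-init = 1e-6 gives a strongly negative bias => near-zero initial alpha
print(density_bias_from_alpha_init(1e-6))
```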
🤔 It seems that the explicit (grid-based) representation needs careful regularization, while the implicit (MLP-based) one doesn't. We still don't know the root cause of this empirical finding.