MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

Keio University Aoki Lab's Paper Reading on October 23, 2024

Kazuki Ozeki

October 23, 2024

Transcript

  1. MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
    Yuedong Chen et al. (ECCV 2024, Oral Presentation) • Presenter: Keio Univ. M1 Kazuki Ozeki
  2. 2 Objectives
    1. What is cross-scene feed-forward inference? 2. How does MVSplat learn feed-forward 3D Gaussians? 3. Why is MVSplat important?
  3. 3 Abstract
    Predicts feed-forward 3D Gaussians from sparse multi-view images • Efficiently • With high fidelity
    [Slide figure: comparison with pixelSplat (CVPR 2024): 10× fewer parameters, 2× faster, better geometry]
  4. 4 Sparse View Scene Reconstruction
    Per-Scene Optimization • Mainly designs effective regularization terms • Slow inference due to per-scene gradient back-propagation
    Cross-Scene Feed-Forward Inference • Learns priors from large-scale datasets • Infers 3D scenes in a single feed-forward pass (faster)
  5. 5 Related Works: MuRF
    MuRF: Multi-Baseline Radiance Fields (CVPR 2024) • A feed-forward NeRF model built on a target view frustum volume • Suffers from expensive training time
  6. 6 Related Works: pixelSplat
    pixelSplat (CVPR 2024, Best Paper Runner-Up) • The first feed-forward Gaussian model • Predicts a probability distribution over Gaussian positions
    pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
  7. 7 3D Gaussian Splatting
    Represents 3D scenes with 3D Gaussians • Position 𝝁 • Opacity 𝛼 • Covariance 𝜮 • Color 𝒄
    References: Recent Trends in 3D Reconstruction of General Non-Rigid Scenes (Computer Graphics Forum 2024), A Survey on 3D Gaussian Splatting (TPAMI 2024)
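    As a rough illustration of this parameterization (not MVSplat's actual code), the per-Gaussian attributes could be grouped in a small container; the class and field names below are hypothetical.
    ```python
    # Sketch of the per-Gaussian parameters listed above (illustrative names only).
    from dataclasses import dataclass
    import torch

    @dataclass
    class Gaussian3D:
        mu: torch.Tensor     # (3,)   center position
        alpha: torch.Tensor  # ()     opacity in [0, 1]
        sigma: torch.Tensor  # (3, 3) covariance (built from scale and rotation in full 3DGS)
        color: torch.Tensor  # (3,)   RGB color (spherical-harmonic coefficients in full 3DGS)
    ```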
  8. 9 Feature Extraction
    Obtain cross-view aware features 𝐅ᵢ (i = 1, …, K) with Transformers (for the 𝐾 input views)
    [Slide figure: attention visualization, from SuperGlue (CVPR 2020)]
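    A minimal sketch of one cross-view attention block, assuming standard multi-head attention over flattened feature maps; the module name, dimensions, and residual layout are assumptions, not the paper's exact architecture.
    ```python
    import torch
    import torch.nn as nn

    class CrossViewBlock(nn.Module):
        """Toy cross-view attention: tokens of view i attend to tokens of view j."""
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, feat_i, feat_j):
            # feat_i, feat_j: (B, H*W, C) flattened per-view feature maps
            out, _ = self.attn(query=feat_i, key=feat_j, value=feat_j)
            return self.norm(feat_i + out)  # residual + norm, as in a standard Transformer
    ```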
  9. 10 Cost Volume Construction
    Obtain view i's (matching) cost volume 𝑪ᵢ • Warp view j's feature 𝑭ⱼ to view i with the camera projection matrices 𝑷ᵢ, 𝑷ⱼ at each depth candidate dₘ (m = 1, …, D) • Compute the correlation between 𝑭ᵢ and the warped features, normalized by the channel dimension • High cost = a surface point
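    A simplified plane-sweep version of this step: for each depth candidate, view j's features are warped into view i and correlated with view i's features, normalized by the channel dimension. The warp helper is assumed to exist (it would use 𝑷ᵢ and 𝑷ⱼ); names and shapes are illustrative.
    ```python
    import torch

    def build_cost_volume(feat_i, feat_j, depth_candidates, warp_to_view_i):
        # feat_i, feat_j: (B, C, H, W); depth_candidates: iterable of D scalar depths
        B, C, H, W = feat_i.shape
        costs = []
        for d in depth_candidates:
            warped_j = warp_to_view_i(feat_j, d)       # view j's features seen from view i at depth d
            corr = (feat_i * warped_j).sum(dim=1) / C  # dot-product correlation / channel dim
            costs.append(corr)                         # high correlation suggests a surface at depth d
        return torch.stack(costs, dim=1)               # (B, D, H, W) cost volume for view i
    ```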
  10. 11 Cost Volume Refinement
    Concatenate 𝑭ᵢ and 𝑪ᵢ → refined cost volume 𝑪̃ᵢ • Using a 2D U-Net with cross-view attention layers • Enhances quality in regions whose content is visible from only one view
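    A sketch of the refinement step, under the assumption that the network predicts a residual added to the raw cost volume; the U-Net itself (with its cross-view attention layers) is treated as a black box here.
    ```python
    import torch

    def refine_cost_volume(feat_i, cost_i, unet):
        # feat_i: (B, C, H, W) features; cost_i: (B, D, H, W) raw cost volume
        x = torch.cat([feat_i, cost_i], dim=1)  # channel-wise concatenation
        residual = unet(x)                      # 2D U-Net; assumed to output (B, D, H, W)
        return cost_i + residual                # refined cost volume
    ```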
  11. 12 Depth Estimation
    Obtain the per-view depth map (𝐻 × 𝑊): normalize the refined cost volume 𝑪̃ᵢ over the 𝐷 depth candidates, then take the weighted average of all depth candidates • The normalized cost also gives the matching confidence
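    The weighted average described above is standard softmax depth regression; a minimal sketch (tensor layout assumed):
    ```python
    import torch
    import torch.nn.functional as F

    def estimate_depth(refined_cost, depth_candidates):
        # refined_cost: (B, D, H, W); depth_candidates: (D,) tensor of candidate depths
        prob = F.softmax(refined_cost, dim=1)                       # normalize over the D candidates
        depth = (prob * depth_candidates.view(1, -1, 1, 1)).sum(1)  # (B, H, W) weighted average
        confidence = prob.max(dim=1).values                         # (B, H, W) matching confidence
        return depth, confidence
    ```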
  12. 13 Gaussian Parameters Prediction
    Gaussian position 𝝁: unproject the predicted depth 𝑽ᵢ • Opacity 𝛼: feed the matching confidence to two conv. layers • Covariance 𝜮 and color 𝒄: feed 𝑭ᵢ, 𝑪̃ᵢ, and the multi-view images to two conv. layers
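    Unprojecting a per-pixel depth map to 3D Gaussian centers, assuming a pinhole camera with known inverse intrinsics and camera-to-world pose (a generic sketch, not MVSplat's exact code):
    ```python
    import torch

    def unproject_depth(depth, K_inv, cam_to_world):
        # depth: (H, W); K_inv: (3, 3) inverse intrinsics; cam_to_world: (4, 4) pose
        # CPU tensors assumed for brevity.
        H, W = depth.shape
        v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3) homogeneous pixels
        rays = pix @ K_inv.T                                           # camera-space ray directions
        pts_cam = rays * depth.unsqueeze(-1)                           # scale rays by per-pixel depth
        pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)      # homogeneous 3D points
        mu = (pts_h.reshape(-1, 4) @ cam_to_world.T)[:, :3]            # (H*W, 3) Gaussian centers
        return mu
    ```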
  13. 14 Training Loss
    Supervised with only a rendering loss against the GT image • A linear combination of ℓ₂ and LPIPS losses • End-to-end differentiable learning
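    A sketch of such a rendering loss, using the `lpips` package as an example LPIPS implementation; the loss weight is a placeholder, not the paper's value.
    ```python
    import torch
    import lpips  # pip install lpips

    lpips_fn = lpips.LPIPS(net="vgg")  # example perceptual metric backbone

    def rendering_loss(rendered, target, w_lpips=0.05):
        # rendered, target: (B, 3, H, W) images scaled to [-1, 1] (the range LPIPS expects)
        l2 = torch.mean((rendered - target) ** 2)
        perceptual = lpips_fn(rendered, target).mean()
        return l2 + w_lpips * perceptual  # linear combination of l2 and LPIPS terms
    ```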
  14. 15 Experimental Setting
    Datasets • RealEstate10K • ACID
    Metrics • PSNR (pixel-level), SSIM (patch-level), LPIPS (feature-level) • Inference (rendering) time • Number of model parameters
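    For reference, the pixel-level metric can be computed directly; a minimal PSNR helper (SSIM and LPIPS require dedicated implementations):
    ```python
    import torch

    def psnr(pred, target, max_val=1.0):
        # Peak signal-to-noise ratio between two images with values in [0, max_val]
        mse = torch.mean((pred - target) ** 2)
        return 10.0 * torch.log10(max_val ** 2 / mse)
    ```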
  15. 16 Quantitative Results
    Comparison with the state of the art among feed-forward methods → surpasses all SOTA models in visual quality (note that MuRF is expensive to train)
  16. 20 Contributions
    Learns feed-forward Gaussians with a cost volume • Significantly outperforms pixelSplat in both quality and efficiency
    To be further explored • Training with large-scale datasets • Adaptation to dynamic scenes