MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

Keio University Aoki Lab's Paper Reading on October 23, 2024

Kazuki Ozeki

October 23, 2024

Transcript

  1. MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
    Yuedong Chen et al. (ECCV 2024, Oral Presentation) • Presenter: Keio Univ. M1 Kazuki Ozeki
  2. 2 Objectives
    1. What is cross-scene feed-forward inference? 2. How does MVSplat learn feed-forward 3D Gaussians? 3. Why is MVSplat important?
  3. 3 Abstract
    Predicts feed-forward 3D Gaussians from sparse multi-view images • Efficiently • With high fidelity
    [Slide figure: comparison with pixelSplat (CVPR 2024): 10× fewer parameters, 2× faster, better geometry]
  4. 4 Sparse View Scene Reconstruction
    Per-Scene Optimization • Mainly designs effective regularization terms • Slow inference due to per-scene gradient back-propagation
    Cross-Scene Feed-Forward Inference • Learns priors from large-scale datasets • Infers 3D scenes in a single feed-forward pass (faster)
  5. 5 Related Works: MuRF
    MuRF: Multi-Baseline Radiance Fields (CVPR 2024) • A feed-forward NeRF model built on a target view frustum volume • Suffers from expensive training time
  6. 6 Related Works: pixelSplat
    pixelSplat (CVPR 2024, Best Paper Runner-Up) • The first feed-forward Gaussian model • Predicts a probability distribution over Gaussian positions
    pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
  7. 7 3D Gaussian Splatting
    Represents 3D scenes with 3D Gaussians • Position 𝝁 • Opacity 𝛼 • Covariance 𝜮 • Color 𝒄
    References: Recent Trends in 3D Reconstruction of General Non-Rigid Scenes (Computer Graphics Forum 2024), A Survey on 3D Gaussian Splatting (TPAMI 2024)
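    As a rough illustration of this parameterization (not MVSplat's actual code), the per-Gaussian attributes could be grouped in a small container; the class and field names below are hypothetical.
    ```python
    # Sketch of the per-Gaussian parameters listed above (illustrative names only).
    from dataclasses import dataclass
    import torch

    @dataclass
    class Gaussian3D:
        mu: torch.Tensor     # (3,)   center position
        alpha: torch.Tensor  # ()     opacity in [0, 1]
        sigma: torch.Tensor  # (3, 3) covariance (built from scale and rotation in full 3DGS)
        color: torch.Tensor  # (3,)   RGB color (spherical-harmonic coefficients in full 3DGS)
    ```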
  8. 9 Feature Extraction
    Obtain cross-view aware features 𝐅ᵢ (i = 1, …, K) with Transformers (for the 𝐾 input views)
    [Slide figure: attention visualization, from SuperGlue (CVPR 2020)]
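    A minimal sketch of one cross-view attention block, assuming standard multi-head attention over flattened feature maps; the module name, dimensions, and residual layout are assumptions, not the paper's exact architecture.
    ```python
    import torch
    import torch.nn as nn

    class CrossViewBlock(nn.Module):
        """Toy cross-view attention: tokens of view i attend to tokens of view j."""
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, feat_i, feat_j):
            # feat_i, feat_j: (B, H*W, C) flattened per-view feature maps
            out, _ = self.attn(query=feat_i, key=feat_j, value=feat_j)
            return self.norm(feat_i + out)  # residual + norm, as in a standard Transformer
    ```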
  9. 10 Cost Volume Construction
    Obtain view i's (matching) cost volume 𝑪ᵢ • Warp view j's feature 𝑭ⱼ to view i with the camera projection matrices 𝑷ᵢ, 𝑷ⱼ at each depth candidate dₘ (m = 1, …, D) • Compute the correlation between 𝑭ᵢ and the warped features, normalized by the channel dimension • High cost = a surface point
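    A simplified plane-sweep version of this step: for each depth candidate, view j's features are warped into view i and correlated with view i's features, normalized by the channel dimension. The warp helper is assumed to exist (it would use 𝑷ᵢ and 𝑷ⱼ); names and shapes are illustrative.
    ```python
    import torch

    def build_cost_volume(feat_i, feat_j, depth_candidates, warp_to_view_i):
        # feat_i, feat_j: (B, C, H, W); depth_candidates: iterable of D scalar depths
        B, C, H, W = feat_i.shape
        costs = []
        for d in depth_candidates:
            warped_j = warp_to_view_i(feat_j, d)       # view j's features seen from view i at depth d
            corr = (feat_i * warped_j).sum(dim=1) / C  # dot-product correlation / channel dim
            costs.append(corr)                         # high correlation suggests a surface at depth d
        return torch.stack(costs, dim=1)               # (B, D, H, W) cost volume for view i
    ```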
  10. 11 Cost Volume Refinement
    Concatenate 𝑭ᵢ and 𝑪ᵢ → refined cost volume 𝑪̃ᵢ • Using a 2D U-Net with cross-view attention layers • Enhances quality in regions whose content is visible from only one view
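    A sketch of the refinement step, under the assumption that the network predicts a residual added to the raw cost volume; the U-Net itself (with its cross-view attention layers) is treated as a black box here.
    ```python
    import torch

    def refine_cost_volume(feat_i, cost_i, unet):
        # feat_i: (B, C, H, W) features; cost_i: (B, D, H, W) raw cost volume
        x = torch.cat([feat_i, cost_i], dim=1)  # channel-wise concatenation
        residual = unet(x)                      # 2D U-Net; assumed to output (B, D, H, W)
        return cost_i + residual                # refined cost volume
    ```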
  11. 12 Depth Estimation
    Obtain the per-view depth map (𝐻 × 𝑊): normalize the refined cost volume 𝑪̃ᵢ over the 𝐷 depth candidates, then take the weighted average of all depth candidates • The normalized cost also gives the matching confidence
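    The weighted average described above is standard softmax depth regression; a minimal sketch (tensor layout assumed):
    ```python
    import torch
    import torch.nn.functional as F

    def estimate_depth(refined_cost, depth_candidates):
        # refined_cost: (B, D, H, W); depth_candidates: (D,) tensor of candidate depths
        prob = F.softmax(refined_cost, dim=1)                       # normalize over the D candidates
        depth = (prob * depth_candidates.view(1, -1, 1, 1)).sum(1)  # (B, H, W) weighted average
        confidence = prob.max(dim=1).values                         # (B, H, W) matching confidence
        return depth, confidence
    ```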
  12. 13 Gaussian Parameters Prediction
    Gaussian position 𝝁: unproject the predicted depth 𝑽ᵢ • Opacity 𝛼: feed the matching confidence to two conv. layers • Covariance 𝜮 and color 𝒄: feed 𝑭ᵢ, 𝑪̃ᵢ, and the multi-view images to two conv. layers
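    Unprojecting a per-pixel depth map to 3D Gaussian centers, assuming a pinhole camera with known inverse intrinsics and camera-to-world pose (a generic sketch, not MVSplat's exact code):
    ```python
    import torch

    def unproject_depth(depth, K_inv, cam_to_world):
        # depth: (H, W); K_inv: (3, 3) inverse intrinsics; cam_to_world: (4, 4) pose
        # CPU tensors assumed for brevity.
        H, W = depth.shape
        v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3) homogeneous pixels
        rays = pix @ K_inv.T                                           # camera-space ray directions
        pts_cam = rays * depth.unsqueeze(-1)                           # scale rays by per-pixel depth
        pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)      # homogeneous 3D points
        mu = (pts_h.reshape(-1, 4) @ cam_to_world.T)[:, :3]            # (H*W, 3) Gaussian centers
        return mu
    ```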
  13. 14 Training Loss
    Supervised with only a rendering loss against the GT image • A linear combination of ℓ₂ and LPIPS losses • End-to-end differentiable learning
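    A sketch of such a rendering loss, using the `lpips` package as an example LPIPS implementation; the loss weight is a placeholder, not the paper's value.
    ```python
    import torch
    import lpips  # pip install lpips

    lpips_fn = lpips.LPIPS(net="vgg")  # example perceptual metric backbone

    def rendering_loss(rendered, target, w_lpips=0.05):
        # rendered, target: (B, 3, H, W) images scaled to [-1, 1] (the range LPIPS expects)
        l2 = torch.mean((rendered - target) ** 2)
        perceptual = lpips_fn(rendered, target).mean()
        return l2 + w_lpips * perceptual  # linear combination of l2 and LPIPS terms
    ```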
  14. 15 Experimental Setting
    Datasets • RealEstate10K • ACID
    Metrics • PSNR (pixel-level), SSIM (patch-level), LPIPS (feature-level) • Inference (rendering) time • Number of model parameters
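    For reference, the pixel-level metric can be computed directly; a minimal PSNR helper (SSIM and LPIPS require dedicated implementations):
    ```python
    import torch

    def psnr(pred, target, max_val=1.0):
        # Peak signal-to-noise ratio between two images with values in [0, max_val]
        mse = torch.mean((pred - target) ** 2)
        return 10.0 * torch.log10(max_val ** 2 / mse)
    ```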
  15. 16 Quantitative Results
    Comparison with the state of the art among feed-forward methods → surpasses all SOTA models in visual quality (note that MuRF is expensive to train)
  16. 20 Contributions
    Learns feed-forward Gaussians with a cost volume • Significantly outperforms pixelSplat in both quality and efficiency
    To be further explored • Training with large-scale datasets • Adaptation to dynamic scenes