MVSplat: Efficient 3D Gaussian Splatting
from Sparse Multi-View Images
Yuedong Chen et al. (ECCV 2024, Oral Presentation)
Presenter: Keio Univ. M1 Kazuki Ozeki
Slide 2
Objectives
1. What is cross-scene feed-forward inference?
2. How does MVSplat learn feed-forward 3D Gaussians?
3. Why is MVSplat important?
Slide 3
Abstract
Predicts feed-forward 3D Gaussians from sparse multi-view images
• Efficiently
• With high fidelity
(Comparison figure: Input → pixelSplat (CVPR 2024) vs. MVSplat: 10× fewer parameters, 2× faster inference, better geometry)
Slide 4
Sparse View Scene Reconstruction
Per-Scene Optimization
• Mainly design effective regularization terms
• Slow inference due to the per-scene gradient back-propagation
Cross-Scene Feed-Forward Inference
• Learn priors from large-scale datasets
• Infer 3D scenes in a single feed-forward pass (faster)
Slide 5
Related Works: MuRF
MuRF: Multi-Baseline Radiance Fields (CVPR 2024)
• A feed-forward NeRF model built on a target-view frustum volume
• Suffers from expensive training
MuRF: Multi-Baseline Radiance Fields
Slide 6
Related Works: pixelSplat
pixelSplat (CVPR 2024, Best Paper Runner-Up)
• The first feed-forward 3D Gaussian model
• Predicts a probability distribution over Gaussian positions
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
Slide 7
3D Gaussian Splatting
Represent 3D scenes with 3D Gaussians
• Position 𝝁
• Opacity 𝛼
• Covariance 𝜮
• Color 𝒄
Recent Trends in 3D Reconstruction of General Non-Rigid Scenes (Computer Graphics Forum 2024)
A Survey on 3D Gaussian Splatting (TPAMI 2024)
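For reference, a minimal sketch of the per-Gaussian parameter set above as a data container, assuming PyTorch tensors; the class name `GaussianScene` and the shapes are illustrative, not from the paper:

```python
# Minimal sketch of the 3D Gaussian parameters listed above
# (assumption: PyTorch tensors; container name and shapes are illustrative only).
from dataclasses import dataclass
import torch


@dataclass
class GaussianScene:
    means: torch.Tensor        # (N, 3)    position mu of each Gaussian
    opacities: torch.Tensor    # (N, 1)    opacity alpha in [0, 1]
    covariances: torch.Tensor  # (N, 3, 3) covariance Sigma (symmetric, positive semi-definite)
    colors: torch.Tensor       # (N, 3)    color c (RGB, or spherical-harmonics coefficients)
```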
Slide 8
Overview
1. Cost Volume Construction via Feature Matching
2. Gaussian Parameters Prediction
Slide 9
Feature Extraction
Obtain cross-view aware features {𝑭ᵢ}, i = 1, …, 𝐾, with Transformers
(for 𝐾 input views)
SuperGlue (CVPR2020)
(Visualization of attention)
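A rough sketch of the cross-view attention idea in PyTorch; the actual backbone is a multi-view Transformer with self- and cross-view attention, and the module below (`CrossViewAttention`) only illustrates one view's features attending to another view's:

```python
# Sketch of cross-view attention between two views' feature maps
# (assumption: a standard multi-head attention layer stands in for the backbone).
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_i: torch.Tensor, feat_j: torch.Tensor) -> torch.Tensor:
        # feat_i, feat_j: (B, C, H, W) feature maps of view i and view j
        b, c, h, w = feat_i.shape
        q = feat_i.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from view i
        kv = feat_j.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values from view j
        out, _ = self.attn(q, kv, kv)           # view i attends to view j
        out = feat_i.flatten(2).transpose(1, 2) + out  # residual connection
        return out.transpose(1, 2).reshape(b, c, h, w)
```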
Slide 10
Cost Volume Construction
Obtain view 𝑖’s (matching) cost volume 𝑪ᵢ
• Warp view 𝑗’s feature 𝑭ⱼ onto view 𝑖 at each depth candidate 𝑑ₘ, m = 1, …, 𝐷, using the camera projection matrices 𝑷ᵢ and 𝑷ⱼ
• Compute the correlation between 𝑭ᵢ and each warped feature, normalized by the channel dimension
High cost = a surface point
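A simplified plane-sweep sketch of this step in PyTorch. The helper `warp_to_view_i` (warping view 𝑗’s features onto view 𝑖 at a given depth using 𝑷ᵢ and 𝑷ⱼ) is assumed and not shown:

```python
# Sketch of the plane-sweep cost volume (assumes a warp_to_view_i helper
# that performs the warp with the camera projection matrices).
import torch


def build_cost_volume(feat_i, feat_j, P_i, P_j, depth_candidates, warp_to_view_i):
    # feat_i, feat_j: (B, C, H, W) per-view features; depth_candidates: D scalars
    b, c, h, w = feat_i.shape
    costs = []
    for d in depth_candidates:
        # Warp view j's features to view i assuming every pixel lies at depth d
        feat_j_to_i = warp_to_view_i(feat_j, P_i, P_j, d)    # (B, C, H, W)
        # Correlation between the feature maps, normalized by the channel dimension
        corr = (feat_i * feat_j_to_i).sum(dim=1) / c ** 0.5  # (B, H, W)
        costs.append(corr)
    return torch.stack(costs, dim=1)  # (B, D, H, W) cost volume for view i
```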
Slide 11
Cost Volume Refinement
Concatenate 𝑭ᵢ and 𝑪ᵢ → refined cost volume 𝑪̃ᵢ
• Using a 2D U-Net with cross-view attention layers
• Enhances quality for content that is visible from only one view
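A minimal sketch of this step, assuming PyTorch; `unet2d` stands in for the 2D U-Net with cross-view attention (not implemented here), and the residual-style update is an assumption of this sketch:

```python
# Sketch of cost volume refinement; unet2d is a placeholder for the 2D U-Net
# with cross-view attention layers (assumption: it maps the concatenated input
# to D channels, used here as a residual on the cost volume).
import torch


def refine_cost_volume(feat_i, cost_i, unet2d):
    # feat_i: (B, C, H, W) cross-view features; cost_i: (B, D, H, W) cost volume
    x = torch.cat([feat_i, cost_i], dim=1)  # concatenate along the channel axis
    residual = unet2d(x)                    # (B, D, H, W)
    return cost_i + residual                # refined cost volume
```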
Slide 12
Depth Estimation
Obtain per-view depth from the refined cost volume 𝑪̃ᵢ (𝐻 × 𝑊 × 𝐷)
• Normalized matching cost × all depth candidates → a weighted average
• The normalized cost also provides the matching confidence
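A sketch of the weighted-average depth regression in PyTorch, assuming the cost is normalized with a softmax over the 𝐷 candidates and the maximum probability is used as a simple confidence proxy (both are assumptions of this sketch):

```python
# Sketch of per-view depth estimation: softmax-normalized cost as weights for a
# weighted average over depth candidates; max probability as matching confidence.
import torch


def estimate_depth(refined_cost, depth_candidates):
    # refined_cost: (B, D, H, W); depth_candidates: (D,) tensor of candidate depths
    prob = torch.softmax(refined_cost, dim=1)                   # normalized cost
    depth = (prob * depth_candidates.view(1, -1, 1, 1)).sum(1)  # (B, H, W) weighted average
    confidence = prob.max(dim=1).values                         # (B, H, W) matching confidence
    return depth, confidence
```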
Slide 13
Gaussian Parameters Prediction
Gaussian position 𝝁: Unproject the predicted per-view depth 𝑽ᵢ
Opacity 𝛼: Feed the matching confidence to two convolutional layers
Covariance 𝜮 and color 𝒄: Feed 𝑭ᵢ, 𝑪̃ᵢ, and the multi-view images to two convolutional layers
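A minimal sketch of the position step, unprojecting a per-view depth map into Gaussian centers 𝝁, assuming pinhole intrinsics K and a camera-to-world pose (both hypothetical inputs here); the opacity, covariance, and color heads are omitted:

```python
# Sketch of unprojecting a depth map into 3D Gaussian centers (assumes pinhole
# intrinsics K and a 4x4 camera-to-world pose).
import torch


def unproject_depth(depth, K, cam_to_world):
    # depth: (H, W); K: (3, 3); cam_to_world: (4, 4)
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3) homogeneous pixels
    rays = pix @ torch.inverse(K).T                                # camera-space directions (z = 1)
    pts_cam = rays * depth.unsqueeze(-1)                           # (H, W, 3) camera-space points
    pts_h = torch.cat([pts_cam, torch.ones_like(depth)[..., None]], dim=-1)
    pts_world = (pts_h.reshape(-1, 4) @ cam_to_world.T)[:, :3]     # (H*W, 3) Gaussian centers mu
    return pts_world
```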
Slide 14
Training Loss
Supervised with only a rendering loss
• A linear combination of ℓ₂ and LPIPS losses
• End-to-end differentiable learning
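A sketch of the loss in PyTorch using the lpips package; the weighting factor below is illustrative, not the paper's exact value:

```python
# Sketch of the rendering loss: l2 (MSE) plus a weighted LPIPS term
# (assumption: images scaled to [-1, 1] for LPIPS; weight 0.05 is illustrative).
import torch
import lpips

lpips_fn = lpips.LPIPS(net="vgg")


def rendering_loss(rendered, gt, lpips_weight=0.05):
    # rendered, gt: (B, 3, H, W) rendered and ground-truth target views
    l2 = torch.mean((rendered - gt) ** 2)       # pixel-wise l2 term
    perceptual = lpips_fn(rendered, gt).mean()  # LPIPS perceptual term
    return l2 + lpips_weight * perceptual
```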
Slide 15
Experimental Setting
Datasets
• RealEstate10K (left)
• ACID (right)
Metrics
• PSNR (pixel-level), SSIM (patch-level), LPIPS (feature-level)
• Inference (rendering) time
• Number of model parameters
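For reference, a small sketch of the pixel-level PSNR metric, assuming images in [0, 1]; SSIM and LPIPS are typically taken from library implementations:

```python
# Sketch of PSNR (pixel-level metric); higher is better.
import torch


def psnr(pred, gt, max_val=1.0):
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```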
Slide 16
Quantitative Results
Comparison with state-of-the-art feed-forward methods
→ Surpasses all of them in visual quality
(Note that MuRF is expensive to train)
Slide 17
Qualitative Results
Comparison with the top three baseline models
(only two input views!)
Slide 18
Geometry Reconstruction
Reconstruct much higher-quality 3D Gaussians
(Trained solely with photometric supervision!)
Slide 19
Cross-dataset Generalization
Zero-shot test w/o any fine-tuning
→ Surpasses pixelSplat (but appearance quality is still limited)
Slide 20
Contributions
Learns feed-forward 3D Gaussians with a cost volume
Significantly outperforms pixelSplat in both quality and efficiency
Directions for further exploration
• Training with large-scale datasets
• Adaptation to dynamic scenes