MVSplat: Efficient 3D Gaussian Splatting
from Sparse Multi-View Images
Yuedong Chen et al. (ECCV 2024, Oral Presentation)
Presenter: Keio Univ. M1 Kazuki Ozeki
Slide 2
Objectives
1. What is cross-scene feed-forward inference?
2. How does MVSplat learn feed-forward 3D Gaussians?
3. Why is MVSplat important?
Slide 3
Abstract
Predicts feed-forward 3D Gaussians from sparse multi-view images
• Efficiently
• With high fidelity
(Comparison figure: Input → pixelSplat (CVPR 2024) vs. MVSplat: 10× fewer parameters, 2× faster inference, better geometry)
Slide 4
Sparse View Scene Reconstruction
Per-Scene Optimization
• Mainly design effective regularization terms
• Slow inference due to the per-scene gradient back-propagation
Cross-Scene Feed-Forward Inference
• Learn priors from large-scale datasets
• Infer 3D scenes in a single feed-forward pass (faster)
Slide 5
Related Works: MuRF
MuRF: Multi-Baseline Radiance Fields (CVPR 2024)
• A feed-forward NeRF model built on a target-view frustum volume
• Suffers from expensive training
MuRF: Multi-Baseline Radiance Fields
Slide 6
Related Works: pixelSplat
pixelSplat (CVPR 2024, Best Paper Runner-Up)
• The first feed-forward 3D Gaussian model
• Predicts a probability distribution over Gaussian positions
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
Slide 7
3D Gaussian Splatting
Represent 3D scenes with 3D Gaussians
• Position 𝝁
• Opacity 𝛼
• Covariance 𝜮
• Color 𝒄
Recent Trends in 3D Reconstruction of General Non-Rigid Scenes (Computer Graphics Forum 2024)
A Survey on 3D Gaussian Splatting (TPAMI 2024)
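For reference, a minimal sketch of the per-Gaussian parameter set above as a data container, assuming PyTorch tensors; the class name `GaussianScene` and the shapes are illustrative, not from the paper:

```python
# Minimal sketch of the 3D Gaussian parameters listed above
# (assumption: PyTorch tensors; container name and shapes are illustrative only).
from dataclasses import dataclass
import torch


@dataclass
class GaussianScene:
    means: torch.Tensor        # (N, 3)    position mu of each Gaussian
    opacities: torch.Tensor    # (N, 1)    opacity alpha in [0, 1]
    covariances: torch.Tensor  # (N, 3, 3) covariance Sigma (symmetric, positive semi-definite)
    colors: torch.Tensor       # (N, 3)    color c (RGB, or spherical-harmonics coefficients)
```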
Slide 8
Overview
1. Cost Volume Construction via Feature Matching
2. Gaussian Parameters Prediction
Slide 9
Feature Extraction
Obtain cross-view aware features {𝑭ᵢ}, i = 1, …, 𝐾, with Transformers
(for 𝐾 input views)
SuperGlue (CVPR2020)
(Visualization of attention)
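A rough sketch of the cross-view attention idea in PyTorch; the actual backbone is a multi-view Transformer with self- and cross-view attention, and the module below (`CrossViewAttention`) only illustrates one view's features attending to another view's:

```python
# Sketch of cross-view attention between two views' feature maps
# (assumption: a standard multi-head attention layer stands in for the backbone).
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_i: torch.Tensor, feat_j: torch.Tensor) -> torch.Tensor:
        # feat_i, feat_j: (B, C, H, W) feature maps of view i and view j
        b, c, h, w = feat_i.shape
        q = feat_i.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from view i
        kv = feat_j.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values from view j
        out, _ = self.attn(q, kv, kv)           # view i attends to view j
        out = feat_i.flatten(2).transpose(1, 2) + out  # residual connection
        return out.transpose(1, 2).reshape(b, c, h, w)
```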
Slide 10
Cost Volume Construction
Obtain view 𝑖’s (matching) cost volume 𝑪ᵢ
• Warp view 𝑗’s feature 𝑭ⱼ onto view 𝑖 at each depth candidate 𝑑ₘ, m = 1, …, 𝐷, using the camera projection matrices 𝑷ᵢ and 𝑷ⱼ
• Compute the correlation between 𝑭ᵢ and each warped feature, normalized by the channel dimension
High cost = a surface point
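A simplified plane-sweep sketch of this step in PyTorch. The helper `warp_to_view_i` (warping view 𝑗’s features onto view 𝑖 at a given depth using 𝑷ᵢ and 𝑷ⱼ) is assumed and not shown:

```python
# Sketch of the plane-sweep cost volume (assumes a warp_to_view_i helper
# that performs the warp with the camera projection matrices).
import torch


def build_cost_volume(feat_i, feat_j, P_i, P_j, depth_candidates, warp_to_view_i):
    # feat_i, feat_j: (B, C, H, W) per-view features; depth_candidates: D scalars
    b, c, h, w = feat_i.shape
    costs = []
    for d in depth_candidates:
        # Warp view j's features to view i assuming every pixel lies at depth d
        feat_j_to_i = warp_to_view_i(feat_j, P_i, P_j, d)    # (B, C, H, W)
        # Correlation between the feature maps, normalized by the channel dimension
        corr = (feat_i * feat_j_to_i).sum(dim=1) / c ** 0.5  # (B, H, W)
        costs.append(corr)
    return torch.stack(costs, dim=1)  # (B, D, H, W) cost volume for view i
```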
Slide 11
Cost Volume Refinement
Concatenate 𝑭ᵢ and 𝑪ᵢ → refined cost volume 𝑪̃ᵢ
• Using a 2D U-Net with cross-view attention layers
• Enhances quality for content that is visible from only one view
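A minimal sketch of this step, assuming PyTorch; `unet2d` stands in for the 2D U-Net with cross-view attention (not implemented here), and the residual-style update is an assumption of this sketch:

```python
# Sketch of cost volume refinement; unet2d is a placeholder for the 2D U-Net
# with cross-view attention layers (assumption: it maps the concatenated input
# to D channels, used here as a residual on the cost volume).
import torch


def refine_cost_volume(feat_i, cost_i, unet2d):
    # feat_i: (B, C, H, W) cross-view features; cost_i: (B, D, H, W) cost volume
    x = torch.cat([feat_i, cost_i], dim=1)  # concatenate along the channel axis
    residual = unet2d(x)                    # (B, D, H, W)
    return cost_i + residual                # refined cost volume
```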
Slide 12
Depth Estimation
Obtain per-view depth from the refined cost volume 𝑪̃ᵢ (𝐻 × 𝑊 × 𝐷)
• Normalized matching cost × all depth candidates → a weighted average
• The normalized cost also provides the matching confidence
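A sketch of the weighted-average depth regression in PyTorch, assuming the cost is normalized with a softmax over the 𝐷 candidates and the maximum probability is used as a simple confidence proxy (both are assumptions of this sketch):

```python
# Sketch of per-view depth estimation: softmax-normalized cost as weights for a
# weighted average over depth candidates; max probability as matching confidence.
import torch


def estimate_depth(refined_cost, depth_candidates):
    # refined_cost: (B, D, H, W); depth_candidates: (D,) tensor of candidate depths
    prob = torch.softmax(refined_cost, dim=1)                   # normalized cost
    depth = (prob * depth_candidates.view(1, -1, 1, 1)).sum(1)  # (B, H, W) weighted average
    confidence = prob.max(dim=1).values                         # (B, H, W) matching confidence
    return depth, confidence
```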
Slide 13
Gaussian Parameters Prediction
Gaussian position 𝝁: Unproject the predicted per-view depth 𝑽ᵢ
Opacity 𝛼: Feed the matching confidence to two convolutional layers
Covariance 𝜮 and color 𝒄: Feed 𝑭ᵢ, 𝑪̃ᵢ, and the multi-view images to two convolutional layers
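A minimal sketch of the position step, unprojecting a per-view depth map into Gaussian centers 𝝁, assuming pinhole intrinsics K and a camera-to-world pose (both hypothetical inputs here); the opacity, covariance, and color heads are omitted:

```python
# Sketch of unprojecting a depth map into 3D Gaussian centers (assumes pinhole
# intrinsics K and a 4x4 camera-to-world pose).
import torch


def unproject_depth(depth, K, cam_to_world):
    # depth: (H, W); K: (3, 3); cam_to_world: (4, 4)
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3) homogeneous pixels
    rays = pix @ torch.inverse(K).T                                # camera-space directions (z = 1)
    pts_cam = rays * depth.unsqueeze(-1)                           # (H, W, 3) camera-space points
    pts_h = torch.cat([pts_cam, torch.ones_like(depth)[..., None]], dim=-1)
    pts_world = (pts_h.reshape(-1, 4) @ cam_to_world.T)[:, :3]     # (H*W, 3) Gaussian centers mu
    return pts_world
```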
Slide 14
Training Loss
Supervised with only a rendering loss
• A linear combination of ℓ₂ and LPIPS losses
• End-to-end differentiable learning
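A sketch of the loss in PyTorch using the lpips package; the weighting factor below is illustrative, not the paper's exact value:

```python
# Sketch of the rendering loss: l2 (MSE) plus a weighted LPIPS term
# (assumption: images scaled to [-1, 1] for LPIPS; weight 0.05 is illustrative).
import torch
import lpips

lpips_fn = lpips.LPIPS(net="vgg")


def rendering_loss(rendered, gt, lpips_weight=0.05):
    # rendered, gt: (B, 3, H, W) rendered and ground-truth target views
    l2 = torch.mean((rendered - gt) ** 2)       # pixel-wise l2 term
    perceptual = lpips_fn(rendered, gt).mean()  # LPIPS perceptual term
    return l2 + lpips_weight * perceptual
```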
Slide 15
Experimental Setting
Datasets
• RealEstate10K (left)
• ACID (right)
Metrics
• PSNR (pixel-level), SSIM (patch-level), LPIPS (feature-level)
• Inference (rendering) time
• Number of model parameters
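For reference, a small sketch of the pixel-level PSNR metric, assuming images in [0, 1]; SSIM and LPIPS are typically taken from library implementations:

```python
# Sketch of PSNR (pixel-level metric); higher is better.
import torch


def psnr(pred, gt, max_val=1.0):
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```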
Slide 16
Quantitative Results
Comparison with state-of-the-art feed-forward methods
→ Surpasses all of them in visual quality
(Note that MuRF is expensive to train)
Slide 17
Qualitative Results
Comparison with the top three baseline models
(only two input views!)
Slide 18
Geometry Reconstruction
Reconstruct much higher-quality 3D Gaussians
(Trained solely with photometric supervision!)
Slide 19
Cross-dataset Generalization
Zero-shot test w/o any fine-tuning
→ Surpasses pixelSplat (but appearance quality is still limited)
Slide 20
Contributions
Learns feed-forward 3D Gaussians with a cost volume
Significantly outperforms pixelSplat in both quality and efficiency
Directions for further exploration
• Training with large-scale datasets
• Adaptation to dynamic scenes