
# Optimal Transport and Deep Generative Models

Talk given at "Rencontre Maths-Industrie", June 20, 2017.

## Transcript

1. Optimal Transport and Deep Generative Models
Gabriel Peyré, École Normale Supérieure
www.numerical-tours.com
Joint works with Aude Genevay and Marco Cuturi.

2. Optimal Transport: Theory to Applications
Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani.
[Figure: the transport framework, sliced Wasserstein projection, and applications; an application to color transfer: source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of Y, source image after color transfer. Credit: J. Rabin, Wasserstein regularization.]

3. Overview
Density Fitting vs. Auto-Encoders
Discriminative vs. Generative Models
[Diagram: generator $g_\theta : Z \to X$ with $\mu_\theta = g_{\theta\sharp}\zeta$; discriminator $d_\xi$.]

4-8. Discriminative vs Generative Models
[Diagram: latent space $Z$ (low dimension) and data space $X$ (high dimension), linked by a generative map $g_\theta : Z \to X$ and a discriminative map $d_\xi : X \to Z$.]
Supervised (classification, $z$ = class probability):
→ learn $d_\xi$ from labeled data $(x_i, z_i)_i$.
Unsupervised:
→ learn $(g_\theta, d_\xi)$ from data $(x_i)_i$ alone.
Compression: $z = d_\xi(x)$ is a representation.
Generation: $x = g_\theta(z)$ is a synthesis.
Two training principles (contrasted in the sketch below):
Density fitting: $g_\theta(\{z_i\}_i) \approx \{x_i\}_i$.
Auto-encoders: $g_\theta(d_\xi(x_i)) \approx x_i$.
In the density-fitting setting, an optimal transport map can play the role of $d_\xi$.
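
To make the contrast concrete, here is a minimal NumPy sketch (an editorial illustration, not from the talk): the linear maps `g` and `d`, and the nearest-neighbour point-cloud discrepancy standing in for a proper distribution metric, are assumptions chosen for brevity.

```python
# Minimal sketch: the auto-encoder loss matches samples one by one,
# while density fitting compares generated and observed point clouds.
import numpy as np

rng = np.random.default_rng(0)
n, dim_z, dim_x = 200, 2, 10               # low-dim latent Z, high-dim data X

A = rng.standard_normal((dim_x, dim_z))
g = lambda z: z @ A.T                      # generative map  g_theta : Z -> X
d = lambda x: x @ np.linalg.pinv(A).T      # discriminative map d_xi : X -> Z

x = rng.standard_normal((n, dim_z)) @ A.T  # observations x_i (near a 2-D plane)
z = rng.standard_normal((n, dim_z))        # latent samples z_i ~ zeta

# Auto-encoders: g_theta(d_xi(x_i)) ~ x_i (per-sample reconstruction).
autoencoder_loss = np.mean(np.sum((g(d(x)) - x) ** 2, axis=1))

# Density fitting: g_theta({z_i}) ~ {x_i} as point clouds; a crude average
# nearest-neighbour distance stands in for a proper OT metric here.
D = np.linalg.norm(g(z)[:, None, :] - x[None, :, :], axis=2)
density_fit_proxy = D.min(axis=1).mean()

print(f"auto-encoder loss : {autoencoder_loss:.4f}")
print(f"density-fit proxy : {density_fit_proxy:.4f}")
```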

9-10. Deep Discriminative vs Generative Models
Deep networks: both maps are compositions of affine layers (parameters $\theta_k$, resp. $\xi_k$) with a pointwise nonlinearity $\rho$:
$$g_\theta(z) = \rho(\theta_K(\cdots \rho(\theta_2(\rho(\theta_1(z)))) \cdots)),$$
$$d_\xi(x) = \rho(\xi_K(\cdots \rho(\xi_2(\rho(\xi_1(x)))) \cdots)).$$
[Diagram: intermediate representations $z_1, z_2, \ldots$ between $Z$ and $X$.]
A small sketch of these layered maps follows below.
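
As a sketch under assumed layer sizes and $\rho = \mathrm{ReLU}$ (choices the talk does not specify):

```python
# Layered maps g_theta and d_xi: alternating affine layers and a
# pointwise nonlinearity rho, as in the slide's formula.
import numpy as np

rng = np.random.default_rng(1)
rho = lambda u: np.maximum(u, 0.0)      # pointwise nonlinearity (assumed ReLU)

def make_layers(sizes):
    """One affine parameter pair (W_k, b_k) per layer."""
    return [(0.1 * rng.standard_normal((m, p)), np.zeros(m))
            for p, m in zip(sizes[:-1], sizes[1:])]

def apply_net(params, v):
    """v -> rho(W_K(... rho(W_2 rho(W_1 v + b_1) + b_2) ...) + b_K)."""
    for W, b in params:
        v = rho(W @ v + b)
    return v

theta = make_layers([2, 32, 32, 10])    # generator      g_theta : Z -> X
xi    = make_layers([10, 32, 32, 2])    # discriminator  d_xi    : X -> Z

z = rng.standard_normal(2)
x = apply_net(theta, z)                 # g_theta(z), a point of X
print(apply_net(xi, x))                 # d_xi(x), back in Z
```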

11. Examples of Image Generation
[Figure: images synthesized by a deep generative model $g_\theta : Z \to X$. Credit: arXiv:1511.06434.]

12. Overview
Density Fitting vs. Auto-Encoders
Discriminative vs. Generative Models
[Same diagram as slide 3: $g_\theta : Z \to X$, $\mu_\theta = g_{\theta\sharp}\zeta$, $d_\xi$.]

13-15. Density Fitting vs. Generative Models
Parametric model: $\theta \mapsto \mu_\theta$.
Observations: the empirical measure $\nu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$.
Density fitting, when $\mu_\theta$ has a density $\mathrm{d}\mu_\theta(y) = f_\theta(y)\,\mathrm{d}y$: maximum likelihood (MLE),
$$\min_\theta \mathrm{KL}(\mu_\theta\,|\,\nu) \overset{\text{def.}}{=} -\sum_i \log f_\theta(x_i).$$
Generative model fit: $\mu_\theta = g_{\theta\sharp}\zeta$ is concentrated on a low-dimensional set, so it has no density and $\mathrm{KL}(\mu_\theta\,|\,\nu) = +\infty$:
→ the MLE is undefined;
→ a weaker metric is needed (see the numerical illustration below).
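
A small numerical illustration of this dichotomy, with an assumed Gaussian model and toy pushforward (not part of the slides):

```python
# MLE works when mu_theta has a density f_theta; a deterministic
# pushforward g_theta#zeta concentrates on a low-dimensional set,
# has no density, and makes the likelihood degenerate.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=0.7, size=500)   # observations x_i

# Gaussian density model f_theta = N(m, s^2): the MLE has a closed form.
m_hat, s_hat = x.mean(), x.std()
nll = np.sum(0.5 * np.log(2 * np.pi * s_hat**2)
             + (x - m_hat) ** 2 / (2 * s_hat**2))
print(f"MLE fit: m = {m_hat:.3f}, s = {s_hat:.3f}, NLL = {nll:.1f}")

# Pushforward model in R^2: g_theta maps the 1-D latent z onto a curve,
# so g_theta#zeta has zero area in R^2 -- no density f_theta exists and
# the log-likelihood of generic 2-D data is -infinity (KL = +infinity).
g = lambda z: np.stack([z, z ** 2], axis=1)
print("g#zeta lives on a 1-D curve:", g(np.linspace(-1, 1, 3)).tolist())
```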

16-18. Comparing Measures and Spaces
Probability distributions and histograms are ubiquitous:
→ images, vision, graphics, machine learning, ...
[Figure: the color-transfer example again (source image X, style image Y, sliced Wasserstein projection of X onto the color statistics of Y, result). Credit: J. Rabin, Wasserstein regularization.]
[Figure: optimal transport mean vs. $L^2$ mean of two input measures.]
• Optimal transport:
→ well defined for discrete or singular distributions (a "weak" metric).
A 1-D comparison of the two means is sketched below.
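
As an assumed 1-D example of the distinction (for empirical measures on $\mathbb{R}$, the $W_2$ barycenter averages sorted samples, i.e. quantiles, while the $L^2$ mean is the half-and-half mixture):

```python
# OT mean vs L2 mean of two 1-D empirical measures: the OT mean is one
# displaced bump, the L2 mean keeps both modes.
import numpy as np

rng = np.random.default_rng(3)
a = np.sort(rng.normal(-3.0, 0.5, 1000))   # samples of mu, sorted
b = np.sort(rng.normal(+3.0, 0.5, 1000))   # samples of nu, sorted

ot_mean = 0.5 * (a + b)                    # quantile average: bump near 0
l2_mean = np.concatenate([a, b])           # mixture: bumps at -3 and +3

print(f"OT mean: mean {ot_mean.mean():+.2f}, std {ot_mean.std():.2f}")
print(f"L2 mean: mean {l2_mean.mean():+.2f}, std {l2_mean.std():.2f}")
```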

19-21. Probability Measures and Couplings
Marginals of $\pi \in \mathcal{M}_+(X \times X)$:
$$P_{1\sharp}\pi(S) \overset{\text{def.}}{=} \pi(S, X), \qquad P_{2\sharp}\pi(S) \overset{\text{def.}}{=} \pi(X, S).$$
Couplings:
$$\Pi(\mu, \nu) \overset{\text{def.}}{=} \{\pi \in \mathcal{M}_+(X \times X)\;;\; P_{1\sharp}\pi = \mu,\; P_{2\sharp}\pi = \nu\}.$$
[Figures: discrete, semi-discrete, and continuous couplings.]
A discrete example follows below.
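
In the discrete case a coupling is just a nonnegative matrix with prescribed row and column sums; a minimal check, with assumed weights:

```python
# Discrete couplings: pi is a nonnegative matrix whose row sums give the
# first marginal mu and whose column sums give the second marginal nu.
import numpy as np

a = np.array([0.2, 0.5, 0.3])          # weights of mu = sum_i a_i delta_{x_i}
b = np.array([0.4, 0.6])               # weights of nu = sum_j b_j delta_{y_j}

P = np.outer(a, b)                     # the product coupling mu (x) nu

assert np.allclose(P.sum(axis=1), a)   # P_{1#} pi = mu
assert np.allclose(P.sum(axis=0), b)   # P_{2#} pi = nu
print(P)                               # one element of Pi(mu, nu)
```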

22-24. Optimal Transport
Optimal transport [Kantorovich, 1942]:
$$W_p^p(\mu, \nu) \overset{\text{def.}}{=} \min_\pi \left\{ \langle d^p, \pi\rangle = \int_{X \times X} d(x, y)^p \,\mathrm{d}\pi(x, y) \;;\; \pi \in \Pi(\mu, \nu) \right\}.$$
→ $W_p$ is a distance on $\mathcal{M}_+(X)$.
→ $W_p$ works for singular distributions: $W_p(\delta_x, \delta_y) = d(x, y) \to 0$ as $x \to y$.
Minimum Kantorovich estimator [Bassetti et al., 2006]:
$$\min_\theta W_p(\mu_\theta, \nu), \qquad \text{generative model: } \min_\theta W_p(g_{\theta\sharp}\zeta, \nu).$$
Algorithms: [Montavon et al., 2016], [Bernton et al., 2017], [Genevay et al., 2017].
The discrete inner problem is a linear program; a sketch follows below.
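
A sketch of the discrete Kantorovich problem solved as a linear program with SciPy (an editorial tool choice; the talk points instead to the dedicated algorithms cited above):

```python
# W_p^p between two discrete measures, as the linear program
#   min <C, pi>  s.t.  pi >= 0,  pi 1 = a,  pi^T 1 = b.
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0, 2.0]); a = np.array([0.2, 0.5, 0.3])  # mu
y = np.array([0.5, 2.5]);      b = np.array([0.4, 0.6])       # nu
p = 2
C = np.abs(x[:, None] - y[None, :]) ** p   # costs d(x_i, y_j)^p

n, m = C.shape
A_eq = np.zeros((n + m, n * m))            # marginal constraints on pi_ij
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1         # sum_j pi_ij = a_i
for j in range(m):
    A_eq[n + j, j::m] = 1                  # sum_i pi_ij = b_j

res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
print("W_p^p(mu, nu) =", res.fun)
print("optimal coupling pi:\n", res.x.reshape(n, m).round(3))
```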

25-28. From OT to VAE
Rewrite the minimum Kantorovich estimator for $\mu_\theta = g_{\theta\sharp}\zeta$ by pulling the coupling back to $Z \times X$:
$$\min_\theta W_p^p(g_{\theta\sharp}\zeta, \nu) = \min_{\theta, \pi} \left\{ \int_{X \times X} d(x, y)^p \,\mathrm{d}\pi(x, y) \;;\; \pi \in \Pi(\mu_\theta, \nu) \right\}$$
$$= \min_{\theta, \pi} \left\{ \int_{Z \times X} d(g_\theta(z), y)^p \,\mathrm{d}\pi(z, y) \;;\; \pi \in \Pi(\zeta, \nu) \right\}.$$
Approximating the coupling by a deterministic encoder, $\pi \approx \frac{1}{n}\sum_i \delta_{(d_\xi(x_i),\, x_i)}$ [Bousquet et al., 2017, arXiv:1705.07642]:
$$\approx \min_{\theta, \xi} \left\{ \sum_i d(g_\theta(d_\xi(x_i)), x_i)^p \;;\; d_{\xi\sharp}\nu \approx \zeta \right\}.$$
This is an auto-encoder objective, $g_\theta \circ d_\xi \approx \mathrm{Id}$ on the data: Variational Auto-Encoders [Kingma and Welling, 2013]. A penalized training sketch follows below.
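
A minimal PyTorch sketch of this relaxed objective (the architecture, $p = 2$, and the moment-matching penalty standing in for the constraint $d_{\xi\sharp}\nu \approx \zeta$ are all editorial assumptions, not the talk's algorithm):

```python
# min_{theta, xi} sum_i d(g_theta(d_xi(x_i)), x_i)^2, with a crude
# penalty pushing the encoded latents d_xi(x_i) toward the prior zeta.
import torch

torch.manual_seed(0)
dim_x, dim_z, lam = 10, 2, 1.0

g = torch.nn.Sequential(torch.nn.Linear(dim_z, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, dim_x))           # g_theta
d = torch.nn.Sequential(torch.nn.Linear(dim_x, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, dim_z))           # d_xi
opt = torch.optim.Adam([*g.parameters(), *d.parameters()], lr=1e-3)

data = torch.randn(512, dim_z) @ torch.randn(dim_z, dim_x)    # toy nu

for step in range(500):
    z = d(data)                                 # encode: z_i = d_xi(x_i)
    recon = ((g(z) - data) ** 2).sum(1).mean()  # d(g_theta(d_xi(x)), x)^2
    prior = z.mean(0).pow(2).sum() + (z.std(0) - 1).pow(2).sum()
    loss = recon + lam * prior                  # relaxed d_xi#nu ~ zeta
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final reconstruction error: {recon.item():.4f}")
```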

29-31. Conclusion
Unsupervised learning:
→ learning a (generator, discriminator) pair [Goodfellow et al., 2014];
→ also an OT-like problem! [Arjovsky et al., 2017, arXiv:1701.07875]