Gabriel Peyré
June 20, 2017

Optimal Transport and Deep Generative Models

Talk at "Rencontre Maths-Industrie"

Transcript

1. Optimal Transport and Deep Generative Models. Gabriel Peyré, www.numerical-tours.com. Joint work with Aude Genevay and Marco Cuturi. École Normale Supérieure.
2. Optimal Transport: Theory to Applications. Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani. [Figure: the optimal transport framework applied to color transfer — sliced Wasserstein projection of a source image (X) onto the color statistics of a style image (Y), and the source image after color transfer. J. Rabin, Wasserstein regularization.]
3. Overview: Density Fitting vs. Auto-Encoders; Discriminative vs. Generative Models. [Diagram: generator g_θ : Z → X with model measure μ_θ = g_θ♯ζ ≈ ν, encoder d_ξ.]
4–8. Discriminative vs. Generative Models. Latent space Z (low dimension), data space X (high dimension): a generative model is a map g_θ : Z → X, a discriminative model a map d_ξ : X → Z (e.g. classification, where z is a class probability).
Supervised: → learn d_ξ from labeled data (x_i, z_i)_i.
Unsupervised: → learn the pair (g_θ, d_ξ) from data (x_i)_i. Compression: z = d_ξ(x) is a representation. Generation: x = g_θ(z) is a synthesis.
Two routes: density fitting, g_θ({z_i}_i) ≈ {x_i}_i, vs. auto-encoders, g_θ(d_ξ(x_i)) ≈ x_i; the encoder d_ξ plays the role of an optimal transport map.
9–10. Deep Discriminative vs. Generative Models. Both maps are deep networks, compositions of parametric layers θ_k (resp. ξ_k) with a pointwise nonlinearity ρ:
g_θ(z) = ρ(θ_K(... ρ(θ_2(ρ(θ_1(z)))) ...)),
d_ξ(x) = ρ(ξ_K(... ρ(ξ_2(ρ(ξ_1(x)))) ...)).
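As a concrete reading of the composition g_θ(z) = ρ(θ_K(... ρ(θ_1(z)) ...)), here is a minimal numpy sketch of such a generator. The layer sizes and the ReLU choice for ρ are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = lambda u: np.maximum(u, 0.0)  # pointwise nonlinearity rho (ReLU, assumed)

def make_layer(n_in, n_out):
    """One parametric layer theta_k: u -> W @ u + b."""
    W = rng.standard_normal((n_out, n_in)) * 0.1
    b = np.zeros(n_out)
    return lambda u: W @ u + b

# Generator g_theta: low-dimensional z in Z -> high-dimensional x in X
layers = [make_layer(2, 16), make_layer(16, 16), make_layer(16, 64)]

def g(z):
    for theta_k in layers:
        z = rho(theta_k(z))
    return z

x = g(rng.standard_normal(2))  # a synthesized point in R^64
```

The discriminative map d_ξ has the same structure with the arrows reversed (high-dimensional input, low-dimensional output).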

12. Overview: Density Fitting vs. Auto-Encoders; Discriminative vs. Generative Models.
13–15. Density Fitting vs. Generative Models. Parametric model: θ ↦ μ_θ. Observations: the empirical measure ν = (1/n) Σ_{i=1}^n δ_{x_i}.
Density fitting by maximum likelihood (MLE): min_θ KL(μ_θ | ν) := −Σ_i log(f_θ(x_i)), where dμ_θ(y) = f_θ(y) dy.
Generative model fit: μ_θ = g_θ♯ζ, the pushforward of a reference measure ζ on Z by the generator g_θ. Such a μ_θ has no density, so KL(μ_θ | ν) = +∞ → the MLE is undefined. → Need a weaker metric.
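When μ_θ does have a density, the MLE above is well posed. A minimal sketch for a 1-D Gaussian family f_θ with θ = (m, s), where the MLE has the closed form m = sample mean, s = sample standard deviation (the Gaussian family and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)  # observations (x_i)

def nll(m, s, x):
    """Negative log-likelihood -sum_i log f_theta(x_i) for theta = (m, s)."""
    return 0.5 * np.sum(((x - m) / s) ** 2) + x.size * np.log(s * np.sqrt(2.0 * np.pi))

# Closed-form maximum likelihood estimate for the Gaussian family
m_hat, s_hat = x.mean(), x.std()
```

Any perturbation of (m_hat, s_hat) increases the negative log-likelihood, which is exactly what fails when μ_θ is a singular pushforward: no density f_θ exists to evaluate.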
16–18. Comparing Measures and Spaces.
• Probability distributions and histograms → images, vision, graphics, machine learning, ...
• Optimal transport → the OT mean of two measures displaces mass (a single interpolated bump), whereas the L2 mean is just their mixture.
→ OT is well defined for discrete or singular distributions ("weak" metric).
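The OT mean vs. L2 mean contrast can be seen on the line: in 1-D the Wasserstein mean is obtained by averaging quantile functions (sorted samples), while the L2 mean of the measures is the mixture. A minimal sketch with two point masses (the sample-based construction is an illustrative assumption):

```python
import numpy as np

# Empirical samples of mu = delta_0 and nu = delta_2
x = np.zeros(5)
y = np.full(5, 2.0)

# L2 mean of the two measures: the mixture (mu + nu)/2 --
# half the mass stays at 0 and half at 2 (two separate bumps)
l2_mean_support = np.concatenate([x, y])

# OT (Wasserstein) mean in 1-D: average the quantile functions,
# i.e. the sorted samples -- all the mass moves to the midpoint 1
ot_mean = 0.5 * np.sort(x) + 0.5 * np.sort(y)
```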
19–21. Probability Measures and Couplings.
Marginals: P_1♯π(S) := π(S, X), P_2♯π(S) := π(X, S).
Couplings: Π(μ, ν) := { π ∈ M_+(X × X) ; P_1♯π = μ, P_2♯π = ν }.
[Examples of couplings π: discrete, semi-discrete, continuous.]
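In the discrete case a coupling is simply a nonnegative matrix whose row and column sums reproduce the two marginals. A quick numpy check using the independent coupling π = μ ⊗ ν, which always belongs to Π(μ, ν) (the specific weights are illustrative):

```python
import numpy as np

a = np.array([0.2, 0.5, 0.3])    # weights of a discrete measure mu
b = np.array([0.25, 0.25, 0.5])  # weights of a discrete measure nu

# Independent coupling pi = mu (x) nu, an element of Pi(mu, nu)
pi = np.outer(a, b)

row_sums = pi.sum(axis=1)  # P1#pi, should equal mu
col_sums = pi.sum(axis=0)  # P2#pi, should equal nu
```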
22–24. Optimal Transport. Kantorovich formulation [Kantorovitch 1942]:
W_p^p(μ, ν) := min_π { ⟨d^p, π⟩ = ∫_{X×X} d(x, y)^p dπ(x, y) ; π ∈ Π(μ, ν) }.
→ W_p is a distance on M_+(X); on Diracs, W_p(δ_x, δ_y) = d(x, y) → 0 as x → y.
→ W_p works for singular distributions.
Minimum Kantorovitch Estimator [Bassetti et al, 06]: min_θ W_p(μ_θ, ν); for a generative model μ_θ = g_θ♯ζ, this is min_θ W_p(g_θ♯ζ, ν). Algorithms: [Montavon et al 16], [Bernton et al 17], [Genevay et al 17].
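For discrete measures the Kantorovich problem is a finite linear program: minimize ⟨C, P⟩ over couplings P with prescribed row and column sums. A minimal sketch using scipy's generic LP solver (the solver choice and the two-point/three-point example are illustrative; the slides do not prescribe an algorithm here):

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(a, b, C):
    """Solve min_P <C, P> s.t. P @ 1 = a, P.T @ 1 = b, P >= 0."""
    n, m = C.shape
    # Row-sum constraints: sum_j P[i, j] = a[i]
    A_rows = np.kron(np.eye(n), np.ones((1, m)))
    # Column-sum constraints: sum_i P[i, j] = b[j]
    A_cols = np.kron(np.ones((1, n)), np.eye(m))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun, res.x.reshape(n, m)

# Two discrete measures on the line, squared-distance cost (p = 2)
x = np.array([0.0, 1.0])
y = np.array([0.0, 1.0, 2.0])
a = np.array([0.5, 0.5])
b = np.array([1 / 3, 1 / 3, 1 / 3])
C = (x[:, None] - y[None, :]) ** 2
cost, P = kantorovich(a, b, C)  # cost = W_2^2(mu, nu)
```

In 1-D with a convex cost the optimal coupling is the monotone one, which the LP recovers.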
25–28. From OT to VAE.
min_θ W_p^p(g_θ♯ζ, ν)
= min_{θ, π} { ∫_{X×X} d(x, y)^p dπ(x, y) ; π ∈ Π(μ_θ, ν) }
= min_{θ, γ} { ∫_{Z×X} d(g_θ(z), y)^p dγ(z, y) ; γ ∈ Π(ζ, ν) }
≈ min_{θ, ξ} { Σ_i d(g_θ(d_ξ(x_i)), x_i)^p ; d_ξ♯ν ≈ ζ }.
Approximation of the coupling γ by an encoder d_ξ, γ ≈ Σ_i δ_{(d_ξ(x_i), x_i)}: [Bousquet et al, 17], arXiv:1705.07642.
→ g_θ ∘ d_ξ ≈ Id: Variational Auto-Encoders [Kingma, Welling, 13].
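The last line above is an auto-encoder reconstruction loss plus a constraint on the latent distribution. A minimal sketch of the reconstruction term Σ_i d(g_θ(d_ξ(x_i)), x_i)^p with p = 2, using linear g_θ and d_ξ (the linear pseudo-inverse pair and the synthetic data are illustrative assumptions; the latent constraint d_ξ♯ν ≈ ζ is not enforced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim_x, dim_z = 200, 8, 2

# Data (x_i) lying exactly on a 2-D subspace of R^8
A = rng.standard_normal((dim_x, dim_z))
X = rng.standard_normal((n, dim_z)) @ A.T

# Linear encoder d_xi and decoder g_theta (the pseudo-inverse pair,
# so g_theta(d_xi(x)) projects x onto the subspace)
D = np.linalg.pinv(A)  # encoder: z = D @ x
G = A                  # decoder: x = G @ z

Z = X @ D.T                            # latent codes d_xi(x_i)
X_rec = Z @ G.T                        # reconstructions g_theta(d_xi(x_i))
recon_loss = np.sum((X_rec - X) ** 2)  # sum_i d(g_theta(d_xi(x_i)), x_i)^2
```

Because the data lie on the decoder's range, the reconstruction term vanishes; in a VAE both maps are deep networks trained jointly on this loss plus a latent-matching term.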

30–31. Conclusion. Unsupervised learning: → learning a (generator, discriminator) pair. Generative Adversarial Networks [Goodfellow et al, 14] → also an OT-like problem! [Arjovsky et al, 17], arXiv:1701.07875.
Open problems: → how far are VAE/GAN from "true" OT? → using OT to improve VAE/GAN training.