Slide 1

Optimal Transport and Deep Generative Models
Gabriel Peyré, École Normale Supérieure, www.numerical-tours.com
Joint works with Aude Genevay and Marco Cuturi

Slide 2

Optimal Transport: Theory to Applications
Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani
[Figures: the transport framework; sliced Wasserstein projection; application to color transfer: source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of the style image Y, source image after color transfer (J. Rabin); Wasserstein regularization.]

Slide 3

Overview
• Discriminative vs. Generative Models
• Density Fitting vs. Auto-Encoders
[Diagram: generator $g_\theta : \mathcal{Z} \to \mathcal{X}$, model measure $\mu_\theta$ generated from $\zeta$, encoder $d_\xi$.]

Slide 4

Discriminative vs. Generative Models
[Diagram: a low-dimensional latent space $\mathcal{Z}$ and a high-dimensional data space $\mathcal{X}$, linked by a generative map $g_\theta : \mathcal{Z} \to \mathcal{X}$, $z \mapsto x$, and a discriminative map $d_\xi : \mathcal{X} \to \mathcal{Z}$, $x \mapsto z$.]

Slide 5

Discriminative vs. Generative Models (cont.)
Supervised: → learn $d_\xi$ from labeled data $(x_i, z_i)_i$; e.g. classification, where $z$ = class probability.

Slide 6

Discriminative vs. Generative Models (cont.)
Unsupervised: → learn $(g_\theta, d_\xi)$ from data $(x_i)_i$.
Compression: $z = d_\xi(x)$ is a representation. Generation: $x = g_\theta(z)$ is a synthesis.

Slide 7

Discriminative vs. Generative Models (cont.)
Density fitting: $g_\theta(\{z_i\}_i) \approx \{x_i\}_i$. Auto-encoders: $g_\theta(d_\xi(x_i)) \approx x_i$.

Slide 8

Discriminative vs. Generative Models (cont.)
Optimal transport map: $d_\xi$.
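The contrast between the two unsupervised criteria can be made concrete in a few lines of numpy. This is a minimal sketch: the linear maps standing in for $g_\theta$ and $d_\xi$, the mean-matching proxy for density fitting, and all dimensions are illustrative assumptions, not constructions from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # data samples x_i in the high-dimensional space
Z = rng.normal(size=(100, 1))   # latent samples z_i in the low-dimensional space

A = rng.normal(size=(2, 1))     # decoder parameters theta (hypothetical)
B = rng.normal(size=(1, 2))     # encoder parameters xi (hypothetical)
g = lambda z: z @ A.T           # generator g_theta : Z -> X
d = lambda x: x @ B.T           # encoder/discriminator d_xi : X -> Z

# Density fitting: the point cloud g_theta({z_i}) should match {x_i} as a whole;
# a crude proxy compares the means of the two clouds.
fit_loss = float(np.sum((g(Z).mean(axis=0) - X.mean(axis=0)) ** 2))

# Auto-encoder: each sample should be reconstructed, g_theta(d_xi(x_i)) ~ x_i.
ae_loss = float(np.mean(np.sum((g(d(X)) - X) ** 2, axis=1)))
```

Note the structural difference: density fitting compares two point clouds as distributions, while the auto-encoder compares samples one by one.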

Slide 9

Deep Discriminative vs. Generative Models
[Diagram: layered networks with parameters $\theta_1, \theta_2, \dots$ (generative, $z \mapsto x$) and $\xi_1, \xi_2, \dots$ (discriminative, $x \mapsto z$), with nonlinearity $\rho$ between layers.]
Deep networks: $g_\theta(z) = \rho(\theta_K(\cdots \rho(\theta_2(\rho(\theta_1(z)))) \cdots))$ and $d_\xi(x) = \rho(\xi_K(\cdots \rho(\xi_2(\rho(\xi_1(x)))) \cdots))$.

Slide 10

Deep Discriminative vs. Generative Models (cont.)
[Diagram: latent coordinates $(z_1, z_2)$ in $\mathcal{Z}$, mapped to $\mathcal{X}$ by $g_\theta$ and back by $d_\xi$.]
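A minimal sketch of this layered composition, assuming $\rho = \mathrm{ReLU}$ and affine layers $\theta_k(u) = W_k u + b_k$ (the slides do not fix a particular $\rho$ or layer type):

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def deep_map(v, params):
    """rho(theta_K(... rho(theta_2(rho(theta_1(v)))) ...)) with affine layers."""
    for W, b in params:
        v = relu(W @ v + b)
    return v

rng = np.random.default_rng(0)
# Generator g_theta: low-dimensional z (dim 2) -> high-dimensional x (dim 8).
gen_params = [(rng.normal(size=(4, 2)), np.zeros(4)),
              (rng.normal(size=(8, 4)), np.zeros(8))]
x = deep_map(rng.normal(size=2), gen_params)   # a synthesized sample g_theta(z)
```

The same `deep_map` with transposed layer shapes implements the discriminative direction $d_\xi : \mathcal{X} \to \mathcal{Z}$.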

Slide 11

Examples of Image Generation [credit: arXiv:1511.06434]
[Figure: images synthesized by a generator $g_\theta : \mathcal{Z} \to \mathcal{X}$.]

Slide 12

Overview
• Discriminative vs. Generative Models
• Density Fitting vs. Auto-Encoders
[Diagram: generator $g_\theta : \mathcal{Z} \to \mathcal{X}$, model measure $\mu_\theta$ generated from $\zeta$, encoder $d_\xi$.]

Slide 13

Density Fitting vs. Generative Models
Parametric model: $\theta \mapsto \mu_\theta$.
Observations: $\nu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$.

Slide 14

Density Fitting vs. Generative Models (cont.)
Density fitting by maximum likelihood (MLE):
$\min_\theta \widehat{\mathrm{KL}}(\mu_\theta \,|\, \nu) \overset{\text{def.}}{=} -\sum_i \log f_\theta(x_i)$, where $\mathrm{d}\mu_\theta(y) = f_\theta(y)\,\mathrm{d}y$.

Slide 15

Density Fitting vs. Generative Models (cont.)
Generative model fit: $\mu_\theta = g_{\theta\sharp}\zeta$, the pushforward of $\zeta$ on $\mathcal{Z}$ by $g_\theta : \mathcal{Z} \to \mathcal{X}$.
Such a $\mu_\theta$ is singular (it has no density $f_\theta$), so $\widehat{\mathrm{KL}}(\mu_\theta \,|\, \nu) = +\infty$ → MLE undefined.
→ Need a weaker metric.
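On the non-degenerate side, the MLE criterion is easy to instantiate. A minimal sketch fitting a one-dimensional Gaussian family (an illustrative choice of $\mu_\theta$, not one from the talk) by minimizing $-\sum_i \log f_\theta(x_i)$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=0.7, size=200)    # observations x_i

def neg_log_likelihood(theta):
    m, log_s = theta                             # parametrize s = exp(log_s) > 0
    s = np.exp(log_s)
    # -log f_theta(x_i) for a N(m, s^2) density, summed over the samples.
    return np.sum(0.5 * ((x - m) / s) ** 2 + log_s + 0.5 * np.log(2 * np.pi))

res = minimize(neg_log_likelihood, x0=np.zeros(2))
m_hat, s_hat = res.x[0], np.exp(res.x[1])        # close to the empirical mean/std
```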

Slide 16

Comparing Measures and Spaces
• Probability distributions and histograms → images, vision, graphics and machine learning, ...
[Figure: color transfer via sliced Wasserstein projection: source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of the style image Y, source image after color transfer (J. Rabin); Wasserstein regularization.]

Slide 17

Comparing Measures and Spaces (cont.)
• Optimal transport
[Figure: optimal transport mean vs. L2 mean of two distributions.]

Slide 18

Comparing Measures and Spaces (cont.)
→ well defined for discrete or singular distributions (a “weak” metric).
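The mean comparison in the figure is easiest to see in 1D, where the optimal transport (Wasserstein) mean is obtained by averaging quantile functions, i.e. sorted samples. A minimal sketch (the two Gaussian clouds and the binning are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.sort(rng.normal(-2.0, 0.5, size=1000))   # sorted samples of measure 1
b = np.sort(rng.normal(+2.0, 0.5, size=1000))   # sorted samples of measure 2

# L2 mean: average the histograms -> bimodal, mass stays at -2 and +2.
hist_a, edges = np.histogram(a, bins=50, range=(-4, 4), density=True)
hist_b, _ = np.histogram(b, bins=50, range=(-4, 4), density=True)
l2_mean = 0.5 * (hist_a + hist_b)

# OT mean: average the quantile functions (sorted samples) -> one mode near 0.
ot_mean_samples = 0.5 * (a + b)
```

The L2 mean keeps mass at both modes, while the OT mean concentrates around a single interpolated mode, which is the point of the slide's comparison.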

Slide 19

Probability Measures and Couplings
Marginals: $P_{1\sharp}\pi(S) \overset{\text{def.}}{=} \pi(S, \mathcal{X})$, $P_{2\sharp}\pi(S) \overset{\text{def.}}{=} \pi(\mathcal{X}, S)$.
Couplings: $\Pi(\mu, \nu) \overset{\text{def.}}{=} \{\pi \in \mathcal{M}_+(\mathcal{X} \times \mathcal{X}) \,;\, P_{1\sharp}\pi = \mu,\ P_{2\sharp}\pi = \nu\}$.
[Figure: a discrete coupling $\pi$.]

Slide 20

Probability Measures and Couplings (cont.)
[Figure: a semi-discrete coupling $\pi$.]

Slide 21

Probability Measures and Couplings (cont.)
[Figure: a continuous coupling $\pi$.]
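In the discrete case these definitions reduce to matrix constraints: a coupling is a nonnegative matrix whose row sums are the weights of $\mu$ and whose column sums are the weights of $\nu$. A minimal sketch (the weights are arbitrary illustrative values):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # weights of mu on points x_1, x_2, x_3
q = np.array([0.4, 0.6])        # weights of nu on points y_1, y_2

pi = np.outer(p, q)             # independent coupling: always in Pi(mu, nu)

assert np.allclose(pi.sum(axis=1), p)   # first marginal  P_1#pi = mu
assert np.allclose(pi.sum(axis=0), q)   # second marginal P_2#pi = nu
```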

Slide 22

Optimal Transport
Optimal transport [Kantorovich 1942]:
$W_p^p(\mu, \nu) \overset{\text{def.}}{=} \min_\pi \left\{ \langle d^p, \pi \rangle = \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \,\mathrm{d}\pi(x, y) \,;\ \pi \in \Pi(\mu, \nu) \right\}$
[Figure: ground distance $d(x, y)$ between points $x$ and $y$ of the two measures.]

Slide 23

Optimal Transport (cont.)
→ $W_p$ is a distance on $\mathcal{M}_+(\mathcal{X})$.
→ $W_p$ works for singular distributions: $W_p(\delta_x, \delta_y) = d(x, y) \to 0$ as $x \to y$.

Slide 24

Optimal Transport (cont.)
Minimum Kantorovich Estimator [Bassetti et al., 06]: $\min_\theta W_p(\mu_\theta, \nu)$; for a generative model $\mu_\theta = g_{\theta\sharp}\zeta$, this becomes $\min_\theta W_p(g_{\theta\sharp}\zeta, \nu)$.
Algorithms: [Montavon et al., 16], [Bernton et al., 17], [Genevay et al., 17].
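For discrete measures the Kantorovich problem is a finite linear program, so $W_p^p$ can be computed with a generic LP solver. A minimal sketch (small random point clouds, uniform weights, and $p = 2$ are illustrative choices; dedicated OT solvers are preferable at scale):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))             # support of mu, uniform weights p
y = rng.normal(size=(4, 2))             # support of nu, uniform weights q
p, q = np.full(5, 1 / 5), np.full(4, 1 / 4)

C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** 2   # d(x_i, y_j)^2

n, m = C.shape
# Marginal constraints on the flattened (row-major) coupling pi:
# row sums of pi equal p, column sums equal q.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
W2_squared = res.fun                    # optimal cost <d^2, pi> = W_2^2(mu, nu)
```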

Slide 25

From OT to VAE
$\min_\theta W_p^p(g_{\theta\sharp}\zeta, \nu) = \min_{\theta, \pi} \left\{ \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \,\mathrm{d}\pi(x, y) \,;\ \pi \in \Pi(\mu_\theta, \nu) \right\}$
[Figure: coupling $\pi$ between the model measure $\mu_\theta$ and the data measure $\nu$.]

Slide 26

From OT to VAE (cont.)
Pulling the coupling back to $\mathcal{Z} \times \mathcal{X}$:
$\min_\theta W_p^p(g_{\theta\sharp}\zeta, \nu) = \min_{\theta, \gamma} \left\{ \int_{\mathcal{Z} \times \mathcal{X}} d(g_\theta(z), y)^p \,\mathrm{d}\gamma(z, y) \,;\ \gamma \in \Pi(\zeta, \nu) \right\}$

Slide 27

From OT to VAE (cont.)
Approximation of the coupling $\gamma$ by an encoder $d_\xi$ [Bousquet et al., 17, arXiv:1705.07642]: $\gamma \approx \frac{1}{n} \sum_i \delta_{(d_\xi(x_i),\, x_i)}$, which gives
$\approx \min_{\theta, \xi} \left\{ \sum_i d(g_\theta(d_\xi(x_i)), x_i)^p \,;\ d_{\xi\sharp}\nu \approx \zeta \right\}$

Slide 28

From OT to VAE (cont.)
Variational Auto-Encoders [Kingma, Welling, 13]: $g_\theta \circ d_\xi \approx \mathrm{Id}$, with the encoder pushing the data forward to the prior, $d_{\xi\sharp}\nu \approx \zeta$.
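A minimal sketch of evaluating the relaxed objective above, with the constraint $d_{\xi\sharp}\nu \approx \zeta$ replaced by a crude moment-matching penalty against $\zeta = \mathcal{N}(0, I)$; the linear encoder/decoder, the weight `lam`, and this choice of penalty are illustrative assumptions (VAEs and [Bousquet et al., 17] use different divergences):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))     # data samples x_i ~ nu
A = rng.normal(size=(2, 1))       # decoder parameters theta (hypothetical)
B = rng.normal(size=(1, 2))       # encoder parameters xi (hypothetical)
g = lambda z: z @ A.T             # generator g_theta : Z -> X
d = lambda x: x @ B.T             # encoder d_xi : X -> Z

lam, p = 10.0, 2                  # penalty weight and exponent (assumed)

Z_enc = d(X)                      # samples of the pushforward d_xi#nu
# Reconstruction term: sum_i d(g_theta(d_xi(x_i)), x_i)^p, averaged here.
recon = np.mean(np.sum((g(Z_enc) - X) ** 2, axis=1) ** (p / 2))
# Soft constraint d_xi#nu ~ zeta = N(0, I): match the first two moments.
penalty = np.sum(Z_enc.mean(axis=0) ** 2) + np.sum((np.cov(Z_enc.T) - np.eye(1)) ** 2)
loss = recon + lam * penalty
```

In a real training loop this `loss` would be minimized over the parameters of $g_\theta$ and $d_\xi$ by stochastic gradient descent.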

Slide 29

Conclusion
Unsupervised learning: → learning a (generator, discriminator) pair.

Slide 30

Conclusion (cont.)
Generative Adversarial Networks [Goodfellow et al., 14]: → also an OT-like problem! [Arjovsky et al., 17, arXiv:1701.07875]

Slide 31

Conclusion (cont.)
Open problems:
→ how far are VAE/GAN from “true” OT?
→ using OT to improve VAE/GAN training.