Optimal Transport and Deep Generative Models

Talk at "Rencontre Maths-Industrie"

Gabriel Peyré

June 20, 2017

Transcript

  1. Optimal Transport and Deep Generative Models

    Gabriel Peyré, www.numerical-tours.com. Joint work with Aude Genevay and Marco Cuturi. École Normale Supérieure.
  2. Optimal Transport: Theory to Applications

    Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani.
    [Figure: application to color transfer. Panels: source image (X); style image (Y); sliced Wasserstein projection of X to style image color statistics Y; source image after color transfer. Credit: J. Rabin, Wasserstein Regularization.]
  3. Overview

    • Density Fitting vs. Auto-Encoders
    • Discriminative vs. Generative Models
    [Diagram labels: $g_\theta$, $Z$, $X$, $\mu_\theta$, $\zeta$, $\pi$, $d_\xi$.]
  8. Discriminative vs Generative Models

    [Diagram: latent space $Z$ (low dimension, points $z$) and data space $X$ (high dimension, points $x$). Generative: $g_\theta : Z \to X$. Discriminative: $d_\xi : X \to Z$.]
    Supervised: → learn $d_\xi$ from labeled data $(x_i, z_i)_i$. Example: classification, $z$ = class probability.
    Un-supervised: → learn $(g_\theta, d_\xi)$ from data $(x_i)_i$.
      Compression: $z = d_\xi(x)$ is a representation.
      Generation: $x = g_\theta(z)$ is a synthesis.
    Density fitting: $g_\theta(\{z_i\}_i) \approx \{x_i\}_i$
    Auto-encoders: $g_\theta(d_\xi(x_i)) \approx x_i$
    Optimal transport map $d_\xi$.
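  10. Deep Discriminative vs Generative Models

    Deep networks:
    Generative: $g_\theta(z) = \rho(\theta_K(\ldots \rho(\theta_2(\rho(\theta_1(z)))) \ldots))$
    Discriminative: $d_\xi(x) = \rho(\xi_K(\ldots \rho(\xi_2(\rho(\xi_1(x)))) \ldots))$
    [Diagram: the generative network chains the layers $\theta_1, \theta_2, \ldots$ with $\rho$ from $z$ to $x$; the discriminative network chains $\xi_1, \xi_2, \ldots$ with $\rho$ from $x$ to $z$. Inset: $g_\theta$ maps the latent coordinates $(z_1, z_2)$ of $Z$ into $X$, and $d_\xi$ maps $X$ back to $Z$.]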
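The nested formula for $g_\theta$ can be read as a loop over layers. The following is only an illustrative sketch: it assumes each $\theta_k$ is an affine map stored as a (weights, bias) pair and that $\rho$ is a ReLU, neither of which is stated on the slide. The names `relu`, `affine` and `generator` are hypothetical. The discriminative map $d_\xi$ would have the same structure with parameters $\xi_k$.

```python
def relu(v):
    """Pointwise nonlinearity rho (assumed here to be a ReLU)."""
    return [max(0.0, t) for t in v]


def affine(weights, bias, v):
    """One layer theta_k: v -> weights @ v + bias, using plain lists."""
    return [sum(w * t for w, t in zip(row, v)) + b for row, b in zip(weights, bias)]


def generator(theta, z):
    """g_theta(z) = rho(theta_K(... rho(theta_2(rho(theta_1(z)))) ...))."""
    x = z
    for weights, bias in theta:
        x = relu(affine(weights, bias, x))
    return x


# Two toy layers (made-up numbers) mapping a 1-D latent code to a 3-D output.
theta = [
    ([[1.0], [-1.0]], [0.0, 0.5]),                              # theta_1: R^1 -> R^2
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0]),   # theta_2: R^2 -> R^3
]
x = generator(theta, [2.0])
```

The low-dimensional input and higher-dimensional output mirror the "low dimension $Z$, high dimension $X$" picture of the earlier slides.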
  11. Examples of Image Generation [Credit: arXiv:1511.06434]

    [Figure: images generated by $g_\theta : Z \to X$.]

  12. Overview

    • Density Fitting vs. Auto-Encoders
    • Discriminative vs. Generative Models
    [Diagram labels: $g_\theta$, $Z$, $X$, $\mu_\theta$, $\zeta$, $\pi$, $d_\xi$.]
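  15. Density Fitting vs. Generative Models

    Parametric model: $\theta \mapsto \mu_\theta$
    Observations: $\nu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$
    Density fitting: $d\mu_\theta(y) = f_\theta(y)\,dy$
    Maximum likelihood (MLE): $\min_\theta \widehat{\mathrm{KL}}(\mu_\theta | \nu) \overset{\text{def.}}{=} -\sum_{i} \log(f_\theta(x_i))$
    Generative model fit: $\mu_\theta = g_{\theta,\sharp}\zeta$, the push-forward of the latent measure $\zeta$ on $Z$ by $g_\theta : Z \to X$. Such a $\mu_\theta$ generally has no density, so $\widehat{\mathrm{KL}}(\mu_\theta | \nu) = +\infty$
    → MLE undefined.
    → Need a weaker metric.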
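To make the maximum-likelihood criterion concrete, here is a minimal sketch for one specific density model, a 1-D Gaussian with $\theta = (m, s)$. The Gaussian choice, the sample values and the function names are assumptions made for illustration; the slide does not fix a model.

```python
import math


def neg_log_likelihood(m, s, xs):
    """-sum_i log f_theta(x_i) for a 1-D Gaussian density with mean m and std s."""
    return sum(
        0.5 * math.log(2 * math.pi * s * s) + (x - m) ** 2 / (2 * s * s)
        for x in xs
    )


def gaussian_mle(xs):
    """Closed-form minimizer of the negative log-likelihood over (m, s)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return m, s


xs = [0.1, 0.4, 0.35, 0.8, 1.2]           # made-up observations x_i
m_hat, s_hat = gaussian_mle(xs)
loss_at_mle = neg_log_likelihood(m_hat, s_hat, xs)
# Concentrating the model (s close to 0) on a sample with distinct points makes the loss blow up.
loss_concentrated = neg_log_likelihood(m_hat, 1e-3, xs)
```

The growth of `loss_concentrated` as `s` shrinks gives some intuition for the slide's point: once $\mu_\theta$ degenerates to a measure without density, the criterion is no longer usable.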
  18. Comparing Measures and Spaces

    • Probability distributions and histograms
      → images, vision, graphics and machine learning, ...
    • Optimal transport
      → well defined for discrete or singular distributions ("weak" metric).
    [Figure: color transfer. Panels: source image (X); style image (Y); sliced Wasserstein projection of X to style image color statistics Y; source image after color transfer. Credit: J. Rabin, Wasserstein Regularization.]
    [Figure: optimal transport mean vs. L2 mean.]
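  21. Probability Measures and Couplings

    Marginals: $P_{1\sharp}\pi(S) \overset{\text{def.}}{=} \pi(S, X)$, $P_{2\sharp}\pi(S) \overset{\text{def.}}{=} \pi(X, S)$
    Couplings: $\Pi(\mu, \nu) \overset{\text{def.}}{=} \{\pi \in \mathcal{M}_+(X \times X) \;;\; P_{1\sharp}\pi = \mu,\ P_{2\sharp}\pi = \nu\}$
    [Figures: a coupling $\pi$ in the discrete, semi-discrete and continuous cases.]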
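In the discrete case a coupling is just a nonnegative matrix whose row sums give $\mu$ and whose column sums give $\nu$. The sketch below checks these two marginal constraints; the helper names and the example weights are hypothetical. It uses the product coupling $\pi_{ij} = \mu_i \nu_j$, which always belongs to $\Pi(\mu, \nu)$.

```python
def marginals(pi):
    """Return (P1# pi, P2# pi): the row sums and column sums of a coupling matrix."""
    rows = [sum(row) for row in pi]
    cols = [sum(col) for col in zip(*pi)]
    return rows, cols


def product_coupling(mu, nu):
    """Independent coupling pi[i][j] = mu[i] * nu[j]."""
    return [[a * b for b in nu] for a in mu]


def is_coupling(pi, mu, nu, tol=1e-12):
    """Check nonnegativity and both marginal constraints defining Pi(mu, nu)."""
    rows, cols = marginals(pi)
    return (
        all(p >= 0 for row in pi for p in row)
        and all(abs(r - a) <= tol for r, a in zip(rows, mu))
        and all(abs(c - b) <= tol for c, b in zip(cols, nu))
    )


mu = [0.5, 0.5]             # weights of a 2-point measure
nu = [0.25, 0.25, 0.5]      # weights of a 3-point measure
pi = product_coupling(mu, nu)
ok = is_coupling(pi, mu, nu)
```

A matrix with the right first marginal but the wrong second one, such as `[[0.5, 0, 0], [0, 0.5, 0]]`, is rejected by `is_coupling`.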
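  24. Optimal Transport

    Optimal transport [Kantorovitch 1942]:
    $W_p^p(\mu, \nu) \overset{\text{def.}}{=} \min_{\pi} \left\{ \langle d^p, \pi \rangle = \int_{X \times X} d(x, y)^p \, d\pi(x, y) \;;\; \pi \in \Pi(\mu, \nu) \right\}$
    → $W_p$ is a distance on $\mathcal{M}_+(X)$.
    → $W_p$ works for singular distributions: $W_p(\delta_x, \delta_y) = d(x, y) \to 0$ as $x \to y$.
    Minimum Kantorovitch Estimator: $\min_\theta W_p(\mu_\theta, \nu)$ [Bassetti et al, 06]
    For a generative model: $\min_\theta W_p(g_{\theta,\sharp}\zeta, \nu)$
    Algorithms: [Montavon et al 16], [Bernton et al 17], [Genevay et al 17]
    [Diagrams: the ground distance $d(x, y)$ between points $x$ and $y$; $g_\theta$ pushing $\zeta$ on $Z$ to $\mu_\theta$ on $X$.]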
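On the real line with $d(x, y) = |x - y|$ and $p \geq 1$, the optimal coupling between two uniform empirical measures with the same number of points simply matches sorted samples, which gives a very short way to experiment with $W_p$. The sketch below relies on that 1-D special case only. The shift model $g_\theta(z) = z + \theta$, the grid search and the sample values are assumptions chosen for illustration, not the algorithms cited on the slide.

```python
def wasserstein_1d(xs, ys, p=2):
    """W_p between uniform empirical measures on the real line (equal sample sizes).

    In 1-D, matching sorted samples is an optimal coupling for p >= 1.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many points")
    n = len(xs)
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / p)


def fit_shift(zs, ys, thetas, p=2):
    """Grid-search minimum Kantorovich estimator for the toy model g_theta(z) = z + theta."""
    return min(thetas, key=lambda t: wasserstein_1d([z + t for z in zs], ys, p))


zs = [-1.0, 0.0, 1.0]                          # latent samples from zeta (made up)
ys = [1.5, 2.0, 3.1]                           # observations defining nu (made up)
thetas = [k / 100 for k in range(-300, 301)]   # candidate shifts
theta_hat = fit_shift(zs, ys, thetas)

# Two Diracs: W_p(delta_x, delta_y) = |x - y|, finite although the supports are disjoint.
dirac_gap = wasserstein_1d([0.0], [0.3])
```

For $p = 2$ the best shift is the difference of the sample means, so the grid search should land on 2.2 for these numbers.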
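  28. From OT to VAE

    $\min_\theta W_p^p(g_{\theta,\sharp}\zeta, \nu)$
    $= \min_{\theta, \pi} \left\{ \int_{X \times X} d(x, y)^p \, d\pi(x, y) \;;\; \pi \in \Pi(\mu_\theta, \nu) \right\}$
    $= \min_{\theta, \gamma} \left\{ \int_{Z \times X} d(g_\theta(z), y)^p \, d\gamma(z, y) \;;\; \gamma \in \Pi(\zeta, \nu) \right\}$
    $\approx \min_{\theta, \xi} \left\{ \sum_i d(g_\theta(d_\xi(x_i)), x_i)^p \;;\; d_{\xi,\sharp}\nu \approx \zeta \right\}$
    Approximation of $\gamma$ [Bousquet et al, 17], arXiv:1705.07642: $\gamma_\xi \approx \sum_i \delta_{(d_\xi(x_i),\, x_i)}$
    Variational Auto-Encoders [Kingma, Welling, 13]: $g_\theta \circ d_\xi \approx \mathrm{Id}$
    [Diagram: $g_\theta$ maps $\zeta$ on $Z$ to $\mu_\theta$ on $X$; $d_\xi$ maps $X$ back to $Z$.]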
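The last line of the derivation combines a reconstruction cost with the requirement that the encoded data look like $\zeta$. One hypothetical way to turn this into a single number is to add a penalty for the latent mismatch, as sketched below for scalar data. The penalty weight `lam`, the use of a sorted-sample 1-D transport cost for the constraint, and all function names are assumptions; the slide only states the constraint $d_{\xi,\sharp}\nu \approx \zeta$ and does not say how to enforce it.

```python
def reconstruction_cost(encode, decode, xs, p=2):
    """Average of d(g_theta(d_xi(x_i)), x_i)^p with d(x, y) = |x - y|."""
    return sum(abs(decode(encode(x)) - x) ** p for x in xs) / len(xs)


def latent_mismatch(encode, xs, zeta_samples, p=2):
    """1-D transport cost between the codes d_xi(x_i) and samples of zeta (sorted matching)."""
    codes = sorted(encode(x) for x in xs)
    ref = sorted(zeta_samples)
    return sum(abs(a - b) ** p for a, b in zip(codes, ref)) / len(codes)


def relaxed_objective(encode, decode, xs, zeta_samples, lam=1.0, p=2):
    """Reconstruction cost plus lam times the latent mismatch (assumed relaxation)."""
    return (
        reconstruction_cost(encode, decode, xs, p)
        + lam * latent_mismatch(encode, xs, zeta_samples, p)
    )


xs = [2.0, 4.0, 6.0]               # made-up data
zeta_samples = [-1.0, 0.0, 1.0]    # made-up samples of the latent measure

# Encoder and decoder that invert each other and send the data exactly onto the latent samples.
good = relaxed_objective(lambda x: (x - 4.0) / 2.0, lambda z: 2.0 * z + 4.0, xs, zeta_samples)
# Identity maps reconstruct perfectly but leave the codes far from zeta.
bad = relaxed_objective(lambda x: x, lambda z: z, xs, zeta_samples)
```

The second pair shows why the constraint matters: $g_\theta \circ d_\xi = \mathrm{Id}$ alone gives zero reconstruction cost without making $g_{\theta,\sharp}\zeta$ resemble the data.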
  31. Conclusion

    Unsupervised learning:
    → learning a (generator, discriminator) pair.
    Generative Adversarial Networks [Goodfellow et al, 14]:
    → also an OT-like problem! [Arjovsky et al, 17], arXiv:1701.07875
    Open problems:
    → how far are VAE/GAN from "true" OT?
    → using OT to improve VAE/GAN training.