
Optimal Transport and Deep Generative Models

Talk at "Rencontre Maths-Industrie"

Gabriel Peyré

June 20, 2017
Transcript

  1. Optimal Transport and Deep Generative Models
    Gabriel Peyré
    www.numerical-tours.com
    Joint works with Aude Genevay and Marco Cuturi
    École Normale Supérieure

  2. Optimal Transport: Theory to Applications
    Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani
    [Figure: transport framework and application to color transfer —
    source image (X), style image (Y), sliced Wasserstein projection of X
    onto the color statistics of the style image Y, source image after
    color transfer. Credit: J. Rabin, Wasserstein Regularization]

  3. Overview
    Discriminative vs. Generative Models
    Density Fitting vs. Auto-Encoders
    [Diagram: generator g_θ : Z → X, model measure μ_θ = g_θ♯ζ, encoder d_ξ]

  4. Discriminative vs. Generative Models
    [Diagram: latent space Z (low dimension), data space X (high dimension);
    generator g_θ : Z → X (generative), encoder d_ξ : X → Z (discriminative)]
    Supervised: learn d_ξ from labeled data (x_i, z_i)_i
    (e.g. classification, where z = class probability).
    Unsupervised: learn (g_θ, d_ξ) from data (x_i)_i.
    Compression: z = d_ξ(x) is a representation.
    Generation: x = g_θ(z) is a synthesis.
    Density fitting: g_θ({z_i}_i) ≈ {x_i}_i.
    Auto-encoders: g_θ(d_ξ(x_i)) ≈ x_i, with optimal transport map d_ξ.
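
To make the two unsupervised criteria concrete, here is a minimal numpy sketch (a hypothetical toy setup with linear maps standing in for g_θ and d_ξ) contrasting the density-fitting residual g_θ({z_i}) ≈ {x_i} with the auto-encoder reconstruction g_θ(d_ξ(x_i)) ≈ x_i; the crude moment-matching surrogate below is only for illustration, not the OT metric of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear generator g_theta : R^2 -> R^10 and encoder d_xi : R^10 -> R^2
# (hypothetical stand-ins for the deep networks introduced later).
theta = rng.standard_normal((10, 2))
xi = rng.standard_normal((2, 10))

def g(z):                      # generator: latent -> data
    return z @ theta.T

def d(x):                      # encoder: data -> latent
    return x @ xi.T

x = rng.standard_normal((100, 10))   # observations (x_i)_i
z = rng.standard_normal((100, 2))    # latent samples (z_i)_i

# Density fitting: compare the generated cloud g_theta({z_i}) to {x_i}
# (crude first-moment surrogate for a distance between the two clouds).
density_gap = np.linalg.norm(g(z).mean(axis=0) - x.mean(axis=0))

# Auto-encoders: reconstruction error g_theta(d_xi(x_i)) vs x_i.
recon_err = np.mean(np.sum((g(d(x)) - x) ** 2, axis=1))

print(density_gap, recon_err)
```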

  5. Deep Discriminative vs. Generative Models
    Deep networks:
    g_θ(z) = ρ(θ_K(··· ρ(θ_2(ρ(θ_1(z)))) ···))
    d_ξ(x) = ρ(ξ_K(··· ρ(ξ_2(ρ(ξ_1(x)))) ···))
    [Diagram: generative network g_θ : Z → X with layers θ_1, θ_2, … and
    intermediate activations z_1, z_2; discriminative network d_ξ : X → Z
    with layers ξ_1, ξ_2, …]
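
A minimal numpy sketch of these compositions, assuming affine layers θ_k = (W_k, b_k) and ρ = ReLU (the slide leaves both unspecified):

```python
import numpy as np

def relu(u):
    """Pointwise nonlinearity rho (ReLU chosen here for illustration)."""
    return np.maximum(u, 0.0)

def make_layers(dims, rng):
    """Affine layers theta_k = (W_k, b_k) mapping dims[k] -> dims[k+1]."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
            for n, m in zip(dims[:-1], dims[1:])]

def forward(layers, v):
    """rho(theta_K(... rho(theta_2(rho(theta_1(v)))) ...))"""
    for W, b in layers:
        v = relu(W @ v + b)
    return v

rng = np.random.default_rng(0)
g_layers = make_layers([2, 32, 32, 10], rng)   # g_theta : Z = R^2 -> X = R^10
d_layers = make_layers([10, 32, 32, 2], rng)   # d_xi    : X = R^10 -> Z = R^2

z = rng.standard_normal(2)
x = forward(g_layers, z)         # synthesis x = g_theta(z)
z_hat = forward(d_layers, x)     # representation z = d_xi(x)
```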

  6. Examples of Image Generation
    [Generated image samples; credit arXiv:1511.06434]
    g_θ : Z → X

  7. Overview
    Discriminative vs. Generative Models
    Density Fitting vs. Auto-Encoders
    [Diagram: generator g_θ : Z → X, model measure μ_θ = g_θ♯ζ, encoder d_ξ]

  8. Density Fitting vs. Generative Models
    Parametric model: θ ↦ μ_θ
    Observations: ν = (1/n) Σ_{i=1}^n δ_{x_i}
    Density fitting (maximum likelihood, MLE), when dμ_θ(y) = f_θ(y) dy:
    min_θ KL(μ_θ | ν) := −Σ_i log(f_θ(x_i))
    Generative model fit: μ_θ = g_θ♯ζ is singular, so KL(μ_θ | ν) = +∞
    → MLE undefined.
    → Need a weaker metric.
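
As a concrete instance of the MLE, a minimal sketch fitting a Gaussian family f_θ (my choice for illustration; any parametric density works) to the empirical measure ν:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # observations (x_i)_i

def neg_log_likelihood(mean, std, x):
    """-sum_i log f_theta(x_i) for the Gaussian density f_theta."""
    return np.sum(0.5 * ((x - mean) / std) ** 2
                  + np.log(std) + 0.5 * np.log(2 * np.pi))

# For the Gaussian family the minimizer is closed-form:
# the empirical mean and standard deviation.
mean_hat, std_hat = x.mean(), x.std()
print(mean_hat, std_hat, neg_log_likelihood(mean_hat, std_hat, x))
```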

  9. Comparing Measures and Spaces
    Probability distributions and histograms
    → images, vision, graphics and machine learning, …
    [Figure: color transfer example (source image X, style image Y) and a
    comparison of the optimal transport mean vs. the L2 mean of two
    distributions. Credit: J. Rabin, Wasserstein Regularization]
    • Optimal transport → well defined for discrete or singular
      distributions (a "weak" metric).

  10. Probability Measures and Couplings
    Marginals: P_{1♯}π(S) := π(S, X),  P_{2♯}π(S) := π(X, S)
    Couplings: Π(μ, ν) := {π ∈ M_+(X × X) ; P_{1♯}π = μ, P_{2♯}π = ν}
    [Figure: example couplings in the discrete, semi-discrete and
    continuous cases]
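
In the discrete case a coupling π ∈ Π(μ, ν) is simply a nonnegative matrix whose row and column sums reproduce the two marginals; a small numpy sketch:

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])   # discrete measure mu on 3 points
nu = np.array([0.4, 0.6])        # discrete measure nu on 2 points

# The independent (product) coupling always belongs to Pi(mu, nu).
pi = np.outer(mu, nu)

# Marginal constraints: P_{1#}pi = mu (row sums), P_{2#}pi = nu (column sums).
assert np.allclose(pi.sum(axis=1), mu)
assert np.allclose(pi.sum(axis=0), nu)
```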

  11. Optimal Transport
    Optimal transport [Kantorovitch 1942]:
    W_p^p(μ, ν) := min_π { ⟨d^p, π⟩ = ∫_{X×X} d(x, y)^p dπ(x, y) ; π ∈ Π(μ, ν) }
    → W_p is a distance on M_+(X).
    → W_p(δ_x, δ_y) = d(x, y) → 0 as x → y.
    → W_p works for singular distributions.
    Minimum Kantorovitch Estimator [Bassetti et al, 06]: min_θ W_p(μ_θ, ν);
    for a generative model, min_θ W_p(g_θ♯ζ, ν).
    Algorithms: [Montavon et al 16], [Bernton et al 17], [Genevay et al 17]
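
A minimal sketch computing W_p^p between two small discrete measures by solving the Kantorovich linear program directly with scipy (fine at this scale; large problems call for dedicated OT solvers):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))    # support points of mu
y = rng.standard_normal((4, 2))    # support points of nu
mu = np.full(5, 1 / 5)             # uniform weights
nu = np.full(4, 1 / 4)

p = 2
cost = np.linalg.norm(x[:, None] - y[None, :], axis=2) ** p  # d(x_i, y_j)^p

n, m = cost.shape
# Marginal constraints on the (row-major flattened) coupling pi:
A_rows = np.kron(np.eye(n), np.ones(m))   # sum_j pi_ij = mu_i
A_cols = np.kron(np.ones(n), np.eye(m))   # sum_i pi_ij = nu_j
res = linprog(cost.ravel(),
              A_eq=np.vstack([A_rows, A_cols]),
              b_eq=np.concatenate([mu, nu]),
              bounds=(0, None))
print("W_p^p(mu, nu) =", res.fun)
```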

  12. From OT to VAE
    min_θ W_p^p(g_θ♯ζ, ν)
      = min_{θ,π} { ∫_{X×X} d(x, y)^p dπ(x, y) ; π ∈ Π(μ_θ, ν) }
      = min_{θ,π̄} { ∫_{Z×X} d(g_θ(z), y)^p dπ̄(z, y) ; π̄ ∈ Π(ζ, ν) }
      ≈ min_{θ,ξ} { Σ_i d(g_θ(d_ξ(x_i)), x_i)^p ; d_ξ♯ν ≈ ζ }
    Approximation of π by the encoder: π ≈ Σ_i δ_{(d_ξ(x_i), x_i)}
    [Bousquet et al, 17], ArXiv:1705.07642
    Variational Auto-Encoders [Kingma, Welling, 13]: g_θ ∘ d_ξ ≈ Id
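
A minimal numpy sketch of the last relaxation (illustrative choices on my part: linear g_θ and d_ξ, p = 2, and a simple moment penalty standing in for the constraint d_ξ♯ν ≈ ζ with ζ = N(0, I)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 10))            # observations (x_i)_i
theta = rng.standard_normal((10, 2)) * 0.1    # linear generator g_theta
xi = rng.standard_normal((2, 10)) * 0.1       # linear encoder d_xi

def objective(theta, xi, x, lam=1.0):
    z = x @ xi.T                         # d_xi(x_i)
    recon = z @ theta.T                  # g_theta(d_xi(x_i))
    fit = np.sum((recon - x) ** 2)       # sum_i d(g_theta(d_xi(x_i)), x_i)^2
    # Penalty standing in for d_xi#nu ~ zeta = N(0, I):
    # match the mean and covariance of the encoded points.
    mismatch = (np.sum(z.mean(axis=0) ** 2)
                + np.sum((np.cov(z, rowvar=False) - np.eye(2)) ** 2))
    return fit + lam * mismatch

print(objective(theta, xi, x))
```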

  13. Conclusion
    Unsupervised learning:
    → learning a (generator, discriminator) pair.
    Generative Adversarial Networks [Goodfellow et al, 14]:
    → also an OT-like problem! [Arjovsky et al, 17], ArXiv:1701.07875
    Open problems:
    → how far are VAE/GAN from "true" OT?
    → using OT to improve VAE/GAN training.