Slide 1

Slide 1 text

Numerical Optimal Transport
Gabriel Peyré, École Normale Supérieure
… with some demos: www.numerical-tours.com http://optimaltransport.github.io

Slide 2

Slide 2 text

Overview • Measures and Histograms • From Monge to Kantorovitch Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

Slide 3

Slide 3 text

Comparing Measures and Spaces
[Figure: source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of Y, and the source image after color transfer; J. Rabin, Wasserstein regularization]
→ images, vision, graphics and machine learning, . . .
• Probability distributions and histograms

Slide 4

Slide 4 text

Comparing Measures and Spaces
[Figure: source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of Y, and the source image after color transfer; J. Rabin, Wasserstein regularization]
→ images, vision, graphics and machine learning, . . .
• Probability distributions and histograms
• Optimal transport
[Figure: optimal transport mean vs. $L^2$ mean]

Slide 5

Slide 5 text

Probability Measures
Positive Radon measure $\mu$ on a set $X$: either with a density $d\mu(x) = m(x)dx$, or discrete $\mu = \sum_i \mu_i \delta_{x_i}$.
Measure of sets $A \subset X$: $\mu(A) = \int_A d\mu(x) \geq 0$.
[Figure 2.1 of the Computational OT book: discrete distributions $\alpha = \sum_{i=1}^n a_i \delta_{x_i}$ (red: empirical uniform $a_i = 1/n$, blue: arbitrary weights) and densities $d\alpha(x) = \rho_\alpha(x)dx$ (violet), in 1-D and 2-D; discrete distributions are displayed as vertical segments of length $a_i$ in 1-D and as point clouds of radius $a_i$ in 2-D.]

Slide 6

Slide 6 text

Probability Measures
Positive Radon measure $\mu$ on a set $X$: either with a density $d\mu(x) = m(x)dx$, or discrete $\mu = \sum_i \mu_i \delta_{x_i}$.
Measure of sets $A \subset X$: $\mu(A) = \int_A d\mu(x) \geq 0$.
Integration against continuous functions: $\int_X g(x)\,d\mu(x) \geq 0$ (for $g \geq 0$), with $\int_X g\,d\mu = \int_X g(x)\,m(x)\,dx$ in the density case and $\int_X g\,d\mu = \sum_i \mu_i\, g(x_i)$ in the discrete case.

Slide 7

Slide 7 text

Probability Measures
Positive Radon measure $\mu$ on a set $X$: either with a density $d\mu(x) = m(x)dx$, or discrete $\mu = \sum_i \mu_i \delta_{x_i}$.
Measure of sets $A \subset X$: $\mu(A) = \int_A d\mu(x) \geq 0$.
Integration against continuous functions: $\int_X g(x)\,d\mu(x) \geq 0$ (for $g \geq 0$), with $\int_X g\,d\mu = \int_X g(x)\,m(x)\,dx$ in the density case and $\int_X g\,d\mu = \sum_i \mu_i\, g(x_i)$ in the discrete case.
Probability (normalized) measure: $\mu(X) = \int_X d\mu(x) = 1$.

Slide 8

Slide 8 text

Measures and Random Variables
Random vectors ↔ Radon measures: $\mathbb{P}(X \in A) = \mu(A) = \int_A d\mu(x)$.
Convergence in law: $\mathbb{P}(X_n \in A) \xrightarrow{n\to+\infty} \mathbb{P}(X \in A)$ for every set $A$.
Weak* convergence: $\int f\,d\mu_n \xrightarrow{n\to+\infty} \int f\,d\mu$ for every continuous function $f$.

Slide 9

Slide 9 text

Measures and Random Variables
Random vectors ↔ Radon measures: $\mathbb{P}(X \in A) = \mu(A) = \int_A d\mu(x)$.
Convergence in law: $\mathbb{P}(X_n \in A) \xrightarrow{n\to+\infty} \mathbb{P}(X \in A)$ for every set $A$.
Weak* convergence: $\int f\,d\mu_n \xrightarrow{n\to+\infty} \int f\,d\mu$ for every continuous function $f$.
Weak convergence: . . .

Slide 10

Slide 10 text

Histogram vs. Empirical
Discretization: discrete measure $\mu = \sum_{i=1}^N \mu_i \delta_{x_i}$, with $x_i \in X$ and $\sum_i \mu_i = 1$.
Lagrangian (point clouds): constant weights $\mu_i = \frac{1}{N}$, free positions; quotient space $X^N / \Sigma_N$.
Eulerian (histograms): fixed positions $x_i$ (e.g. a grid); weights in the convex polytope (simplex) $\{(\mu_i)_i \geq 0 \;;\; \sum_i \mu_i = 1\}$.

Slide 11

Slide 11 text

Push Forward
Radon measures $(\mu, \nu)$ on $(X, Y)$. Transfer of measure by $f : X \to Y$: push forward $\nu = f_\sharp\mu$, defined by
$\nu(A) \stackrel{\text{def.}}{=} \mu(f^{-1}(A)) \iff \int_Y g(y)\,d\nu(y) \stackrel{\text{def.}}{=} \int_X g(f(x))\,d\mu(x)$.
[Figure: $y = f(x)$ mapping $\mu$ on $X$ to $\nu$ on $Y$, with a set $A$ and its preimage $f^{-1}(A)$.]

Slide 12

Slide 12 text

Push Forward
Radon measures $(\mu, \nu)$ on $(X, Y)$. Transfer of measure by $f : X \to Y$: push forward $\nu = f_\sharp\mu$, defined by
$\nu(A) \stackrel{\text{def.}}{=} \mu(f^{-1}(A)) \iff \int_Y g(y)\,d\nu(y) \stackrel{\text{def.}}{=} \int_X g(f(x))\,d\mu(x)$.
Smooth densities $d\mu = \rho(x)dx$, $d\nu = \xi(x)dx$:
$f_\sharp\mu = \nu \iff \xi(f(x))\,|\det(\partial f(x))| = \rho(x)$.

Slide 13

Slide 13 text

Push-forward vs. Pull-back
$f : X \to Y$. Measures: push-forward $f_\sharp : \mathcal{M}(X) \to \mathcal{M}(Y)$, with $f_\sharp\big(\sum_i \mu_i\,\delta_{x_i}\big) \stackrel{\text{def.}}{=} \sum_i \mu_i\,\delta_{f(x_i)}$.
Functions: pull-back $f^\sharp : \mathcal{C}(Y) \to \mathcal{C}(X)$, $f^\sharp\varphi \stackrel{\text{def.}}{=} \varphi \circ f$.
Remark: $f_\sharp$ and $f^\sharp$ are adjoints: $\int_Y \varphi\, d(f_\sharp\mu) = \int_X (f^\sharp\varphi)\, d\mu$.

Slide 14

Slide 14 text

Convergence of Random Variables
Almost sure: $\mathbb{P}(\lim_{n\to+\infty} X_n = X) = 1$.
In mean: $\lim_{n\to+\infty} \mathbb{E}(|X_n - X|^p) = 0$.
In probability: $\forall \varepsilon > 0$, $\mathbb{P}(|X_n - X| > \varepsilon) \xrightarrow{n\to+\infty} 0$.
In law: $\mathbb{P}(X_n \in A) \xrightarrow{n\to+\infty} \mathbb{P}(X \in A)$ (here the $X_n$ can be defined on different spaces).
Almost sure $\Rightarrow$ in probability $\Rightarrow$ in law, and in mean $\Rightarrow$ in probability.

Slide 15

Slide 15 text

Overview • Measures and Histograms • From Monge to Kantorovitch Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

Slide 16

Slide 16 text

Gaspard Monge (1746-1818) (1784)

Slide 17

Slide 17 text

Monge Transport
$\min_{\nu = f_\sharp\mu} \int_X c(x, f(x))\,d\mu(x)$

Slide 18

Slide 18 text

Monge Transport
$\min_{\nu = f_\sharp\mu} \int_X c(x, f(x))\,d\mu(x)$
Theorem [Brenier]: for $c(x,y) = \|x - y\|^2$ and $(\mu, \nu)$ with densities, there exists a unique optimal $f$. One has $f = \nabla\varphi$, where $\varphi$ is the unique convex function such that $(\nabla\varphi)_\sharp\mu = \nu$.

Slide 19

Slide 19 text

Monge Transport
$\min_{\nu = f_\sharp\mu} \int_X c(x, f(x))\,d\mu(x)$
Theorem [Brenier]: for $c(x,y) = \|x - y\|^2$ and $(\mu, \nu)$ with densities, there exists a unique optimal $f$. One has $f = \nabla\varphi$, where $\varphi$ is the unique convex function such that $(\nabla\varphi)_\sharp\mu = \nu$.
Monge-Ampère equation: $\xi(\nabla\varphi)\,\det(\partial^2\varphi) = \rho$.

Slide 20

Slide 20 text

Monge Transport
$\min_{\nu = f_\sharp\mu} \int_X c(x, f(x))\,d\mu(x)$
Theorem [Brenier]: for $c(x,y) = \|x - y\|^2$ and $(\mu, \nu)$ with densities, there exists a unique optimal $f$. One has $f = \nabla\varphi$, where $\varphi$ is the unique convex function such that $(\nabla\varphi)_\sharp\mu = \nu$.
Monge-Ampère equation: $\xi(\nabla\varphi)\,\det(\partial^2\varphi) = \rho$.
Non-uniqueness / non-existence: [Figure: two distinct maps $f, f'$ pushing the same $\mu$ to the same $\nu$; discrete examples where no Monge map exists.]

Slide 21

Slide 21 text

Leonid Kantorovich (1912-1986) Леонид Витальевич Канторович

Slide 22

Slide 22 text

Before Kantorovitch
Optimal transport was formulated in 1930 by A.N. Tolstoi, 12 years before Kantorovich. He even solved a "large scale" 10×68 instance!

Slide 23

Slide 23 text

Kantorovitch's Formulation
Points $(x_i)_i$, $(y_j)_j$. Input distributions: $\mu = \sum_i \mu_i\,\delta_{x_i}$, $\nu = \sum_j \nu_j\,\delta_{y_j}$.
Def. Couplings: $T \in \mathcal{C}_{\mu,\nu} \stackrel{\text{def.}}{=} \{T \in (\mathbb{R}_+)^{N\times P} \;;\; T\mathbf{1} = \mu,\ T^\top\mathbf{1} = \nu\}$.

Slide 24

Slide 24 text

Kantorovitch's Formulation
Points $(x_i)_i$, $(y_j)_j$. Input distributions: $\mu = \sum_i \mu_i\,\delta_{x_i}$, $\nu = \sum_j \nu_j\,\delta_{y_j}$.
Def. Couplings: $T \in \mathcal{C}_{\mu,\nu} \stackrel{\text{def.}}{=} \{T \in (\mathbb{R}_+)^{N\times P} \;;\; T\mathbf{1} = \mu,\ T^\top\mathbf{1} = \nu\}$.
Def. Wasserstein distance / EMD [Kantorovich 1942]: $W(\mu,\nu)^p \stackrel{\text{def.}}{=} \min_{T\in\mathcal{C}_{\mu,\nu}} \sum_{i,j} T_{i,j}\,\|x_i - y_j\|^p$.

Slide 25

Slide 25 text

OT Between General Measures
Marginals: $P_{1\sharp}\pi(S) \stackrel{\text{def.}}{=} \pi(S, X)$, $P_{2\sharp}\pi(S) \stackrel{\text{def.}}{=} \pi(X, S)$.
Couplings: $\Pi(\mu,\nu) \stackrel{\text{def.}}{=} \{\pi \in \mathcal{M}_+(X\times X) \;;\; P_{1\sharp}\pi = \mu,\ P_{2\sharp}\pi = \nu\}$.
Optimal transport [Kantorovitch 1942]: $W_p^p(\mu,\nu) \stackrel{\text{def.}}{=} \min_\pi \big\{ \langle d^p, \pi\rangle = \int_{X\times X} d(x,y)^p\,d\pi(x,y) \;;\; \pi\in\Pi(\mu,\nu) \big\}$.

Slide 26

Slide 26 text

Couplings: the 3 Settings
[Figure: couplings $\pi$ between the input measures in the discrete, continuous, and semi-discrete settings.]

Slide 27

Slide 27 text

Couplings
[Figure: four examples of couplings $\pi$ between input measures.]

Slide 28

Slide 28 text

1-D Optimal Transport
Remark. If $\Omega = \mathbb{R}$, $c(x,y) = c(|x-y|)$ with $c$ convex, and $F_\mu^{-1}, F_\nu^{-1}$ are the quantile functions, then
$W(\mu,\nu) = \int_0^1 c(|F_\mu^{-1}(x) - F_\nu^{-1}(x)|)\,dx$.
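Since the optimal 1-D coupling is monotone, this quantile formula reduces to a sort for empirical measures with uniform weights. A minimal numpy sketch (not from the slides; the function name and example are mine):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p^p between (1/n) sum_i delta_{x_i} and (1/n) sum_j delta_{y_j}
    on the real line: sorted samples are the discrete quantile functions,
    so the optimal coupling matches them in order."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)

# Example: samples of N(0,1) vs N(2,1); W_2^2 should be close to
# |0-2|^2 + (1-1)^2 = 4.
rng = np.random.default_rng(0)
print(wasserstein_1d(rng.normal(0, 1, 2000), rng.normal(2, 1, 2000)))
```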

Slide 29

Slide 29 text

OT Between Gaussians
Remark. If $\Omega = \mathbb{R}^d$, $c(x,y) = \|x-y\|^2$, and $\mu = \mathcal{N}(m_\mu, \Sigma_\mu)$, $\nu = \mathcal{N}(m_\nu, \Sigma_\nu)$, then
$W_2^2(\mu,\nu) = \|m_\mu - m_\nu\|^2 + B(\Sigma_\mu, \Sigma_\nu)^2$,
where $B$ is the Bures metric $B(\Sigma_\mu,\Sigma_\nu)^2 = \operatorname{trace}\big(\Sigma_\mu + \Sigma_\nu - 2(\Sigma_\mu^{1/2}\Sigma_\nu\Sigma_\mu^{1/2})^{1/2}\big)$.
The map $T : x \mapsto m_\nu + A(x - m_\mu)$ is optimal, where $A = \Sigma_\mu^{-1/2}\big(\Sigma_\mu^{1/2}\Sigma_\nu\Sigma_\mu^{1/2}\big)^{1/2}\Sigma_\mu^{-1/2}$.
[Figure: densities $\rho_\alpha$, $\rho_\beta$ of 1-D Gaussians $\frac{1}{\sqrt{2\pi}\sigma}e^{-(x-m)^2/(2\sigma^2)}$.]
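This closed form is easy to evaluate numerically. A small sketch using scipy's matrix square root (function name and example values are mine), directly transcribing the formula above:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_sq(m_mu, S_mu, m_nu, S_nu):
    """W_2^2 between N(m_mu, S_mu) and N(m_nu, S_nu):
    squared mean distance plus squared Bures metric."""
    S_mu_half = sqrtm(S_mu)
    cross = sqrtm(S_mu_half @ S_nu @ S_mu_half)
    bures_sq = np.trace(S_mu + S_nu - 2 * np.real(cross))
    return np.sum((m_mu - m_nu) ** 2) + bures_sq

# Example in dimension 2.
m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([3.0, 0.0]), np.diag([2.0, 0.5])
print(gaussian_w2_sq(m1, S1, m2, S2))
```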

Slide 30

Slide 30 text

OT on Gaussians and Bures' Distance

Slide 31

Slide 31 text

W1 OT and Min-cost Flows

Slide 32

Slide 32 text

Metrics on the Space of Measures
Bins-to-bins metrics: $d\mu(x) = \rho(x)dx$, $d\tilde\mu(x) = \tilde\rho(x)dx$.
Kullback-Leibler divergence: $D_{\mathrm{KL}}(\mu, \tilde\mu) = \int \rho(x)\log\frac{\rho(x)}{\tilde\rho(x)}\,dx$.
Hellinger distance: $D_H(\mu, \tilde\mu)^2 = \int \big(\sqrt{\rho(x)} - \sqrt{\tilde\rho(x)}\big)^2 dx$.

Slide 33

Slide 33 text

Metrics on the Space of Measures
Bins-to-bins metrics: $d\mu(x) = \rho(x)dx$, $d\tilde\mu(x) = \tilde\rho(x)dx$.
Kullback-Leibler divergence: $D_{\mathrm{KL}}(\mu, \tilde\mu) = \int \rho(x)\log\frac{\rho(x)}{\tilde\rho(x)}\,dx$.
Hellinger distance: $D_H(\mu, \tilde\mu)^2 = \int \big(\sqrt{\rho(x)} - \sqrt{\tilde\rho(x)}\big)^2 dx$.
Effect of a translation $\mu_\tau$ of $\mu$: bins-to-bins metrics saturate, $D(\mu, \mu_\tau) \approx \mathrm{cst}$, whereas $W_2(\mu, \mu_\tau) = |\tau|$.

Slide 34

Slide 34 text

Csiszar Divergence vs Dual Norms
Csiszár divergences: $D_\varphi(\alpha|\beta) \stackrel{\text{def.}}{=} \int_X \varphi\big(\frac{d\alpha}{d\beta}\big)\,d\beta$ → KL, TV, $\chi^2$, Hellinger, . . . (strong topology).
Dual norms: $\|\alpha - \beta\|_B \stackrel{\text{def.}}{=} \max_{f\in B} \int_X f(x)\,(d\alpha(x) - d\beta(x))$ → $W_1$, flat norm, RKHS$^*$, energy distance, . . . (weak topology).

Slide 35

Slide 35 text

Csiszar Divergence
Csiszár divergences: a unifying way to define losses between arbitrary positive measures (discrete & densities). https://en.wikipedia.org/wiki/F-divergence …

Slide 36

Slide 36 text

Dual Norms
Dual norms (aka Integral Probability Metrics): $\|\alpha - \beta\|_B \stackrel{\text{def.}}{=} \max\big\{\int_X f(x)\,(d\alpha(x) - d\beta(x)) \;;\; f \in B\big\}$.
Wasserstein-1: $B = \{f \;;\; \|\nabla f\|_\infty \leq 1\}$.
Flat norm: $B = \{f \;;\; \|f\|_\infty \leq 1,\ \|\nabla f\|_\infty \leq 1\}$.
RKHS: $B = \{f \;;\; \|f\|_k^2 \leq 1\}$, with
$\|\alpha-\beta\|_B^2 = \int k(x,x')\,d\alpha(x)d\alpha(x') + \int k(y,y')\,d\beta(y)d\beta(y') - 2\int k(x,y)\,d\alpha(x)d\beta(y)$.
Energy distance: $k(x,y) = -\|x-y\|$. Gaussian: $k(x,y) = e^{-\|x-y\|^2/(2\sigma^2)}$.
[Figure: these norms between a measure and its translate by $t$ (Energy, Gauss, $W_1$, Flat).]

Slide 37

Slide 37 text

RKHS Norms aka Maximum Mean Discrepancy

Slide 38

Slide 38 text

The Earth Mover's Distance
$\mathrm{dist}(I_1, I_2) = W_1(\mu, \nu)$ [Rubner'98]

Slide 39

Slide 39 text

The Word Mover's Distance [Kusner'15]
$\mathrm{dist}(D_1, D_2) = W_2(\mu, \nu)$

Slide 40

Slide 40 text

Overview • Measures and Histograms • From Monge to Kantorovitch Formulations • Linear Programming and Semi-discrete • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

Slide 41

Slide 41 text

Algorithms
Linear programming: $\mu = \sum_{i=1}^{N_1} p_i\,\delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} q_j\,\delta_{y_j}$.

Slide 42

Slide 42 text

Algorithms
Linear programming: $\mu = \sum_{i=1}^{N_1} p_i\,\delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} q_j\,\delta_{y_j}$ ~ $O(N^3)$.
Hungarian/Auction: $\mu = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$, $\nu = \frac{1}{N}\sum_{j=1}^N \delta_{y_j}$.
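For reference, the Kantorovitch problem really is a plain linear program and can be handed to a generic solver; a minimal scipy sketch (my function name and example; dedicated network-simplex codes are far faster in practice, as the following slides note):

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(p, q, C):
    """min_{T in C_{p,q}} <C, T>; variables are the N1*N2 entries of T,
    flattened row-major, constrained to have marginals p and q."""
    N1, N2 = C.shape
    A_eq = np.zeros((N1 + N2, N1 * N2))
    for i in range(N1):
        A_eq[i, i * N2:(i + 1) * N2] = 1.0   # row sums: T 1 = p
    for j in range(N2):
        A_eq[N1 + j, j::N2] = 1.0            # column sums: T^T 1 = q
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.x.reshape(N1, N2), res.fun

# Example: random planar points, squared Euclidean cost.
rng = np.random.default_rng(0)
x, y = rng.random((5, 2)), rng.random((6, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
T, cost = ot_lp(np.full(5, 1 / 5), np.full(6, 1 / 6), C)
print(cost)
```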

Slide 43

Slide 43 text

Algorithms
Linear programming: $\mu = \sum_{i=1}^{N_1} p_i\,\delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} q_j\,\delta_{y_j}$ ~ $O(N^3)$.
Hungarian/Auction: $\mu = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$, $\nu = \frac{1}{N}\sum_{j=1}^N \delta_{y_j}$.
[Figure: matching between point clouds $\mu$ and $\nu$.]

Slide 44

Slide 44 text

Algorithms
Linear programming: $\mu = \sum_{i=1}^{N_1} p_i\,\delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} q_j\,\delta_{y_j}$ ~ $O(N^3)$.
Hungarian/Auction: $\mu = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$, $\nu = \frac{1}{N}\sum_{j=1}^N \delta_{y_j}$.
Monge-Ampère / Benamou-Brenier, $d = \|\cdot\|_2^2$.
[Excerpt, performance study: the network simplex behaves in $O(n^2)$ in this context (thousands of particles), a major gain over the transport simplex, also relevant for EMD-based image retrieval and color transfer [Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003]; fixed-point precision speeds it up further, with a less accurate final cost but an unchanged particle pairing. Figure 6: log-log running times of network simplex (fixed point / double precision) vs. transport simplex, against $y = \alpha x^2$ and $y = \beta x^3$. Figure 7: synthetic 2-D interpolations for $\alpha = 1/4, 1/2, 3/4$.]

Slide 45

Slide 45 text

Algorithms
Linear programming: $\mu = \sum_{i=1}^{N_1} p_i\,\delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} q_j\,\delta_{y_j}$ ~ $O(N^3)$.
Hungarian/Auction: $\mu = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$, $\nu = \frac{1}{N}\sum_{j=1}^N \delta_{y_j}$.
Monge-Ampère / Benamou-Brenier, $d = \|\cdot\|_2^2$.
Semi-discrete: Laguerre cells, $d = \|\cdot\|_2^2$ [Levy,'15].

Slide 46

Slide 46 text

Algorithms
Linear programming: $\mu = \sum_{i=1}^{N_1} p_i\,\delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} q_j\,\delta_{y_j}$ ~ $O(N^3)$.
Hungarian/Auction: $\mu = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$, $\nu = \frac{1}{N}\sum_{j=1}^N \delta_{y_j}$.
Monge-Ampère / Benamou-Brenier, $d = \|\cdot\|_2^2$.
Semi-discrete: Laguerre cells, $d = \|\cdot\|_2^2$ [Levy,'15].

Slide 47

Slide 47 text

Algorithms
Linear programming: $\mu = \sum_{i=1}^{N_1} p_i\,\delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} q_j\,\delta_{y_j}$ ~ $O(N^3)$.
Hungarian/Auction: $\mu = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$, $\nu = \frac{1}{N}\sum_{j=1}^N \delta_{y_j}$.
Monge-Ampère / Benamou-Brenier, $d = \|\cdot\|_2^2$.
Semi-discrete: Laguerre cells, $d = \|\cdot\|_2^2$ [Levy,'15].
→ Need for fast approximate algorithms for generic $c$.

Slide 48

Slide 48 text

C-Transform
$\forall y \in Y$, $f^c(y) \stackrel{\text{def.}}{=} \inf_{x\in X} c(x,y) - f(x)$; $\forall x \in X$, $g^{\bar c}(x) \stackrel{\text{def.}}{=} \inf_{y\in Y} c(x,y) - g(y)$, where $\bar c(y,x) \stackrel{\text{def.}}{=} c(x,y)$.
One checks that $f^c \in \operatorname{argmax}_g \mathcal{E}(f,g)$ and $g^{\bar c} \in \operatorname{argmax}_f \mathcal{E}(f,g)$: the map $(f,g) \mapsto (g^{\bar c}, f^c)$ replaces potentials by "better" ones (improving the dual objective $\mathcal{E}$). Functions of the form $f^c$ and $g^{\bar c}$ are called c-concave and $\bar c$-concave; for $c(x,y) = \langle x, y\rangle$ on $X = Y = \mathbb{R}^d$ this coincides with the usual notion of concavity, and for $c(x,y) = \|x-y\|^p$ the c-transform is an inf-convolution (a "Hopf-Lax formula").
Unconstrained dual over a single potential:
$\mathcal{L}_c(\alpha,\beta) = \max_{f\in\mathcal{C}(X)} \int_X f(x)\,d\alpha(x) + \int_Y f^c(y)\,d\beta(y) = \max_{g\in\mathcal{C}(Y)} \int_X g^{\bar c}(x)\,d\alpha(x) + \int_Y g(y)\,d\beta(y)$.
Semi-discrete case, $\beta = \sum_j b_j \delta_{y_j}$: $\forall g \in \mathbb{R}^m$, $\forall x \in X$, $g^{\bar c}(x) \stackrel{\text{def.}}{=} \min_j c(x, y_j) - g_j$.
[Figure 5.1: semi-discrete $\bar c$-transforms $g^{\bar c}$ in 1-D, for ground costs $|x-y|^p$ with $p = 1/2, 1, 3/2, 2$.]

Slide 49

Slide 49 text

Semi-discrete Descent Algorithm
[Figure 5.2: iterations $\ell = 1, 3, 50, 100$ of the semi-discrete OT algorithm (a simple gradient descent) and the resulting matching; the red points are the support $(y_j)_j$ of the discrete measure $\beta$, the continuous measure $\alpha$ is uniform on a square, and the blue cells display the Laguerre partition $(\mathbb{L}_{g^{(\ell)}}(y_j))_j$ for the dual potential $g^{(\ell)}$ at iteration $\ell$.]

Slide 50

Slide 50 text

Semi-discrete Stochastic Descent
Stochastic gradient descent for semi-discrete optimal transport: convergence of $\log(E(g^{(\star)}) - E(g^{(\ell)}))$ along iterations $\ell$, and the corresponding Laguerre cells. https://arxiv.org/abs/1605.08527
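A compact sketch of the averaged stochastic ascent on the semi-discrete dual, in the spirit of the referenced paper (step-size schedule, cost choice, and names are mine): for a sample $x \sim \alpha$, the stochastic gradient of $E(g) = \mathbb{E}_x[\min_j c(x,y_j) - g_j] + \langle b, g\rangle$ is $b - e_{j^\star(x)}$, where $j^\star(x)$ indexes the Laguerre cell containing $x$.

```python
import numpy as np

def semidiscrete_sgd(sample_alpha, y, b, n_iter=20000, tau0=1.0):
    """Averaged SGD on the semi-discrete dual, for c(x,y) = ||x-y||^2.
    sample_alpha() draws one point from the continuous measure alpha;
    (y, b) are the support and weights of the discrete measure."""
    g = np.zeros(len(b))
    g_avg = np.zeros(len(b))
    for it in range(1, n_iter + 1):
        x = sample_alpha()
        j_star = np.argmin(((y - x) ** 2).sum(-1) - g)  # Laguerre cell of x
        grad = b.copy()
        grad[j_star] -= 1.0
        g += (tau0 / np.sqrt(it)) * grad   # ascent on the concave dual
        g_avg += (g - g_avg) / it          # Polyak-Ruppert averaging
    return g_avg

# Example: alpha uniform on [0,1]^2, beta supported on 4 random points.
rng = np.random.default_rng(0)
y = rng.random((4, 2))
print(semidiscrete_sgd(lambda: rng.random(2), y, np.full(4, 1 / 4)))
```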

Slide 51

Slide 51 text

Entropic Regularization
The simplex of histograms with $N$ bins is $\Sigma_N \stackrel{\text{def.}}{=} \{p \in \mathbb{R}_+^N \;;\; \sum_i p_i = 1\}$.
Entropy of $T \in \mathbb{R}_+^{N\times N}$: $H(T) \stackrel{\text{def.}}{=} -\sum_{i,j=1}^N T_{i,j}(\log(T_{i,j}) - 1)$.
Couplings between histograms $p \in \Sigma_{N_1}$ and $q \in \Sigma_{N_2}$: $\mathcal{C}_{p,q} \stackrel{\text{def.}}{=} \{T \in (\mathbb{R}_+)^{N_1\times N_2} \;;\; T\mathbf{1}_{N_2} = p,\ T^\top\mathbf{1}_{N_1} = q\}$, where $\mathbf{1}_N \stackrel{\text{def.}}{=} (1,\dots,1)^\top \in \mathbb{R}^N$.
Def. Regularized OT [Cuturi NIPS'13]: $\min_{T\in\mathcal{C}_{p,q}} \langle C, T\rangle - \varepsilon H(T)$.

Slide 52

Slide 52 text

Entropic Regularization
The simplex of histograms with $N$ bins is $\Sigma_N \stackrel{\text{def.}}{=} \{p \in \mathbb{R}_+^N \;;\; \sum_i p_i = 1\}$.
Entropy of $T \in \mathbb{R}_+^{N\times N}$: $H(T) \stackrel{\text{def.}}{=} -\sum_{i,j=1}^N T_{i,j}(\log(T_{i,j}) - 1)$.
Couplings between histograms $p \in \Sigma_{N_1}$ and $q \in \Sigma_{N_2}$: $\mathcal{C}_{p,q} \stackrel{\text{def.}}{=} \{T \in (\mathbb{R}_+)^{N_1\times N_2} \;;\; T\mathbf{1}_{N_2} = p,\ T^\top\mathbf{1}_{N_1} = q\}$, where $\mathbf{1}_N \stackrel{\text{def.}}{=} (1,\dots,1)^\top \in \mathbb{R}^N$.
Def. Regularized OT [Cuturi NIPS'13]: $\min_{T\in\mathcal{C}_{p,q}} \langle C, T\rangle - \varepsilon H(T)$.
Regularization impact on the solution: [Figure: optimal coupling $T_\varepsilon$ for the same cost $c$ and increasing $\varepsilon$.]

Slide 53

Slide 53 text

Overview • Measures and Histograms • From Monge to Kantorovitch Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

Slide 54

Slide 54 text

Sinkhorn’s Algorithm

Slide 55

Slide 55 text

Sinkhorn’s Algorithm

Slide 56

Slide 56 text

Sinkhorn's Algorithm
Only matrix/vector multiplications. → Parallelizable. → Streams well on GPU.

Slide 57

Slide 57 text

Sinkhorn's Algorithm
Only matrix/vector multiplications. → Parallelizable. → Streams well on GPU.
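A minimal numpy sketch of these iterations (iteration count and the example are mine; for very small $\varepsilon$ one should iterate in the log domain to avoid underflow):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iter=1000):
    """Sinkhorn/IPFP: alternately rescale K = exp(-C/eps) so that the
    coupling pi = diag(a) K diag(b) matches the two marginals."""
    K = np.exp(-C / eps)
    b = np.ones_like(nu)
    for _ in range(n_iter):
        a = mu / (K @ b)     # enforce first marginal
        b = nu / (K.T @ a)   # enforce second marginal
    return a[:, None] * K * b[None, :]

# Example: two histograms on a 1-D grid, squared-distance cost.
n = 50
t = np.linspace(0, 1, n)
C = (t[:, None] - t[None, :]) ** 2
mu = np.exp(-((t - 0.3) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((t - 0.7) ** 2) / 0.02); nu /= nu.sum()
pi = sinkhorn(mu, nu, C, eps=1e-2)
print(np.abs(pi.sum(axis=1) - mu).max())  # first marginal is matched
```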

Slide 58

Slide 58 text

Optimal Transport on Surfaces
Triangulated mesh $M$. Geodesic distance $d_M$.

Slide 59

Slide 59 text

Optimal Transport on Surfaces
Triangulated mesh $M$. Geodesic distance $d_M$.
Ground cost: $c(x,y) = d_M(x,y)^\alpha$. [Figure: level sets of $d(x_i, \cdot)$.]

Slide 60

Slide 60 text

Optimal Transport on Surfaces
Triangulated mesh $M$. Geodesic distance $d_M$.
Ground cost: $c(x,y) = d_M(x,y)^\alpha$. [Figure: level sets of $d(x_i, \cdot)$.]
Computing $c$ (Fast-Marching): $N^2\log(N)$ → too costly.

Slide 61

Slide 61 text

Entropic Transport on Surfaces
Heat equation on $M$: $\partial_t u_t(x,\cdot) = \Delta_M u_t(x,\cdot)$, $u_{t=0}(x,\cdot) = \delta_x$.

Slide 62

Slide 62 text

Entropic Transport on Surfaces
Heat equation on $M$: $\partial_t u_t(x,\cdot) = \Delta_M u_t(x,\cdot)$, $u_{t=0}(x,\cdot) = \delta_x$.
Theorem [Varadhan]: $-\varepsilon\log(u_\varepsilon) \xrightarrow{\varepsilon\to 0} d_M^2$.

Slide 63

Slide 63 text

Entropic Transport on Surfaces
Heat equation on $M$: $\partial_t u_t(x,\cdot) = \Delta_M u_t(x,\cdot)$, $u_{t=0}(x,\cdot) = \delta_x$.
Theorem [Varadhan]: $-\varepsilon\log(u_\varepsilon) \xrightarrow{\varepsilon\to 0} d_M^2$.
Sinkhorn kernel: $K \stackrel{\text{def.}}{=} e^{-d_M^2/\varepsilon} \approx u_\varepsilon \approx \big(\mathrm{Id} - \frac{\varepsilon}{\ell}\Delta_M\big)^{-\ell}$ ($\ell$ implicit time steps of the heat equation).

Slide 64

Slide 64 text

MRI Data Processing [with A. Gramfort]
Ground cost $c = d_M$: geodesic distance on the cortical surface $M$. [Figure: $L^2$ barycenter vs. $W_2^2$ barycenter.]

Slide 65

Slide 65 text

Regularization for General Measures
$\pi_\varepsilon \stackrel{\text{def.}}{=} \operatorname{argmin}_\pi \{\langle d^p, \pi\rangle + \varepsilon\,\mathrm{KL}(\pi|\pi_0) \;;\; \pi\in\Pi(\mu,\nu)\}$
Schrödinger's problem: $\pi_\varepsilon = \operatorname{argmin}_{\pi\in\Pi(\mu,\nu)} \mathrm{KL}(\pi|K)$, with $K(x,y) \stackrel{\text{def.}}{=} e^{-d^p(x,y)/\varepsilon}\,\pi_0(x,y)$.
Landmark computational paper: [Cuturi 2013].

Slide 66

Slide 66 text

Regularization for General Measures
$\pi_\varepsilon \stackrel{\text{def.}}{=} \operatorname{argmin}_\pi \{\langle d^p, \pi\rangle + \varepsilon\,\mathrm{KL}(\pi|\pi_0) \;;\; \pi\in\Pi(\mu,\nu)\}$
Schrödinger's problem: $\pi_\varepsilon = \operatorname{argmin}_{\pi\in\Pi(\mu,\nu)} \mathrm{KL}(\pi|K)$, with $K(x,y) \stackrel{\text{def.}}{=} e^{-d^p(x,y)/\varepsilon}\,\pi_0(x,y)$.
Landmark computational paper: [Cuturi 2013].
Proposition [Carlier, Duval, Peyré, Schmitzer 2015]:
$\pi_\varepsilon \xrightarrow{\varepsilon\to 0} \operatorname{argmin}_{\pi\in\Pi(\mu,\nu)} \langle d^p, \pi\rangle$ and $\pi_\varepsilon \xrightarrow{\varepsilon\to+\infty} \mu\otimes\nu$.
[Figure: $\pi_\varepsilon$ for increasing $\varepsilon$.]

Slide 67

Slide 67 text

Back to Sinkhorn's Algorithm
Optimal transport problem: $f_1 = \iota_{\{\mu\}}$, $f_2 = \iota_{\{\nu\}}$, so that $\mathrm{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(\tilde\mu) = \mu$ and $\mathrm{Prox}^{\mathrm{KL}}_{f_2/\varepsilon}(\tilde\nu) = \nu$.

Slide 68

Slide 68 text

Back to Sinkhorn's Algorithm
Optimal transport problem: $f_1 = \iota_{\{\mu\}}$, $f_2 = \iota_{\{\nu\}}$, so that $\mathrm{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(\tilde\mu) = \mu$ and $\mathrm{Prox}^{\mathrm{KL}}_{f_2/\varepsilon}(\tilde\nu) = \nu$.
Sinkhorn/IPFP algorithm [Sinkhorn 1967][Deming, Stephan 1940]:
$a^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{\mu}{K b^{(\ell)}}$ and $b^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{\nu}{K^* a^{(\ell+1)}}$.

Slide 69

Slide 69 text

Back to Sinkhorn's Algorithm
Optimal transport problem: $f_1 = \iota_{\{\mu\}}$, $f_2 = \iota_{\{\nu\}}$, so that $\mathrm{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(\tilde\mu) = \mu$ and $\mathrm{Prox}^{\mathrm{KL}}_{f_2/\varepsilon}(\tilde\nu) = \nu$.
Sinkhorn/IPFP algorithm [Sinkhorn 1967][Deming, Stephan 1940]:
$a^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{\mu}{K b^{(\ell)}}$ and $b^{(\ell+1)} \stackrel{\text{def.}}{=} \frac{\nu}{K^* a^{(\ell+1)}}$.
Proposition [Franklin, Lorenz 1989]: with $\pi^{(\ell)} \stackrel{\text{def.}}{=} \mathrm{diag}(a^{(\ell)})\,K\,\mathrm{diag}(b^{(\ell)})$,
$\|\log(\pi^{(\ell)}) - \log(\pi^\star)\|_\infty = O((1-\lambda)^\ell)$, the rate degrading as $\varepsilon \to 0$ (roughly $\lambda \sim e^{-c/\varepsilon}$). Local rate: [Knight 2008].
[Figure: $\log(\|\log(\pi^{(\ell)}) - \log(\pi^\star)\|_\infty)$ along iterations $\ell$, and $\pi^{(\ell)}$ for varying $\varepsilon$.]

Slide 70

Slide 70 text

Semi-discrete OT and Entropy
Semi-discrete optimal transport is a multiclass SVM; its Sinkhorn version is a logistic regression. It decomposes the space into entropic Laguerre diagrams (Laguerre cells vs. "Sinkhorn" Laguerre cells).
[Figure 5.3: entropic semi-discrete $\bar c$-transforms $g^{\bar c,\varepsilon}$, in 1-D for $c(x,y) = |x-y|$ and in 2-D for $c(x,y) = \|x-y\|$, for $\varepsilon \in \{0, 0.01, 0.1, 0.3\}$; red points at $(y_j, -g_j)_j$, resp. $y_j \in \mathbb{R}^2$ with size proportional to $g_j$; black curves are level sets of $g^{\bar c,\varepsilon}$, colors the smoothed indicator functions $\chi_j^\varepsilon$ of the Laguerre cells.]

Slide 71

Slide 71 text

Overview • Measures and Histograms • From Monge to Kantorovitch Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

Slide 72

Slide 72 text

Wasserstein Barycenters
Barycenters of measures $(\mu_k)_k$, with weights $\sum_k \lambda_k = 1$:
$\mu^\star \in \operatorname{argmin}_\mu \sum_k \lambda_k\, W_2^2(\mu_k, \mu)$

Slide 73

Slide 73 text

Wasserstein Barycenters
Barycenters of measures $(\mu_k)_k$, with weights $\sum_k \lambda_k = 1$:
$\mu^\star \in \operatorname{argmin}_\mu \sum_k \lambda_k\, W_2^2(\mu_k, \mu)$
[Figure: $\mu$ between $\mu_1, \mu_2, \mu_3$, at distances $W_2(\mu_k, \mu)$.]
Generalizes the Euclidean barycenter: if $\mu_k = \delta_{x_k}$ then $\mu^\star = \delta_{\sum_k \lambda_k x_k}$.

Slide 74

Slide 74 text

Wasserstein Barycenters
Barycenters of measures $(\mu_k)_k$, with weights $\sum_k \lambda_k = 1$:
$\mu^\star \in \operatorname{argmin}_\mu \sum_k \lambda_k\, W_2^2(\mu_k, \mu)$
[Figure: $\mu$ between $\mu_1, \mu_2, \mu_3$, at distances $W_2(\mu_k, \mu)$.]
Generalizes the Euclidean barycenter: if $\mu_k = \delta_{x_k}$ then $\mu^\star = \delta_{\sum_k \lambda_k x_k}$.

Slide 75

Slide 75 text

Wasserstein Barycenters
Barycenters of measures $(\mu_k)_k$, with weights $\sum_k \lambda_k = 1$:
$\mu^\star \in \operatorname{argmin}_\mu \sum_k \lambda_k\, W_2^2(\mu_k, \mu)$
Generalizes the Euclidean barycenter: if $\mu_k = \delta_{x_k}$ then $\mu^\star = \delta_{\sum_k \lambda_k x_k}$.
Mc Cann's displacement interpolation ($\mu_1 \leftrightarrow \mu_2$).

Slide 76

Slide 76 text

Wasserstein Barycenters
Barycenters of measures $(\mu_k)_k$, with weights $\sum_k \lambda_k = 1$:
$\mu^\star \in \operatorname{argmin}_\mu \sum_k \lambda_k\, W_2^2(\mu_k, \mu)$
Generalizes the Euclidean barycenter: if $\mu_k = \delta_{x_k}$ then $\mu^\star = \delta_{\sum_k \lambda_k x_k}$.
Mc Cann's displacement interpolation ($\mu_1 \leftrightarrow \mu_2$).
[Figure: displacement interpolations between $\mu_1, \mu_2, \mu_3$.]

Slide 77

Slide 77 text

Wasserstein Barycenters
Barycenters of measures $(\mu_k)_k$, with weights $\sum_k \lambda_k = 1$:
$\mu^\star \in \operatorname{argmin}_\mu \sum_k \lambda_k\, W_2^2(\mu_k, \mu)$
Generalizes the Euclidean barycenter: if $\mu_k = \delta_{x_k}$ then $\mu^\star = \delta_{\sum_k \lambda_k x_k}$.
Mc Cann's displacement interpolation ($\mu_1 \leftrightarrow \mu_2$).
Theorem [Agueh, Carlier, 2010] (for $c(x,y) = \|x-y\|^2$): if $\mu_1$ does not vanish on small sets, $\mu^\star$ exists and is unique.

Slide 78

Slide 78 text

Displacement Interpolation

Slide 79

Slide 79 text

Displacement Interpolation
[Figure 7.2: displacement interpolation $\alpha_t$ of discrete measures for $t = 0, \frac{1}{5}, \frac{2}{5}, \frac{3}{5}, \frac{4}{5}, 1$; top: point clouds (empirical measures $(\alpha_0, \alpha_1)$ with the same number of points); bottom: same but with weights.]

Slide 80

Slide 80 text

Wasserstein Barycenters
[Figure: barycenters for weights $\lambda \in \Sigma_3$; Wasserstein mean vs. $L^2$ mean.]

Slide 81

Slide 81 text

Wasserstein Barycenters
[Figure: barycenters for weights $\lambda \in \Sigma_3$; Wasserstein mean vs. $L^2$ mean.]

Slide 82

Slide 82 text

Regularized Barycenters
$\min_{(\pi_k)_k,\ \mu} \big\{ \sum_k \lambda_k\big(\langle c, \pi_k\rangle + \varepsilon\,\mathrm{KL}(\pi_k|\pi_{0,k})\big) \;;\; \forall k,\ \pi_k \in \Pi(\mu_k, \mu) \big\}$

Slide 83

Slide 83 text

Regularized Barycenters
$\min_{(\pi_k)_k,\ \mu} \big\{ \sum_k \lambda_k\big(\langle c, \pi_k\rangle + \varepsilon\,\mathrm{KL}(\pi_k|\pi_{0,k})\big) \;;\; \forall k,\ \pi_k \in \Pi(\mu_k, \mu) \big\}$
→ Need to fix a discretization grid for $\mu$, i.e. choose $(\pi_{0,k})_k$.

Slide 84

Slide 84 text

Regularized Barycenters
$\min_{(\pi_k)_k,\ \mu} \big\{ \sum_k \lambda_k\big(\langle c, \pi_k\rangle + \varepsilon\,\mathrm{KL}(\pi_k|\pi_{0,k})\big) \;;\; \forall k,\ \pi_k \in \Pi(\mu_k, \mu) \big\}$
→ Need to fix a discretization grid for $\mu$, i.e. choose $(\pi_{0,k})_k$.
→ Sinkhorn-like algorithm [Benamou, Carlier, Cuturi, Nenna, Peyré, 2015].

Slide 85

Slide 85 text

Regularized Barycenters
$\min_{(\pi_k)_k,\ \mu} \big\{ \sum_k \lambda_k\big(\langle c, \pi_k\rangle + \varepsilon\,\mathrm{KL}(\pi_k|\pi_{0,k})\big) \;;\; \forall k,\ \pi_k \in \Pi(\mu_k, \mu) \big\}$
→ Need to fix a discretization grid for $\mu$, i.e. choose $(\pi_{0,k})_k$.
→ Sinkhorn-like algorithm [Benamou, Carlier, Cuturi, Nenna, Peyré, 2015].
[Solomon et al, SIGGRAPH 2015]
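A sketch of the Sinkhorn-like barycenter iterations in the style of the iterative Bregman projections of the cited paper, for inputs living on a common fixed grid; the geometric-mean update is the key step, while initialization and iteration counts are my choices:

```python
import numpy as np

def barycenter_sinkhorn(mus, lambdas, C, eps, n_iter=500):
    """Entropic W2 barycenter of the histograms stored as columns of
    mus (shape (n, K)), with weights lambdas, on a grid with cost C."""
    Kern = np.exp(-C / eps)
    b = np.ones_like(mus)
    for _ in range(n_iter):
        a = mus / (Kern @ b)            # one scaling per input measure
        Ka = Kern.T @ a
        mu = np.exp((lambdas[None, :] * np.log(Ka)).sum(axis=1))
        b = mu[:, None] / Ka            # geometric mean couples the problems
    return mu

# Example: barycenter of two bumps on a 1-D grid.
n = 100
t = np.linspace(0, 1, n)
C = (t[:, None] - t[None, :]) ** 2
f = lambda c: np.exp(-((t - c) ** 2) / 0.005)
mus = np.stack([f(0.2) / f(0.2).sum(), f(0.8) / f(0.8).sum()], axis=1)
mu = barycenter_sinkhorn(mus, np.array([0.5, 0.5]), C, eps=2e-3)
print(t[np.argmax(mu)])  # near 0.5: the mass moves instead of blurring
```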

Slide 86

Slide 86 text

Barycenters of 2D Shapes

Slide 87

Slide 87 text

Barycenters of 3D Shapes

Slide 88

Slide 88 text

Barycenter on a Surface
[Figure: input measures $\mu_1, \mu_2$ on a surface; colormap from 0 to 1.]

Slide 89

Slide 89 text

Barycenter on a Surface
[Figure: input measures $\mu_1, \dots, \mu_6$ on a surface; colormap from 0 to 1.]

Slide 90

Slide 90 text

Barycenter on a Surface
Barycenter of $\mu_1, \dots, \mu_6$ with weights $\lambda = (1, \dots, 1)/6$. [Figure; colormap from 0 to 1.]

Slide 91

Slide 91 text

Color Transfer
Input images: $(f, g)$ (chrominance components).
Input measures: $\mu(A) = U(f^{-1}(A))$, $\nu(A) = U(g^{-1}(A))$, i.e. push-forwards $\mu = f_\sharp U$, $\nu = g_\sharp U$ of the uniform measure $U$ over the pixel domain.

Slide 92

Slide 92 text

Color Transfer
Input images: $(f, g)$ (chrominance components).
Input measures: $\mu(A) = U(f^{-1}(A))$, $\nu(A) = U(g^{-1}(A))$, i.e. push-forwards $\mu = f_\sharp U$, $\nu = g_\sharp U$ of the uniform measure $U$ over the pixel domain.

Slide 93

Slide 93 text

Color Transfer
Input images: $(f, g)$ (chrominance components).
Input measures: $\mu(A) = U(f^{-1}(A))$, $\nu(A) = U(g^{-1}(A))$, i.e. push-forwards $\mu = f_\sharp U$, $\nu = g_\sharp U$ of the uniform measure $U$ over the pixel domain.
[Figure: transport map $T$; color-transferred images $T \circ f$ and $\tilde T \circ g$.]

Slide 94

Slide 94 text

Topic Models [Rolet’16]

Slide 95

Slide 95 text

Overview • Measures and Histograms • From Monge to Kantorovitch Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

Slide 96

Slide 96 text

Unbalanced Transport
$(\xi, \mu) \in \mathcal{M}_+(X)^2$, $\mathrm{KL}(\xi|\mu) \stackrel{\text{def.}}{=} \int_X \log\big(\frac{d\xi}{d\mu}\big)\,d\xi + \int_X (d\mu - d\xi)$.
$\mathrm{WF}_c(\mu,\nu) \stackrel{\text{def.}}{=} \min_\pi\ \langle c, \pi\rangle + \mathrm{KL}(P_{1\sharp}\pi|\mu) + \mathrm{KL}(P_{2\sharp}\pi|\nu)$ [Liero, Mielke, Savaré 2015]

Slide 97

Slide 97 text

Unbalanced Transport
$(\xi, \mu) \in \mathcal{M}_+(X)^2$, $\mathrm{KL}(\xi|\mu) \stackrel{\text{def.}}{=} \int_X \log\big(\frac{d\xi}{d\mu}\big)\,d\xi + \int_X (d\mu - d\xi)$.
$\mathrm{WF}_c(\mu,\nu) \stackrel{\text{def.}}{=} \min_\pi\ \langle c, \pi\rangle + \mathrm{KL}(P_{1\sharp}\pi|\mu) + \mathrm{KL}(P_{2\sharp}\pi|\nu)$ [Liero, Mielke, Savaré 2015]
Proposition [Liero, Mielke, Savaré 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]: if $c(x,y) = -\log\big(\cos^2(\min(d(x,y), \frac{\pi}{2}))\big)$ then $\mathrm{WF}_c^{1/2}$ is a distance on $\mathcal{M}_+(X)$.

Slide 98

Slide 98 text

Unbalanced Transport
$(\xi, \mu) \in \mathcal{M}_+(X)^2$, $\mathrm{KL}(\xi|\mu) \stackrel{\text{def.}}{=} \int_X \log\big(\frac{d\xi}{d\mu}\big)\,d\xi + \int_X (d\mu - d\xi)$.
$\mathrm{WF}_c(\mu,\nu) \stackrel{\text{def.}}{=} \min_\pi\ \langle c, \pi\rangle + \mathrm{KL}(P_{1\sharp}\pi|\mu) + \mathrm{KL}(P_{2\sharp}\pi|\nu)$ [Liero, Mielke, Savaré 2015]
Proposition [Liero, Mielke, Savaré 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]: if $c(x,y) = -\log\big(\cos^2(\min(d(x,y), \frac{\pi}{2}))\big)$ then $\mathrm{WF}_c^{1/2}$ is a distance on $\mathcal{M}_+(X)$.
→ "Dynamic" Benamou-Brenier formulation [Liero, Mielke, Savaré 2015][Kondratyev, Monsaingeon, Vorotnikov, 2015][Chizat, Schmitzer, Peyré, Vialard 2015].
[Figure: balanced OT vs. unbalanced OT interpolations.]

Slide 99

Slide 99 text

Implicit Euler Stepping
Metric space $(X, d)$; minimize $F(x)$ over $X$.
Implicit Euler step: $x_{k+1} \stackrel{\text{def.}}{=} \operatorname{argmin}_{x\in X}\ d(x_k, x)^2 + \tau F(x)$.
[Figure: iterates $x_k, x_{k+1}, x_{k+2}, x_{k+3}$ within balls $\{x \;;\; d(x_k, x) \sim \tau\}$, for $F(x) = \|x\|^2$ on $(X = \mathbb{R}^2, \|\cdot\|_p)$ with $p = 1, 2, +\infty$.]

Slide 100

Slide 100 text

Implicit vs. Explicit Stepping
Metric space $(X, d)$; minimize $F(x)$ over $X$.
Implicit: $x_{k+1} = \operatorname{argmin}_{x\in X}\ d(x_k, x)^2 + \tau F(x)$.
Explicit: $x_{k+1} = \operatorname{argmin}_{x\in X}\ d(x_k, x)^2 + \tau\langle\nabla F(x_k), x\rangle$.
[Figure: trajectories for $F(x) = \|x\|^2$ on $(X = \mathbb{R}^2, \|\cdot\|_p)$, $p = 1, 2, +\infty$.]

Slide 101

Slide 101 text

Wasserstein Gradient Flows
Implicit Euler step [Jordan, Kinderlehrer, Otto 1998]:
$\mu_{t+1} = \mathrm{Prox}^W_{\tau f}(\mu_t) \stackrel{\text{def.}}{=} \operatorname{argmin}_{\mu\in\mathcal{M}_+(X)}\ W_2^2(\mu_t, \mu) + \tau f(\mu)$

Slide 102

Slide 102 text

Wasserstein Gradient Flows
Implicit Euler step [Jordan, Kinderlehrer, Otto 1998]:
$\mu_{t+1} = \mathrm{Prox}^W_{\tau f}(\mu_t) \stackrel{\text{def.}}{=} \operatorname{argmin}_{\mu\in\mathcal{M}_+(X)}\ W_2^2(\mu_t, \mu) + \tau f(\mu)$
Formal limit $\tau \to 0$: $\partial_t\mu = \mathrm{div}\big(\mu\,\nabla(f'(\mu))\big)$.

Slide 103

Slide 103 text

Wasserstein Gradient Flows
Implicit Euler step [Jordan, Kinderlehrer, Otto 1998]:
$\mu_{t+1} = \mathrm{Prox}^W_{\tau f}(\mu_t) \stackrel{\text{def.}}{=} \operatorname{argmin}_{\mu\in\mathcal{M}_+(X)}\ W_2^2(\mu_t, \mu) + \tau f(\mu)$
Formal limit $\tau \to 0$: $\partial_t\mu = \mathrm{div}\big(\mu\,\nabla(f'(\mu))\big)$.
$f(\mu) = \int \log(\frac{d\mu}{dx})\,d\mu \;\Rightarrow\; \partial_t\mu = \Delta\mu$ (heat diffusion).

Slide 104

Slide 104 text

Wasserstein Gradient Flows
Implicit Euler step [Jordan, Kinderlehrer, Otto 1998]:
$\mu_{t+1} = \mathrm{Prox}^W_{\tau f}(\mu_t) \stackrel{\text{def.}}{=} \operatorname{argmin}_{\mu\in\mathcal{M}_+(X)}\ W_2^2(\mu_t, \mu) + \tau f(\mu)$
Formal limit $\tau \to 0$: $\partial_t\mu = \mathrm{div}\big(\mu\,\nabla(f'(\mu))\big)$.
$f(\mu) = \int \log(\frac{d\mu}{dx})\,d\mu \;\Rightarrow\; \partial_t\mu = \Delta\mu$ (heat diffusion).
$f(\mu) = \int w\,d\mu \;\Rightarrow\; \partial_t\mu = \mathrm{div}(\mu\nabla w)$ (advection). [Figure: evolutions $\mu_t$ and the field $\nabla w$.]

Slide 105

Slide 105 text

Wasserstein Gradient Flows
Implicit Euler step [Jordan, Kinderlehrer, Otto 1998]:
$\mu_{t+1} = \mathrm{Prox}^W_{\tau f}(\mu_t) \stackrel{\text{def.}}{=} \operatorname{argmin}_{\mu\in\mathcal{M}_+(X)}\ W_2^2(\mu_t, \mu) + \tau f(\mu)$
Formal limit $\tau \to 0$: $\partial_t\mu = \mathrm{div}\big(\mu\,\nabla(f'(\mu))\big)$.
$f(\mu) = \int \log(\frac{d\mu}{dx})\,d\mu \;\Rightarrow\; \partial_t\mu = \Delta\mu$ (heat diffusion).
$f(\mu) = \int w\,d\mu \;\Rightarrow\; \partial_t\mu = \mathrm{div}(\mu\nabla w)$ (advection). [Figure: evolutions $\mu_t$ and the field $\nabla w$.]
$f(\mu) = \frac{1}{m-1}\int (\frac{d\mu}{dx})^{m-1}\,d\mu \;\Rightarrow\; \partial_t\mu = \Delta\mu^m$ (non-linear diffusion).

Slide 106

Slide 106 text

Eulerian vs. Lagrangian Discretization

Slide 107

Slide 107 text

Lagrangian Discretization of Entropy
$H(\mu) \stackrel{\text{def.}}{=} \int \log\big(\frac{d\mu}{dx}(x)\big)\,d\mu(x)$; for the empirical measure $\hat\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ (with $\hat\mu_n \to \mu$ as $n\to+\infty$), the Lagrangian surrogate is $\hat H(\hat\mu_n) \stackrel{\text{def.}}{=} -\sum_i \log\big(\min_{j\neq i}\|x_i - x_j\|\big)$.
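A tiny numpy illustration of the nearest-neighbor quantity on this slide (sign convention, names, and the omission of dimensional constants are mine):

```python
import numpy as np

def nn_log_dist_sum(x):
    """Sum over points of the log nearest-neighbor distance, the
    Lagrangian surrogate for entropy used on the slide (up to sign
    and constants): denser clouds give smaller values."""
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    return np.log(d.min(axis=1)).sum()

rng = np.random.default_rng(0)
spread = rng.normal(0, 1.0, (200, 2))   # diffuse cloud
tight = rng.normal(0, 0.1, (200, 2))    # concentrated cloud
print(nn_log_dist_sum(spread), nn_log_dist_sum(tight))
```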

Slide 108

Slide 108 text

Lagrangian Discretization of Gradient Flows
$\min_\rho\ E(\rho) \stackrel{\text{def.}}{=} \int V(x)\rho(x)\,dx + \int \rho(x)\log(\rho(x))\,dx$, with $V(x) = \|x\|^2$.
Wasserstein flow of $E$: $\frac{d\rho_t}{dt} = \Delta\rho_t + \nabla\cdot(\nabla V\,\rho_t)$. [Figure: evolution along time.]

Slide 109

Slide 109 text

Generalized Entropic Regularization
Primal: $\min_\pi\ \langle d^p, \pi\rangle + f_1(P_{1\sharp}\pi) + f_2(P_{2\sharp}\pi) + \varepsilon\,\mathrm{KL}(\pi|\pi_0)$

Slide 110

Slide 110 text

Generalized Entropic Regularization
Primal: $\min_\pi\ \langle d^p, \pi\rangle + f_1(P_{1\sharp}\pi) + f_2(P_{2\sharp}\pi) + \varepsilon\,\mathrm{KL}(\pi|\pi_0)$
Dual: with $(a, b) \stackrel{\text{def.}}{=} (e^{u/\varepsilon}, e^{v/\varepsilon})$,
$\max_{u,v}\ -f_1^*(-u) - f_2^*(-v) - \varepsilon\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$, and $\pi(x,y) = a(x)K(x,y)b(y)$.

Slide 111

Slide 111 text

Generalized Entropic Regularization
Primal: $\min_\pi\ \langle d^p, \pi\rangle + f_1(P_{1\sharp}\pi) + f_2(P_{2\sharp}\pi) + \varepsilon\,\mathrm{KL}(\pi|\pi_0)$
Dual: with $(a, b) \stackrel{\text{def.}}{=} (e^{u/\varepsilon}, e^{v/\varepsilon})$,
$\max_{u,v}\ -f_1^*(-u) - f_2^*(-v) - \varepsilon\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$, and $\pi(x,y) = a(x)K(x,y)b(y)$.
Block coordinates relaxation: $(I_u)$: $\max_u -f_1^*(-u) - \varepsilon\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$; $(I_v)$: $\max_v -f_2^*(-v) - \varepsilon\langle e^{v/\varepsilon}, K^* e^{u/\varepsilon}\rangle$.

Slide 112

Slide 112 text

Generalized Entropic Regularization
Primal: $\min_\pi\ \langle d^p, \pi\rangle + f_1(P_{1\sharp}\pi) + f_2(P_{2\sharp}\pi) + \varepsilon\,\mathrm{KL}(\pi|\pi_0)$
Dual: with $(a, b) \stackrel{\text{def.}}{=} (e^{u/\varepsilon}, e^{v/\varepsilon})$,
$\max_{u,v}\ -f_1^*(-u) - f_2^*(-v) - \varepsilon\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$, and $\pi(x,y) = a(x)K(x,y)b(y)$.
Block coordinates relaxation: $(I_u)$: $\max_u -f_1^*(-u) - \varepsilon\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$; $(I_v)$: $\max_v -f_2^*(-v) - \varepsilon\langle e^{v/\varepsilon}, K^* e^{u/\varepsilon}\rangle$.
Proposition: with $\mathrm{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(\mu) \stackrel{\text{def.}}{=} \operatorname{argmin}_\nu f_1(\nu) + \varepsilon\,\mathrm{KL}(\nu|\mu)$, the solutions of $(I_u)$ and $(I_v)$ read
$a = \frac{\mathrm{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(Kb)}{Kb}$ and $b = \frac{\mathrm{Prox}^{\mathrm{KL}}_{f_2/\varepsilon}(K^*a)}{K^*a}$.

Slide 113

Slide 113 text

Generalized Entropic Regularization
Primal: $\min_\pi\ \langle d^p, \pi\rangle + f_1(P_{1\sharp}\pi) + f_2(P_{2\sharp}\pi) + \varepsilon\,\mathrm{KL}(\pi|\pi_0)$
Dual: with $(a, b) \stackrel{\text{def.}}{=} (e^{u/\varepsilon}, e^{v/\varepsilon})$,
$\max_{u,v}\ -f_1^*(-u) - f_2^*(-v) - \varepsilon\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$, and $\pi(x,y) = a(x)K(x,y)b(y)$.
Block coordinates relaxation: $(I_u)$: $\max_u -f_1^*(-u) - \varepsilon\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$; $(I_v)$: $\max_v -f_2^*(-v) - \varepsilon\langle e^{v/\varepsilon}, K^* e^{u/\varepsilon}\rangle$.
Proposition: with $\mathrm{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(\mu) \stackrel{\text{def.}}{=} \operatorname{argmin}_\nu f_1(\nu) + \varepsilon\,\mathrm{KL}(\nu|\mu)$, the solutions of $(I_u)$ and $(I_v)$ read
$a = \frac{\mathrm{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(Kb)}{Kb}$ and $b = \frac{\mathrm{Prox}^{\mathrm{KL}}_{f_2/\varepsilon}(K^*a)}{K^*a}$.
→ Only matrix-vector multiplications. → Highly parallelizable. → On regular grids: only convolutions! Linear time iterations.
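A sketch of the resulting scaling algorithm, parameterized by the two KL proximal maps (names are mine; the unbalanced prox quoted in the comments is the standard closed form for a marginal penalty $\rho\,\mathrm{KL}(\cdot|\mu)$):

```python
import numpy as np

def generalized_sinkhorn(prox1, prox2, K, n_iter=500):
    """Scaling iterations for min <d^p,pi> + f1(P1#pi) + f2(P2#pi)
    + eps*KL(pi|pi0), with kernel K = exp(-d^p/eps) * pi0.
    prox_i(s) evaluates Prox^KL_{f_i/eps}(s)."""
    b = np.ones(K.shape[1])
    for _ in range(n_iter):
        Kb = K @ b
        a = prox1(Kb) / Kb
        Ka = K.T @ a
        b = prox2(Ka) / Ka
    return a[:, None] * K * b[None, :]

# Balanced OT: the prox of the indicator of {mu} just returns mu.
# Unbalanced OT with marginal penalty rho*KL(.|mu) has the closed form
#   prox(s) = mu**(rho/(rho+eps)) * s**(eps/(rho+eps)).
n = 50
t = np.linspace(0, 1, n)
K = np.exp(-((t[:, None] - t[None, :]) ** 2) / 1e-2)
mu = np.exp(-((t - 0.3) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((t - 0.7) ** 2) / 0.02); nu /= nu.sum()
pi = generalized_sinkhorn(lambda s: mu, lambda s: nu, K)
print(np.abs(pi.sum(axis=1) - mu).max())
```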

Slide 114

Slide 114 text

Gradient Flows: Crowd Motion
$\mu_{t+1} \stackrel{\text{def.}}{=} \operatorname{argmin}_\mu\ W_\alpha^\alpha(\mu_t, \mu) + \tau f(\mu)$
Congestion-inducing function: $f(\mu) = \iota_{[0,\kappa]}(\mu) + \langle w, \mu\rangle$ [Maury, Roudneff-Chupin, Santambrogio 2010]

Slide 115

Slide 115 text

Gradient Flows: Crowd Motion
$\mu_{t+1} \stackrel{\text{def.}}{=} \operatorname{argmin}_\mu\ W_\alpha^\alpha(\mu_t, \mu) + \tau f(\mu)$
Congestion-inducing function: $f(\mu) = \iota_{[0,\kappa]}(\mu) + \langle w, \mu\rangle$ [Maury, Roudneff-Chupin, Santambrogio 2010]
Proposition: $\mathrm{Prox}^{\mathrm{KL}}_{\frac{1}{\varepsilon}f}(\mu) = \min(e^{-w/\varepsilon}\mu,\ \kappa)$.

Slide 116

Slide 116 text

Gradient Flows: Crowd Motion
$\mu_{t+1} \stackrel{\text{def.}}{=} \operatorname{argmin}_\mu\ W_\alpha^\alpha(\mu_t, \mu) + \tau f(\mu)$
Congestion-inducing function: $f(\mu) = \iota_{[0,\kappa]}(\mu) + \langle w, \mu\rangle$ [Maury, Roudneff-Chupin, Santambrogio 2010]
Proposition: $\mathrm{Prox}^{\mathrm{KL}}_{\frac{1}{\varepsilon}f}(\mu) = \min(e^{-w/\varepsilon}\mu,\ \kappa)$.
[Figure: evolutions under $\nabla w$ for $\kappa = \|\mu_{t=0}\|_\infty$, $2\|\mu_{t=0}\|_\infty$, $4\|\mu_{t=0}\|_\infty$.]

Slide 117

Slide 117 text

Crowd Motion on a Surface
$X$ = triangulated mesh. [Figure: potential $w$ (displayed via $\cos(w)$); evolutions for $\kappa = \|\mu_{t=0}\|_\infty$ and $\kappa = 6\|\mu_{t=0}\|_\infty$.]

Slide 118

Slide 118 text

Crowd Motion on a Surface
$X$ = triangulated mesh. [Figure: potential $w$ (displayed via $\cos(w)$); evolutions for $\kappa = \|\mu_{t=0}\|_\infty$ and $\kappa = 6\|\mu_{t=0}\|_\infty$.]

Slide 119

Slide 119 text

Gradient Flows: Crowd Motion with Obstacles
$X$ = sub-domain of $\mathbb{R}^2$. [Figure: potential $w$ (displayed via $\cos(w)$); evolutions for $\kappa = \|\mu_{t=0}\|_\infty$, $2\|\mu_{t=0}\|_\infty$, $4\|\mu_{t=0}\|_\infty$, $6\|\mu_{t=0}\|_\infty$.]

Slide 120

Slide 120 text

Gradient Flows: Crowd Motion with Obstacles
$X$ = sub-domain of $\mathbb{R}^2$. [Figure: potential $w$ (displayed via $\cos(w)$); evolutions for $\kappa = \|\mu_{t=0}\|_\infty$, $2\|\mu_{t=0}\|_\infty$, $4\|\mu_{t=0}\|_\infty$, $6\|\mu_{t=0}\|_\infty$.]

Slide 121

Slide 121 text

Multiple-Density Gradient Flows
$(\mu_{1,t+1}, \mu_{2,t+1}) \stackrel{\text{def.}}{=} \operatorname{argmin}_{(\mu_1,\mu_2)}\ W_\alpha^\alpha(\mu_{1,t}, \mu_1) + W_\alpha^\alpha(\mu_{2,t}, \mu_2) + \tau f(\mu_1, \mu_2)$

Slide 122

Slide 122 text

Multiple-Density Gradient Flows
$(\mu_{1,t+1}, \mu_{2,t+1}) \stackrel{\text{def.}}{=} \operatorname{argmin}_{(\mu_1,\mu_2)}\ W_\alpha^\alpha(\mu_{1,t}, \mu_1) + W_\alpha^\alpha(\mu_{2,t}, \mu_2) + \tau f(\mu_1, \mu_2)$
Wasserstein attraction: $f(\mu_1,\mu_2) = W_\alpha^\alpha(\mu_1,\mu_2) + h_1(\mu_1) + h_2(\mu_2)$, with $h_i(\mu) = \langle w, \mu\rangle$. [Figure: evolutions under $\nabla w$.]

Slide 123

Slide 123 text

Multiple-Density Gradient Flows
$(\mu_{1,t+1}, \mu_{2,t+1}) \stackrel{\text{def.}}{=} \operatorname{argmin}_{(\mu_1,\mu_2)}\ W_\alpha^\alpha(\mu_{1,t}, \mu_1) + W_\alpha^\alpha(\mu_{2,t}, \mu_2) + \tau f(\mu_1, \mu_2)$
Wasserstein attraction: $f(\mu_1,\mu_2) = W_\alpha^\alpha(\mu_1,\mu_2) + h_1(\mu_1) + h_2(\mu_2)$, with $h_i(\mu) = \langle w, \mu\rangle$. [Figure: evolutions under $\nabla w$.]
Also: $h_i(\mu) = \iota_{[0,\kappa]}(\mu)$.

Slide 124

Slide 124 text

Overview • Measures and Histograms • From Monge to Kantorovitch Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

Slide 125

Slide 125 text

Discriminative vs Generative Models
[Figure: generative map $g_\theta : Z \to X$, from low to high dimension; discriminative map $d_\xi : X \to Z$.]

Slide 126

Slide 126 text

Discriminative vs Generative Models
[Figure: generative map $g_\theta : Z \to X$, from low to high dimension; discriminative map $d_\xi : X \to Z$.]
Supervised (classification, $z$ = class probability): → learn $d_\xi$ from labeled data $(x_i, z_i)_i$.

Slide 127

Slide 127 text

Discriminative vs Generative Models
[Figure: generative map $g_\theta : Z \to X$, from low to high dimension; discriminative map $d_\xi : X \to Z$.]
Supervised (classification, $z$ = class probability): → learn $d_\xi$ from labeled data $(x_i, z_i)_i$.
Un-supervised: → learn $(g_\theta, d_\xi)$ from data $(x_i)_i$. Compression: $z = d_\xi(x)$ is a representation. Generation: $x = g_\theta(z)$ is a synthesis.

Slide 128

Slide 128 text

Discriminative vs Generative Models
[Figure: generative map $g_\theta : Z \to X$, from low to high dimension; discriminative map $d_\xi : X \to Z$.]
Supervised (classification, $z$ = class probability): → learn $d_\xi$ from labeled data $(x_i, z_i)_i$.
Un-supervised: → learn $(g_\theta, d_\xi)$ from data $(x_i)_i$. Compression: $z = d_\xi(x)$ is a representation. Generation: $x = g_\theta(z)$ is a synthesis.
Density fitting: $g_\theta(\{z_i\}_i) \approx \{x_i\}_i$. Auto-encoders: $g_\theta(d_\xi(x_i)) \approx x_i$.

Slide 129

Slide 129 text

Discriminative vs Generative Models
[Figure: generative map $g_\theta : Z \to X$, from low to high dimension; discriminative map $d_\xi : X \to Z$.]
Supervised (classification, $z$ = class probability): → learn $d_\xi$ from labeled data $(x_i, z_i)_i$.
Un-supervised: → learn $(g_\theta, d_\xi)$ from data $(x_i)_i$. Compression: $z = d_\xi(x)$ is a representation. Generation: $x = g_\theta(z)$ is a synthesis.
Density fitting: $g_\theta(\{z_i\}_i) \approx \{x_i\}_i$. Auto-encoders: $g_\theta(d_\xi(x_i)) \approx x_i$.
Optimal transport map as $d_\xi$.

Slide 130

Slide 130 text

Density Fitting and Generative Models
Parametric model: $\theta \mapsto \mu_\theta$. Observations: $\nu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$.

Slide 131

Slide 131 text

Density Fitting and Generative Models
Parametric model: $\theta \mapsto \mu_\theta$. Observations: $\nu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$.
Density fitting, $d\mu_\theta(y) = f_\theta(y)dy$: maximum likelihood (MLE),
$\min_\theta\ \widehat{\mathrm{KL}}(\nu|\mu_\theta) \stackrel{\text{def.}}{=} -\sum_i \log(f_\theta(x_i))$.

Slide 132

Slide 132 text

Density Fitting and Generative Models
Parametric model: $\theta \mapsto \mu_\theta$. Observations: $\nu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$.
Density fitting, $d\mu_\theta(y) = f_\theta(y)dy$: maximum likelihood (MLE),
$\min_\theta\ \widehat{\mathrm{KL}}(\nu|\mu_\theta) \stackrel{\text{def.}}{=} -\sum_i \log(f_\theta(x_i))$.
Generative model fit: $\mu_\theta = g_{\theta\sharp}\zeta$ with $g_\theta : Z \to X$. Then $\widehat{\mathrm{KL}}(\mu_\theta|\nu) = +\infty$ → MLE undefined. → Need a weaker metric.

Slide 133

Slide 133 text

Loss Functions for Measures
Density fitting: $\min_\theta D(\mu_\theta, \nu)$.
Optimal transport distances: for $\mu = \frac{1}{N}\sum_i \delta_{x_i}$, $\nu = \frac{1}{P}\sum_j \delta_{y_j}$,
$W(\mu,\nu)^p \stackrel{\text{def.}}{=} \min_{T\in\mathcal{C}_{\mu,\nu}} \sum_{i,j} T_{i,j}\,\|x_i - y_j\|^p$.

Slide 134

Slide 134 text

Loss Functions for Measures
Density fitting: $\min_\theta D(\mu_\theta, \nu)$.
Optimal transport distances: for $\mu = \frac{1}{N}\sum_i \delta_{x_i}$, $\nu = \frac{1}{P}\sum_j \delta_{y_j}$,
$W(\mu,\nu)^p \stackrel{\text{def.}}{=} \min_{T\in\mathcal{C}_{\mu,\nu}} \sum_{i,j} T_{i,j}\,\|x_i - y_j\|^p$.
Maximum Mean Discrepancy (MMD):
$\|\mu - \nu\|_k^2 \stackrel{\text{def.}}{=} \frac{1}{N^2}\sum_{i,i'} k(x_i, x_{i'}) + \frac{1}{P^2}\sum_{j,j'} k(y_j, y_{j'}) - \frac{2}{NP}\sum_{i,j} k(x_i, y_j)$.
Gaussian: $k(x,y) = e^{-\|x-y\|^2/(2\sigma^2)}$. Energy distance: $k(x,y) = -\|x-y\|_2$.
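MMD is straightforward to estimate from samples. A minimal sketch (biased V-statistic including diagonal terms; function names are mine):

```python
import numpy as np

def mmd_sq(x, y, kernel):
    """Squared MMD ||mu - nu||_k^2 between the empirical measures of
    samples x (N, d) and y (P, d)."""
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def gaussian_kernel(sigma):
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k

def energy_kernel(a, b):
    return -np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1, (500, 2))
y = rng.normal(0.5, 1, (500, 2))
print(mmd_sq(x, y, gaussian_kernel(1.0)), mmd_sq(x, y, energy_kernel))
```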

Slide 135

Slide 135 text

Loss Functions for Measures
Density fitting: $\min_\theta D(\mu_\theta, \nu)$.
Optimal transport distances: $W(\mu,\nu)^p \stackrel{\text{def.}}{=} \min_{T\in\mathcal{C}_{\mu,\nu}} \sum_{i,j} T_{i,j}\,\|x_i - y_j\|^p$.
Maximum Mean Discrepancy (MMD): $\|\mu - \nu\|_k^2$ as above; Gaussian or energy-distance kernel.
Sinkhorn divergences [Cuturi 13]: $W_\varepsilon(\mu,\nu)^p \stackrel{\text{def.}}{=} \sum_{i,j} T^\varepsilon_{i,j}\,\|x_i - y_j\|^p$,
$\bar W_\varepsilon(\mu,\nu)^p \stackrel{\text{def.}}{=} W_\varepsilon(\mu,\nu)^p - \frac{1}{2}W_\varepsilon(\mu,\mu)^p - \frac{1}{2}W_\varepsilon(\nu,\nu)^p$.

Slide 136

Slide 136 text

Loss Functions for Measures
Density fitting: $\min_\theta D(\mu_\theta, \nu)$.
Optimal transport distances: $W(\mu,\nu)^p \stackrel{\text{def.}}{=} \min_{T\in\mathcal{C}_{\mu,\nu}} \sum_{i,j} T_{i,j}\,\|x_i - y_j\|^p$.
Maximum Mean Discrepancy (MMD): $\|\mu - \nu\|_k^2$ as above; Gaussian or energy-distance kernel.
Sinkhorn divergences [Cuturi 13]: $W_\varepsilon(\mu,\nu)^p \stackrel{\text{def.}}{=} \sum_{i,j} T^\varepsilon_{i,j}\,\|x_i - y_j\|^p$,
$\bar W_\varepsilon(\mu,\nu)^p \stackrel{\text{def.}}{=} W_\varepsilon(\mu,\nu)^p - \frac{1}{2}W_\varepsilon(\mu,\mu)^p - \frac{1}{2}W_\varepsilon(\nu,\nu)^p$.
Theorem [Ramdas, G.Trillos, Cuturi 17]: $\bar W_\varepsilon(\mu,\nu)^p \xrightarrow{\varepsilon\to 0} W(\mu,\nu)^p$ and $\bar W_\varepsilon(\mu,\nu)^p \xrightarrow{\varepsilon\to+\infty} \|\mu - \nu\|_k^2$ for $k(x,y) = -\|x-y\|^p$.

Slide 137

Slide 137 text

Loss Functions for Measures
Density fitting: $\min_\theta D(\mu_\theta, \nu)$.
Optimal transport distances: $W(\mu,\nu)^p \stackrel{\text{def.}}{=} \min_{T\in\mathcal{C}_{\mu,\nu}} \sum_{i,j} T_{i,j}\,\|x_i - y_j\|^p$.
Maximum Mean Discrepancy (MMD): $\|\mu - \nu\|_k^2$ as above; Gaussian or energy-distance kernel.
Sinkhorn divergences [Cuturi 13]: $W_\varepsilon(\mu,\nu)^p \stackrel{\text{def.}}{=} \sum_{i,j} T^\varepsilon_{i,j}\,\|x_i - y_j\|^p$,
$\bar W_\varepsilon(\mu,\nu)^p \stackrel{\text{def.}}{=} W_\varepsilon(\mu,\nu)^p - \frac{1}{2}W_\varepsilon(\mu,\mu)^p - \frac{1}{2}W_\varepsilon(\nu,\nu)^p$.
Theorem [Ramdas, G.Trillos, Cuturi 17]: $\bar W_\varepsilon(\mu,\nu)^p \xrightarrow{\varepsilon\to 0} W(\mu,\nu)^p$ and $\bar W_\varepsilon(\mu,\nu)^p \xrightarrow{\varepsilon\to+\infty} \|\mu - \nu\|_k^2$ for $k(x,y) = -\|x-y\|^p$.
Best of both worlds → cross-validate $\varepsilon$:
– Scale free (no $\sigma$, no heavy-tail kernel).
– Non-Euclidean, arbitrary ground distance.
– Less biased gradient.
– No curse of dimension (low sample complexity).
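A sketch of the Sinkhorn divergence estimated from samples, combining a basic Sinkhorn loop with the debiasing terms above ($\varepsilon$, iteration counts, and names are mine; small $\varepsilon$ calls for log-domain stabilization):

```python
import numpy as np

def sinkhorn_cost(x, y, p=2, eps=0.1, n_iter=200):
    """W_eps(mu,nu)^p = <T_eps, C> for uniform empirical measures,
    with C_ij = ||x_i - y_j||^p."""
    C = np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)) ** p
    K = np.exp(-C / eps)
    mu, nu = np.full(len(x), 1 / len(x)), np.full(len(y), 1 / len(y))
    b = np.ones_like(nu)
    for _ in range(n_iter):
        a = mu / (K @ b)
        b = nu / (K.T @ a)
    return ((a[:, None] * K * b[None, :]) * C).sum()

def sinkhorn_divergence(x, y, p=2, eps=0.1):
    """Debiased divergence W_eps(x,y) - W_eps(x,x)/2 - W_eps(y,y)/2."""
    return (sinkhorn_cost(x, y, p, eps)
            - 0.5 * sinkhorn_cost(x, x, p, eps)
            - 0.5 * sinkhorn_cost(y, y, p, eps))

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (200, 2))
y = rng.normal(1, 1, (200, 2))
print(sinkhorn_divergence(x, y))
```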

Slide 138

Slide 138 text

Deep Discriminative vs Generative Models
Deep networks: generative $g_\theta(z) = \rho(\theta_K(\dots\rho(\theta_2(\rho(\theta_1(z))))\dots))$; discriminative $d_\xi(x) = \rho(\xi_K(\dots\rho(\xi_2(\rho(\xi_1(x))))\dots))$.

Slide 139

Slide 139 text

Deep Discriminative vs Generative Models
Deep networks: generative $g_\theta(z) = \rho(\theta_K(\dots\rho(\theta_2(\rho(\theta_1(z))))\dots))$; discriminative $d_\xi(x) = \rho(\xi_K(\dots\rho(\xi_2(\rho(\xi_1(x))))\dots))$.
[Figure: layer diagrams $\theta_1, \theta_2, \dots$ and $\xi_1, \xi_2, \dots$, with $g_\theta : Z \to X$ and $d_\xi : X \to Z$.]

Slide 140

Slide 140 text

Examples of Image Generation
$g_\theta : Z \to X$ [Credit: arXiv:1511.06434]

Slide 141

Slide 141 text

Overview • Measures and Histograms • From Monge to Kantorovitch Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

Slide 142

Slide 142 text

Gromov-Wasserstein
Unregistered spaces $X$, $Y$. Inputs: {(similarity/kernel matrix, histogram)}.

Slide 143

Slide 143 text

Gromov-Wasserstein
Unregistered spaces $X$, $Y$. Inputs: {(similarity/kernel matrix, histogram)}.
Def. Gromov-Wasserstein distance [Memoli 2011] [Sturm 2012]: $\min_{T\in\mathcal{C}_{p,q}} \sum_{i,j,k,\ell} |C_{i,k} - \bar C_{j,\ell}|^2\, T_{i,j}\,T_{k,\ell}$.

Slide 144

Slide 144 text

Gromov-Wasserstein
Unregistered spaces $X$, $Y$. Inputs: {(similarity/kernel matrix, histogram)}.
Def. Gromov-Wasserstein distance [Memoli 2011] [Sturm 2012]: $\min_{T\in\mathcal{C}_{p,q}} \sum_{i,j,k,\ell} |C_{i,k} - \bar C_{j,\ell}|^2\, T_{i,j}\,T_{k,\ell}$.
→ NP-hard in general. → Need for a fast approximate solver.

Slide 145

Slide 145 text

Gromov-Wasserstein as a Metric
Def.: compare pairwise distances through a coupling $T$ of points $(x_i)_i \subset X$ and $(y_j)_j \subset Y$: $\min_T \sum_{i,i',j,j'} |d_X(x_i, x_{i'}) - d_Y(y_j, y_{j'})|^2\, T_{i,j}\,T_{i',j'}$.

Slide 146

Slide 146 text

Gromov-Wasserstein as a Metric
Def.: compare pairwise distances through a coupling $T$ of points $(x_i)_i \subset X$ and $(y_j)_j \subset Y$: $\min_T \sum_{i,i',j,j'} |d_X(x_i, x_{i'}) - d_Y(y_j, y_{j'})|^2\, T_{i,j}\,T_{i',j'}$.
Def. Isometries on $\mathcal{M}$: maps $f : X \to Y$ with $d_Y(f(x), f(x')) = d_X(x, x')$. [Figure: isometric pairs $(X, Y)$.]

Slide 147

Slide 147 text

Gromov-Wasserstein as a Metric
Def.: compare pairwise distances through a coupling $T$ of points $(x_i)_i \subset X$ and $(y_j)_j \subset Y$: $\min_T \sum_{i,i',j,j'} |d_X(x_i, x_{i'}) - d_Y(y_j, y_{j'})|^2\, T_{i,j}\,T_{i',j'}$.
Def. Isometries on $\mathcal{M}$: maps $f : X \to Y$ with $d_Y(f(x), f(x')) = d_X(x, x')$. [Figure: isometric pairs $(X, Y)$.]
Prop. [Memoli 2011]: GW is a distance up to isometries. → "bending-invariant" objects recognition.

Slide 148

Slide 148 text

For Arbitrary Spaces
Metric-measure spaces $(X, Y)$: $(d_X, \mu)$, $(d_Y, \nu)$.

Slide 149

Slide 149 text

For Arbitrary Spaces
Metric-measure spaces $(X, Y)$: $(d_X, \mu)$, $(d_Y, \nu)$.
Def. Gromov-Wasserstein distance [Memoli 2011] [Sturm 2012]:
$\mathrm{GW}_2^2(d_X, \mu, d_Y, \nu) \stackrel{\text{def.}}{=} \min_{\pi\in\Pi(\mu,\nu)} \int_{X^2\times Y^2} |d_X(x,x') - d_Y(y,y')|^2\, d\pi(x,y)\,d\pi(x',y')$
[Figure: pairs $(x, x')$ in $X$ and $(y, y')$ in $Y$ with distances $d_X(x,x')$, $d_Y(y,y')$.]

Slide 150

Slide 150 text

For Arbitrary Spaces
Metric-measure spaces $(X, Y)$: $(d_X, \mu)$, $(d_Y, \nu)$.
Def. Gromov-Wasserstein distance [Memoli 2011] [Sturm 2012]:
$\mathrm{GW}_2^2(d_X, \mu, d_Y, \nu) \stackrel{\text{def.}}{=} \min_{\pi\in\Pi(\mu,\nu)} \int_{X^2\times Y^2} |d_X(x,x') - d_Y(y,y')|^2\, d\pi(x,y)\,d\pi(x',y')$
Prop. GW is a distance on mm-spaces up to isometries. → "bending-invariant" objects recognition.

Slide 151

Slide 151 text

For Arbitrary Spaces
Metric-measure spaces $(X, Y)$: $(d_X, \mu)$, $(d_Y, \nu)$.
Def. Gromov-Wasserstein distance [Memoli 2011] [Sturm 2012]:
$\mathrm{GW}_2^2(d_X, \mu, d_Y, \nu) \stackrel{\text{def.}}{=} \min_{\pi\in\Pi(\mu,\nu)} \int_{X^2\times Y^2} |d_X(x,x') - d_Y(y,y')|^2\, d\pi(x,y)\,d\pi(x',y')$
Prop. GW is a distance on mm-spaces up to isometries. → "bending-invariant" objects recognition.
→ QAP: NP-hard in general. → Need for a fast approximate solver.

Slide 152

Slide 152 text

Entropic Gromov-Wasserstein
Def. Entropic Gromov-Wasserstein: $\min_{T\in\mathcal{C}_{p,q}} \sum_{i,j,k,\ell} L(C_{i,k}, \bar C_{j,\ell})\,T_{i,j}T_{k,\ell} - \varepsilon H(T)$, minimized by projected mirror descent.

Slide 153

Slide 153 text

Entropic Gromov-Wasserstein
Def. Entropic Gromov-Wasserstein: $\min_{T\in\mathcal{C}_{p,q}} \sum_{i,j,k,\ell} L(C_{i,k}, \bar C_{j,\ell})\,T_{i,j}T_{k,\ell} - \varepsilon H(T)$, minimized by projected mirror descent.
Prop.: for $\tau = 1/\varepsilon$, each mirror-descent step is a KL projection, computed with Sinkhorn:
func T = GW(C, C̄, p, q):
  initialization: $T \leftarrow p\,q^\top$
  repeat: $T \leftarrow \mathrm{Proj}^{\mathrm{KL}}_{\mathcal{C}_{p,q}}\big(e^{-\mathcal{L}(C,\bar C)\otimes T/\varepsilon}\big)$
  until convergence; return $T$
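A compact numpy sketch of these iterations for the square loss $L(a,b) = (a-b)^2$: each outer step linearizes the quartic objective at the current $T$ and performs a Sinkhorn (KL) projection ($\varepsilon$, sizes, and the isometry test are my choices):

```python
import numpy as np

def sinkhorn_proj(K, p, q, n_iter=200):
    """KL projection of a positive kernel K onto the couplings C_{p,q}."""
    b = np.ones_like(q)
    for _ in range(n_iter):
        a = p / (K @ b)
        b = q / (K.T @ a)
    return a[:, None] * K * b[None, :]

def entropic_gw(C1, C2, p, q, eps, n_outer=50):
    """Projected mirror-descent sketch for entropic GW, square loss."""
    T = p[:, None] * q[None, :]          # independent coupling to start
    for _ in range(n_outer):
        # M_ij = sum_{k,l} (C1_ik - C2_jl)^2 T_kl, expanded in three terms:
        M = ((C1 ** 2) @ p)[:, None] + ((C2 ** 2) @ q)[None, :] \
            - 2 * C1 @ T @ C2.T
        T = sinkhorn_proj(np.exp(-M / eps), p, q)
    return T

# Example: a point cloud vs. a rotated+translated copy, compared only
# through their internal distance matrices (an isometry for GW).
rng = np.random.default_rng(0)
x = rng.random((30, 2))
R = np.array([[0.0, -1.0], [1.0, 0.0]])
y = x @ R.T + 5.0
D = lambda z: np.sqrt(((z[:, None] - z[None, :]) ** 2).sum(-1))
T = entropic_gw(D(x), D(y), np.full(30, 1 / 30), np.full(30, 1 / 30),
                eps=5e-2)
print(np.argmax(T, axis=1)[:10])  # mostly the identity matching
```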

Slide 154

Slide 154 text

Applications of GW: Shapes Analysis
Use $T$ to define a registration between: shape ↔ shape, shape ↔ colors distribution.
[Figures from Solomon, Peyré, Kim, Sra, "Entropic Metric Alignment for Correspondence Problems": entropic GW finds fuzzy correspondences between a source surface and targets with similar structure, shared semantic structure, a noisy 3D point cloud, an icon, and a hand drawing, all with the same code; the sharpness of the correspondence is controlled by the weight of the entropic regularizer; supervised matching via a sparsity stencil and distance-weighted variants; mapping a set of 185 images onto two shapes while preserving color similarity.]

Slide 155

Slide 155 text

Applications of GW: Shapes Analysis
Pipeline: shapes $(X_s)_s$ → geodesic distances → GW distances → MDS visualization (2-D and 3-D).
[Figure: MDS embedding of four classes from the SHREC dataset (Teddies, Humans, Four-legged, Armadillo); recovery of the circular structure of a galloping horse animation sequence, without ground-truth maps between shapes.]

Slide 156

Slide 156 text

Applications of GW: Quantum Chemistry
Regression problem: predict $f$; obtaining it by solving the DFT approximation is too costly.

Slide 157

Slide 157 text

Applications of GW: Quantum Chemistry
Regression problem: predict $f$; obtaining it by solving the DFT approximation is too costly.
[Rupp et al 2012] [Figure: matrix representation of molecules.]

Slide 158

Slide 158 text

Applications of GW: Quantum Chemistry
Regression problem: predict $f$; obtaining it by solving the DFT approximation is too costly.
[Rupp et al 2012] [Figure: matrix representation of molecules.]
GW-interpolation: [Figure.]

Slide 159

Slide 159 text

Gromov-Wasserstein Geodesics Def. Gromov-Wasserstein Geodesic

Slide 160

Slide 160 text

Gromov-Wasserstein Geodesics Def. Gromov-Wasserstein Geodesic Prop. [Sturm 2012]

Slide 161

Slide 161 text

Gromov-Wasserstein Geodesics
Def. Gromov-Wasserstein Geodesic. Prop. [Sturm 2012]
→ $X \times Y$ is not practical for most applications (need to fix the size of the geodesic embedding space).
→ Extension to more than 2 input spaces?

Slide 162

Slide 162 text

Gromov-Wasserstein Barycenters
Input: measured similarity matrices $(C_s, p_s)$, $s = 1, 2, 3$.
Def. GW Barycenters.

Slide 163

Slide 163 text

Gromov-Wasserstein Barycenters
Input: measured similarity matrices $(C_s, p_s)$, $s = 1, 2, 3$.
Def. GW Barycenters.
Alternating minimization:
  initialization: $C \leftarrow C_0$
  repeat:
    for $s = 1$ to $S$: update $T_s$ (entropic GW coupling between $C$ and $C_s$)
    update $C$ from the couplings $(T_s)_s$
  until convergence
  return $C$

Slide 164

Slide 164 text

Conclusion: Toward High-dimensional OT
Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani.
[Figure: the optimal transport framework; sliced Wasserstein projection; application to color transfer (source image X, style image Y, projection onto the style's color statistics, result); J. Rabin, Wasserstein regularization.]