Gabriel Peyré
December 07, 2017
8.2k

# Numerical Optimal Transport

Slides for a course.

## Gabriel Peyré

December 07, 2017

## Transcript

1. ### Numerical Optimal Transport Gabriel Peyré É C O L E

N O R M A L E S U P É R I E U R E Outline • What is the Lasso • Lasso with an orthogonal design • From projected gradient to proximal gradien • Optimality conditions and subgradients (LAR • Coordinate descent algorithm … with some demos www.numerical-tours.com http://optimaltransport.github.io
2. ### Overview • Measures and Histograms • From Monge to Kantorovitch

Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein
3. ### Comparing Measures and Spaces Source image (X) Style image (Y)

Sliced Wasserstein projection of X to style image color statistics Y Source image after color transfer J. Rabin Wasserstein Regularization ! images, vision, graphics and machine learning, . . . • Probability distributions and histograms
4. ### Comparing Measures and Spaces Source image (X) Style image (Y)

Sliced Wasserstein projection of X to style image color statistics Y Source image after color transfer J. Rabin Wasserstein Regularization ! images, vision, graphics and machine learning, . . . • Probability distributions and histograms • Optimal transport Optimal transport mean L2 mean
5. ### Probability Measures Positive Radon measure µ on a set X.

dµ(x) = m(x)dx Measure of sets A ⇢ X: µ(A) = R A dµ(x) > 0 X X X µ = P i µi xi 2.2. Assignment and Monge Problem 9 Discrete d = 1 Discrete d = 2 Density d = 1 Density d = 2 Figure 2.1: Schematic display of discrete distributions – = q n i =1 ai”xi (red cor- responds to empirical uniform distribution ai = 1/n, and blue to arbitrary distri- butions) and densities d–(x) = ﬂ– (x)dx (in violet), in both 1-D and 2-D. Discrete distributions in 1-D are displayed using vertical segments (with length equal to ai ) and in 2-D using point clouds (radius equal to ai ). is the dimension), can have a density d–(x) = ﬂ – (x)dx w.r.t. the Lebesgue measure, often denoted ﬂ – = d – d x , which means that d ⁄ ⁄
6. ### Probability Measures Positive Radon measure µ on a set X.

dµ(x) = m(x)dx Integration against continuous functions: R X g(x)dµ(x) > 0 Measure of sets A ⇢ X: µ(A) = R A dµ(x) > 0 X X X µ = P i µi xi R X gdµ = R X m(x)dx R X gdµ = P i µig(xi) dµ(x) = m(x)dx µ = P i µi xi 2.2. Assignment and Monge Problem 9 Discrete d = 1 Discrete d = 2 Density d = 1 Density d = 2 Figure 2.1: Schematic display of discrete distributions – = q n i =1 ai”xi (red cor- responds to empirical uniform distribution ai = 1/n, and blue to arbitrary distri- butions) and densities d–(x) = ﬂ– (x)dx (in violet), in both 1-D and 2-D. Discrete distributions in 1-D are displayed using vertical segments (with length equal to ai ) and in 2-D using point clouds (radius equal to ai ). is the dimension), can have a density d–(x) = ﬂ – (x)dx w.r.t. the Lebesgue measure, often denoted ﬂ – = d – d x , which means that d ⁄ ⁄
7. ### Probability Measures Positive Radon measure µ on a set X.

dµ(x) = m(x)dx Integration against continuous functions: R X g(x)dµ(x) > 0 Measure of sets A ⇢ X: µ(A) = R A dµ(x) > 0 X X X Probability (normalized) measure: µ(X) = R X dµ(x) = 1 µ = P i µi xi R X gdµ = R X m(x)dx R X gdµ = P i µig(xi) dµ(x) = m(x)dx µ = P i µi xi 2.2. Assignment and Monge Problem 9 Discrete d = 1 Discrete d = 2 Density d = 1 Density d = 2 Figure 2.1: Schematic display of discrete distributions – = q n i =1 ai”xi (red cor- responds to empirical uniform distribution ai = 1/n, and blue to arbitrary distri- butions) and densities d–(x) = ﬂ– (x)dx (in violet), in both 1-D and 2-D. Discrete distributions in 1-D are displayed using vertical segments (with length equal to ai ) and in 2-D using point clouds (radius equal to ai ). is the dimension), can have a density d–(x) = ﬂ – (x)dx w.r.t. the Lebesgue measure, often denoted ﬂ – = d – d x , which means that d ⁄ ⁄
8. ### Random vectors Radon measures P(X 2 A) R A dµ(x)

= P(Xn 2 A) n!+1 ! P(X 2 A) 8 set A 8 continuous function f R fdµn n!+1 ! R fdµ Weak⇤ convergence: Convergence in law: Measures and Random Variables
9. ### Random vectors Radon measures P(X 2 A) R A dµ(x)

= P(Xn 2 A) n!+1 ! P(X 2 A) 8 set A 8 continuous function f R fdµn n!+1 ! R fdµ Weak⇤ convergence: Convergence in law: Weak convergence: . . . Measures and Random Variables
10. ### Quotient space: Discrete measure: Discretization: Histogram vs. Empirical Lagrangian (point

clouds) Eulerian (histograms) µ = N X i=1 µi xi xi 2 X, X i µi = 1 Constant weights µi = 1 N XN /⌃N {(µi)i > 0 ; P i µi = 1} Convex polytope (simplex): xi X X Fixed positions xi (e.g. grid)
11. ### - f . s - t e e e l

. x Radon measures (µ, ⌫) on (X, Y ). Transfer of measure by f : X ! Y : push forward. ⌫ = f]µ deﬁned by: ⌫(A) def. = µ(f 1(A)) () R Y g(y)d⌫(y) def. = R X g(f(x))dµ(x) y = f(x) f X µ ⌫ Y A f 1(A) Push Forward
12. ### - f . s - t e e e l

. x Radon measures (µ, ⌫) on (X, Y ). Transfer of measure by f : X ! Y : push forward. ⌫ = f]µ deﬁned by: ⌫(A) def. = µ(f 1(A)) () R Y g(y)d⌫(y) def. = R X g(f(x))dµ(x) y = f(x) f X µ ⌫ Y A f 1(A) Smooth densities: dµ = ⇢(x)dx, d⌫ = ⇠(x)dx Push Forward f]µ = ⌫ () ⇢(f(x))| det(@f(x))| = ⇠(x)
13. ### µ f]µ f f ' Measures: push-forward Functions: pull-back f]'

Remark: f] and f] are adjoints Z Y 'd(f]µ) = Z X (f]')dµ = P i xi def. = P i f(xi) def. = ' f f : X ! Y f] : M(X) ! M(Y) f] : C(Y) ! C(X) Push-forward vs. Pull-back
14. ### P( lim n!+1 Xn = X) = 1 =) lim

n!+1 E(|Xn X|p) = 0 P(Xn 2 A) n!+1 ! P(X 2 A) In law 8 " > 0, P(|Xn X| > ") n!+1 ! 0 In probability Almost sure In mean =) =) (the Xn can be deﬁned on di↵erent spaces) Convergence of Random Variables
15. ### Overview • Measures and Histograms • From Monge to Kantorovitch

Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

µ Y ⌫
18. ### Monge Transport min ⌫=f]µ Z X c(x, f(x))dµ(x) f X

µ Y ⌫ Theorem: [Brenier] for c(x, y) = ||x y||2, (µ, ⌫) with density, there exists a unique optimal f. One has f = r where is the unique convex function such that (r )]µ = ⌫
19. ### Monge Transport min ⌫=f]µ Z X c(x, f(x))dµ(x) f X

µ Y ⌫ Monge-Amp` ere equation: ⇢(r ) det(@2 ) = ⇠ Theorem: [Brenier] for c(x, y) = ||x y||2, (µ, ⌫) with density, there exists a unique optimal f. One has f = r where is the unique convex function such that (r )]µ = ⌫
20. ### Monge Transport min ⌫=f]µ Z X c(x, f(x))dµ(x) f X

µ Y ⌫ Monge-Amp` ere equation: ⇢(r ) det(@2 ) = ⇠ ⌫ µ f f0 ? µ ⌫ µ ⌫ f µ ⌫ f Non-uniqueness / non-existence: Theorem: [Brenier] for c(x, y) = ||x y||2, (µ, ⌫) with density, there exists a unique optimal f. One has f = r where is the unique convex function such that (r )]µ = ⌫

22. ### Optimal Transport was formulated in 1930 by A.N. Tolstoi, 12

years before Kantorovich. He even solved a "large scale" 10×68 instance! Before Kantorovitch
23. ### xi yj Kantorovitch’s Formulation Points (xi)i, (yj)j T Input distributions

Couplings Def.
24. ### xi yj Kantorovitch’s Formulation Points (xi)i, (yj)j T Input distributions

Couplings Def. [Kantorovich 1942] Wasserstein Distance / EMD Def.
25. ### OT Between General Measures Couplings: P1]⇡(S) def. = ⇡(S, X)

P2]⇡(S) def. = ⇡(X, S) Marginals: ⇡ ⇧(µ, ⌫) def. = {⇡ 2 M+(X ⇥ X) ; P1]⇡ = µ, P2]⇡ = ⌫} ⇡ Optimal transport: [Kantorovitch 1942] Wp p (µ, ⌫) def. = min ⇡ ⇢ hdp, ⇡i = Z X⇥X d(x, y)pd⇡(x, y) ; ⇡ 2 ⇧(µ, ⌫)

↵ ↵ ↵ ↵ ↵ ↵

↵ ↵ ⇡
28. ### 1-D Optimal Transport Remark. If ⌦ = R, c(x, y)

= c(|x y|), c convex, F 1 µ , F 1 ⌫ quantile functions, W(µ, ⌫)= Z 1 0 c(|F 1 µ (x) F 1 ⌫ (x)|)dx
29. ### Cases 37 ρβ ρα Put Remark. If ⌦ = Rd,

c(x, y) = kx yk2, and µ = N(mµ, ⌃µ), ⌫ = N(m⌫, ⌃⌫) then W2 2 (µ, ⌫) = kmµ m⌫ k2 + B(⌃µ, ⌃⌫)2 where B is the Bures metric B(⌃µ, ⌃⌫)2 = trace(⌃µ+⌃⌫ 2(⌃1/2 µ ⌃⌫⌃1/2 µ )1/2). The map T : x 7! m⌫ + A(x mµ) is optimal, where A = ⌃ 1 2 µ ✓ ⌃ 1 2 µ ⌃⌫⌃ 1 2 µ ◆1 2 ⌃ 1 2 µ . m 1 p 2⇡ e (x m)2 2 2 OT Between Gaussians T : x 7! m⌫ + A(x mµ)

32. ### Bins-to-bins metrics: dµ(x) = ⇢(x)dx d˜ µ(x) = ˜ ⇢(x)dx

µ ˜ µ Kullback-Leibler divergence: DKL(µ, ˜ µ) = Z ⇢(x) log ⇢(x) ˜ ⇢(x) dx Hellinger distance: D H (µ, ˜ µ)2 = Z ⇣p ⇢(x) p ˜ ⇢(x) ⌘ 2 dx Metrics on the Space of Measures
33. ### Bins-to-bins metrics: dµ (x) = ⇢(x )dx dµ(x) = ⇢(x)dx

d˜ µ(x) = ˜ ⇢(x)dx µ ˜ µ µ Kullback-Leibler divergence: DKL(µ, ˜ µ) = Z ⇢(x) log ⇢(x) ˜ ⇢(x) dx Hellinger distance: E↵ect of translation: D(µ, µ ) ⇡ cst W2(µ, µ ) = D H (µ, ˜ µ)2 = Z ⇣p ⇢(x) p ˜ ⇢(x) ⌘ 2 dx Metrics on the Space of Measures
34. ### D'(↵| ) def. = Z X ' ✓ d↵ d

◆ d ||↵ ||B def. = max f2B Z X f(x)(d↵(x) d (x)) Csisz´ ar divergences: Dual norms: X X ↵ ↵ D' || · ||B Weak topology Strong topology ! KL, TV, 2 , Hellinger . . . ! W1, ﬂat, RKHS ⇤ , energy dist, . . . Csiszar Divergence vs Dual Norms
35. ### Csiszár divergences, a unifying way to deﬁne losses between arbitrary

positive measures (discrete & densities). https://en.wikipedia.org/wiki/F- divergence … Csiszar Divergence
36. ### -3 -2 -1 0 1 2 3 0 1 2

3 Energy Gauss W 1 Flat -3 -2 -1 0 1 2 3 0 1 2 3 Energy Gauss W 1 Flat Dual norms: (aka Integral Probability Metrics) ||↵ ||B def. = max ⇢Z X f(x)(d↵(x) d (x)) ; f 2 B t t/2 Wasserstein 1: B = {f ; ||rf||1 6 1}. Flat norm: B = {f ; ||f||1 6 1, ||rf||1 6 1}. RKHS: B = f ; ||f||2 k 6 1 . ||↵ ||2 B = Z k(x, x0)d↵(x)d↵(x0) + Z k(x, x0)d (y)d (y0) 2 Z k(x, y)d↵(x)d (y) Energy distance: k(x, y) = ||x y|| Gaussian: k(x, y) = e | |x y| |2 2 2 ↵ ↵ Dual Norms

38. ### The Earth Mover’s Distance µ ⌫ dist(I1, I2) = W1(µ,

⌫) [Rubner’98]

⌫) µ ⌫
40. ### Overview • Measures and Histograms • From Monge to Kantorovitch

Formulations • Linear Programming and Semi-discrete • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein
41. ### Linear programming: µ = PN1 i=1 pi xi , ⌫

= PN2 j=1 pj yi Algorithms
42. ### Linear programming: µ = PN1 i=1 pi xi , ⌫

= PN2 j=1 pj yi ⇠ O(N3) Hungarian/Auction: µ = 1 N PN i=1 xi , ⌫ = 1 N PN j=1 yj Algorithms
43. ### Linear programming: µ = PN1 i=1 pi xi , ⌫

= PN2 j=1 pj yi ⇠ O(N3) Hungarian/Auction: µ = 1 N PN i=1 xi , ⌫ = 1 N PN j=1 yj Algorithms µ ⌫
44. ### Linear programming: µ = PN1 i=1 pi xi , ⌫

= PN2 j=1 pj yi ⇠ O(N3) Hungarian/Auction: µ = 1 N PN i=1 xi , ⌫ = 1 N PN j=1 yj Monge-Amp` ere/Benamou-Brenier, d = || · ||2 2 . also reveal that the network simplex behaves in O(n 2) in our con- text, which is a major gain at the scale at which we typical work, i.e. thousands of particles. This ﬁnding is also useful for applica- tions that use EMD, where using the network simplex instead of the transport simplex can bring a signiﬁcant performance increase. Our experiments also show that ﬁxed-point precision further speeds up the computation. We observed that the value of the ﬁnal trans- port cost is less accurate because of the limited precision, but that the particle pairing that produces the actual interpolation scheme remains unchanged. We used the ﬁxed point method to generate the results presented in this paper. The results of the performance study are also of broader interest, as current EMD image retrieval or color transfer techniques rely on slower solvers [Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003]. 101 102 103 104 10−4 10−2 100 102 Problem size : number of bins per histogram Time in seconds Network Simplex (fixed point) Network Simplex (double prec.) Transport Simplex y = α x2 y = β x3 Figure 6: Log-log plot of the running times of different solvers. Figure 7: Synthetic 2D examples on a Euclidean domain. Th left and right columns show the input distributions, while the cent columns show interpolations for ↵ = 1/4, ↵ = 1/2, and ↵ = 3/ positive. A negative side effect of this choice is that interpolatin between BRDFs of equal energy conserves their log energy (§ 3. instead of their energy. Because we apply a concave remappin the interpolated value is guaranteed to be always lower, which e sures that our result does not break the energy preservation rul Algorithms µ ⌫
45. ### Linear programming: µ = PN1 i=1 pi xi , ⌫

= PN2 j=1 pj yi ⇠ O(N3) Hungarian/Auction: µ = 1 N PN i=1 xi , ⌫ = 1 N PN j=1 yj Monge-Amp` ere/Benamou-Brenier, d = || · ||2 2 . also reveal that the network simplex behaves in O(n 2) in our con- text, which is a major gain at the scale at which we typical work, i.e. thousands of particles. This ﬁnding is also useful for applica- tions that use EMD, where using the network simplex instead of the transport simplex can bring a signiﬁcant performance increase. Our experiments also show that ﬁxed-point precision further speeds up the computation. We observed that the value of the ﬁnal trans- port cost is less accurate because of the limited precision, but that the particle pairing that produces the actual interpolation scheme remains unchanged. We used the ﬁxed point method to generate the results presented in this paper. The results of the performance study are also of broader interest, as current EMD image retrieval or color transfer techniques rely on slower solvers [Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003]. 101 102 103 104 10−4 10−2 100 102 Problem size : number of bins per histogram Time in seconds Network Simplex (fixed point) Network Simplex (double prec.) Transport Simplex y = α x2 y = β x3 Figure 6: Log-log plot of the running times of different solvers. Figure 7: Synthetic 2D examples on a Euclidean domain. Th left and right columns show the input distributions, while the cent columns show interpolations for ↵ = 1/4, ↵ = 1/2, and ↵ = 3/ positive. A negative side effect of this choice is that interpolatin between BRDFs of equal energy conserves their log energy (§ 3. instead of their energy. Because we apply a concave remappin the interpolated value is guaranteed to be always lower, which e sures that our result does not break the energy preservation rul Algorithms µ ⌫ Semi-discrete: Laguerre cells, d = || · ||2 2 . [Levy,’15]
46. ### Linear programming: µ = PN1 i=1 pi xi , ⌫

= PN2 j=1 pj yi ⇠ O(N3) Hungarian/Auction: µ = 1 N PN i=1 xi , ⌫ = 1 N PN j=1 yj Monge-Amp` ere/Benamou-Brenier, d = || · ||2 2 . also reveal that the network simplex behaves in O(n 2) in our con- text, which is a major gain at the scale at which we typical work, i.e. thousands of particles. This ﬁnding is also useful for applica- tions that use EMD, where using the network simplex instead of the transport simplex can bring a signiﬁcant performance increase. Our experiments also show that ﬁxed-point precision further speeds up the computation. We observed that the value of the ﬁnal trans- port cost is less accurate because of the limited precision, but that the particle pairing that produces the actual interpolation scheme remains unchanged. We used the ﬁxed point method to generate the results presented in this paper. The results of the performance study are also of broader interest, as current EMD image retrieval or color transfer techniques rely on slower solvers [Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003]. 101 102 103 104 10−4 10−2 100 102 Problem size : number of bins per histogram Time in seconds Network Simplex (fixed point) Network Simplex (double prec.) Transport Simplex y = α x2 y = β x3 Figure 6: Log-log plot of the running times of different solvers. Figure 7: Synthetic 2D examples on a Euclidean domain. Th left and right columns show the input distributions, while the cent columns show interpolations for ↵ = 1/4, ↵ = 1/2, and ↵ = 3/ positive. A negative side effect of this choice is that interpolatin between BRDFs of equal energy conserves their log energy (§ 3. instead of their energy. Because we apply a concave remappin the interpolated value is guaranteed to be always lower, which e sures that our result does not break the energy preservation rul Algorithms µ ⌫ Semi-discrete: Laguerre cells, d = || · ||2 2 . [Levy,’15]
47. ### Need for fast approximate algorithms for generic c. Linear programming:

µ = PN1 i=1 pi xi , ⌫ = PN2 j=1 pj yi ⇠ O(N3) Hungarian/Auction: µ = 1 N PN i=1 xi , ⌫ = 1 N PN j=1 yj Monge-Amp` ere/Benamou-Brenier, d = || · ||2 2 . also reveal that the network simplex behaves in O(n 2) in our con- text, which is a major gain at the scale at which we typical work, i.e. thousands of particles. This ﬁnding is also useful for applica- tions that use EMD, where using the network simplex instead of the transport simplex can bring a signiﬁcant performance increase. Our experiments also show that ﬁxed-point precision further speeds up the computation. We observed that the value of the ﬁnal trans- port cost is less accurate because of the limited precision, but that the particle pairing that produces the actual interpolation scheme remains unchanged. We used the ﬁxed point method to generate the results presented in this paper. The results of the performance study are also of broader interest, as current EMD image retrieval or color transfer techniques rely on slower solvers [Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003]. 101 102 103 104 10−4 10−2 100 102 Problem size : number of bins per histogram Time in seconds Network Simplex (fixed point) Network Simplex (double prec.) Transport Simplex y = α x2 y = β x3 Figure 6: Log-log plot of the running times of different solvers. Figure 7: Synthetic 2D examples on a Euclidean domain. Th left and right columns show the input distributions, while the cent columns show interpolations for ↵ = 1/4, ↵ = 1/2, and ↵ = 3/ positive. A negative side effect of this choice is that interpolatin between BRDFs of equal energy conserves their log energy (§ 3. instead of their energy. Because we apply a concave remappin the interpolated value is guaranteed to be always lower, which e sures that our result does not break the energy preservation rul Algorithms µ ⌫ Semi-discrete: Laguerre cells, d = || · ||2 2 . [Levy,’15]
48. ### C-Transform 92 Semi-discrete Optimal Transport 0 0.5 1 -0.2 0

0.2 0.4 0.6 0.5 1 1.5 2 p = 1/2 p = 1 p = 3/2 p = 2 Figure 5.1: Top: examples of semi-discrete ¯ c-transforms g ¯ c in 1-D, for ground d the useful indicator function notation (4.42). alternate minimization on either f or g leads to the im- n of c-transform: ’ y œ Y, fc(y) def. = inf xœX c(x, y) ≠ f(x), (5.1) ’ x œ X, g¯ c(x) def. = inf yœY c(x, y) ≠ g(y), (5.2) oted ¯ c(y, x) def. = c(x, y). Indeed, one can check that œ argmax g E(f, g) and g¯ c œ argmax f E(f, g). (5.3) ese partial minimizations deﬁne maximizers on the sup- tively – and —, while the deﬁnitions (5.1) actually deﬁne he whole spaces X and Y. This is thus a way to extend in ay solutions of (2.22) on the whole spaces. When X = Rd Îx ≠ yÎp, then the c-transform (5.1) fc is the so-called n between ≠f and Î·Îp. The deﬁnition of fc is also often a “Hopf-Lax formula”. (f, g) œ C(X) ◊ C(Y) ‘æ (g¯ c, fc) œ C(X) ◊ C(Y) replaces s by “better” ones (improving the dual objective E). Func- be written in the form fc and g¯ c are called c-concave and ctions. In the special case c(x, y) = Èx, yÍ in X = Y = Rd, n coincides with the usual notion of concave functions. turally Proposition 3.1 to a continuous case, one has the operations are replaced by a “soft-min”. Using (5.3), one can reformulate (2.22) as an unconstrained program over a single potential Lc (–, —) = max fœC ( X ) ⁄ X f(x)d–(x) + ⁄ Y fc(y)d—(y), = max gœC ( Y ) ⁄ X g¯ c(x)d–(x) + ⁄ Y g(y)d—(y). Since one can iterate the map (f, g) ‘æ (g¯ c, fc), it is possible to these optimization problems the constraint that f is ¯ c-concave is c-concave, which is important to ensure enough regularity on potentials and show for instance existence of solutions to (2.22) 5.2 Semi-discrete Formulation A case of particular interest is when — = q j bj ” yj is discrete (of the same construction applies if – is discrete by exchanging the (–, —)). One can adapt the deﬁnition of the ¯ c transform (5.1) setting by restricting the minimization to the support (y j ) j of — ’ g œ Rm, ’ x œ X, g ¯ c(x) def. = min jœJmK c(x, y j ) ≠ gj . This transform maps a vector g to a continuous function g ¯ c œ Note that this deﬁnition coincides with (5.1) when imposing th space X is equal to the support of —. Figure 5.1 shows some ex of such discrete ¯ c-transforms in 1-D and 2-D. Using this discrete ¯ c-transform, in this semi-discrete case, ( equivalent to the following ﬁnite dimensional optimization
49. ### Semi-discrete Descent Algorithm 94 Semi-discrete Optimal Transport ¸ = 1

¸ = 3 ¸ = 50 ¸ = 100 Matching Figure 5.2: Iterations of the semi-discrete OT algorithm minimizing (5.8) (here a simple gradient descent is used). The support (yj ) j of the discrete measure — is indicated by the red points, while the continuous measure – is the uniform measure on a square. The blue cells display the Laguerre partition (Lg(¸) (yj )) j where g ( ¸ ) is the discrete dual potential computed at iteration ¸.
50. ### 2000 4000 6000 8000 10000 -8 -6 -4 -2 0

log E(g(?)) E(g(`)) ` Stochastic gradient descent for the semi-discrete Optimal Transport, illustration of convergence and corresponding Laguerre cells. https://arxiv.org/abs/1605.08527 Semi-discrete Stochastic Descent
51. ### The simplex of histograms with N bins is ⌃ N

def. = p 2 R+ N ; P i p i = 1 . The entropy of T 2 RN⇥N + is deﬁned as H(T) def. = P N i,j=1 T i,j (log(T i,j ) 1). The set of couplings between histograms p 2 ⌃ N1 and q 2 ⌃ N2 is C p,q def. = T 2 (R +)N1 ⇥N2 ; T1 N2 = p, T>1 N1 = q . Here, 1 N def. = (1, . . . , 1)> 2 RN . For any tensor L = (L i,j,k,` ) i,j,k,` and matrix (T i,j ) i,j , we deﬁne the tensor- matrix multiplication as L ⌦ T def. = ⇣ X k,` L i,j,k,` T k,` ⌘ i,j . (1) 2. Gromov-Wasserstein Discrepancy 2.1. Entropic Optimal Transport Optimal transport distances are useful to compare two his- tograms (p, q) 2 ⌃ N1 ⇥ ⌃ N2 deﬁned on the same metric 1 In our setting, sinc learning problems, distance matrices, i does not necessaril We deﬁne the Gro two measured simil and ( ¯ C, q) 2 RN2 ⇥ GW(C where E C,C The matrix T is a which the similarit L is some loss fun the similarity matr the quadratic loss the Kullback-Leibl a log(a/b) a + inition (4) of GW Entropy: Entropic Regularization Def. Regularized OT: [Cuturi NIPS’13]
52. ### The simplex of histograms with N bins is ⌃ N

def. = p 2 R+ N ; P i p i = 1 . The entropy of T 2 RN⇥N + is deﬁned as H(T) def. = P N i,j=1 T i,j (log(T i,j ) 1). The set of couplings between histograms p 2 ⌃ N1 and q 2 ⌃ N2 is C p,q def. = T 2 (R +)N1 ⇥N2 ; T1 N2 = p, T>1 N1 = q . Here, 1 N def. = (1, . . . , 1)> 2 RN . For any tensor L = (L i,j,k,` ) i,j,k,` and matrix (T i,j ) i,j , we deﬁne the tensor- matrix multiplication as L ⌦ T def. = ⇣ X k,` L i,j,k,` T k,` ⌘ i,j . (1) 2. Gromov-Wasserstein Discrepancy 2.1. Entropic Optimal Transport Optimal transport distances are useful to compare two his- tograms (p, q) 2 ⌃ N1 ⇥ ⌃ N2 deﬁned on the same metric 1 In our setting, sinc learning problems, distance matrices, i does not necessaril We deﬁne the Gro two measured simil and ( ¯ C, q) 2 RN2 ⇥ GW(C where E C,C The matrix T is a which the similarit L is some loss fun the similarity matr the quadratic loss the Kullback-Leibl a log(a/b) a + inition (4) of GW Entropy: Entropic Regularization Regularization impact on solution: Def. Regularized OT: [Cuturi NIPS’13] c " T" " "
53. ### Overview • Measures and Histograms • From Monge to Kantorovitch

Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

on GPU.

on GPU.

.
59. ### Optimal Transport on Surfaces Level sets xi d(xi, ·) Ground

cost: c(x, y) = dM (x, y)↵. Triangulated mesh M. Geodesic distance dM .
60. ### Optimal Transport on Surfaces Level sets xi d(xi, ·) Computing

c (Fast-Marching): N2 log(N) ! too costly. Ground cost: c(x, y) = dM (x, y)↵. Triangulated mesh M. Geodesic distance dM .
61. ### Entropic Transport on Surfaces @tut(x, ·) = M ut(x, ·),

ut=0(x, ·) = x Heat equation on M: " t
62. ### Entropic Transport on Surfaces @tut(x, ·) = M ut(x, ·),

ut=0(x, ·) = x Heat equation on M: Theorem: [Varadhan] " log(u") "!0 ! d2 M " t
63. ### Entropic Transport on Surfaces @tut(x, ·) = M ut(x, ·),

ut=0(x, ·) = x Heat equation on M: Theorem: [Varadhan] " log(u") "!0 ! d2 M " Sinkhorn kernel: K def. = e d2 M " ⇡ u" ⇡ Id " ` M ` t
64. ### MRI Data Procesing [with A. Gramfort] L2 barycenter W2 2

barycenter Ground cost c = dM : geodesic on cortical surface M.
65. ### Regularization for General Measures Schr¨ odinger’s problem: Landmark computational paper:

[Cuturi 2013]. ⇡" def. = argmin ⇡ {hdp, ⇡i + "KL(⇡|⇡0) ; ⇡ 2 ⇧(µ, ⌫)} ⇡" = argmin ⇡2⇧(µ,⌫) KL(⇡|K) K(x, y) def. = e dp(x,y) " ⇡0(x, y)
66. ### Regularization for General Measures Schr¨ odinger’s problem: Landmark computational paper:

[Cuturi 2013]. ⇡" def. = argmin ⇡ {hdp, ⇡i + "KL(⇡|⇡0) ; ⇡ 2 ⇧(µ, ⌫)} ⇡" = argmin ⇡2⇧(µ,⌫) KL(⇡|K) K(x, y) def. = e dp(x,y) " ⇡0(x, y) Proposition: [Carlier, Duval, Peyr´ e, Schmitzer 2015] ⇡" "!0 ! argmin ⇡2⇧(µ,⌫) hdp, ⇡i ⇡" "!+1 ! µ(x)⌫(y) " ⇡" "
67. ### Back to Sinkhorn’s Algorithm ProxKL f1/" (˜ µ) = µ

ProxKL f2/" (˜ ⌫) = ⌫ f1 = ◆µ f2 = ◆⌫ Optimal transport problem:
68. ### Back to Sinkhorn’s Algorithm ProxKL f1/" (˜ µ) = µ

ProxKL f2/" (˜ ⌫) = ⌫ [Sinkhorn 1967][Deming,Stephan 1940] Sinkhorn/IPFP algorithm: a(`+1) def. = µ Kb(`) and b(`+1) def. = ⌫ K⇤a(`+1) f1 = ◆µ f2 = ◆⌫ Optimal transport problem: ` ⇡(`) "
69. ### Back to Sinkhorn’s Algorithm ProxKL f1/" (˜ µ) = µ

ProxKL f2/" (˜ ⌫) = ⌫ ⇡" " [Sinkhorn 1967][Deming,Stephan 1940] Sinkhorn/IPFP algorithm: a(`+1) def. = µ Kb(`) and b(`+1) def. = ⌫ K⇤a(`+1) f1 = ◆µ f2 = ◆⌫ Optimal transport problem: Proposition: [Franklin,Lorenz 1989] ⇡(`) def. = diag(a(`))K diag(b(`)) || log(⇡ (`)) log(⇡ ?)||1 = O(1 )` , ⇠  1/" c Local rate: [Knight 2008] 1000 2000 3000 4000 5000 -2 -1.5 -1 -0.5 0 ` log(|| log(⇡(`)) log(⇡?)||1) ` ⇡(`) "
70. ### 0 0.5 1 -0.2 0 0.2 0 0.1 0.2 0.3

Á = 0 Á = 0.01 Á = 0.1 Á = 0.3 Figure 5.3: Top: examples of entropic semi-discrete ¯ c-transforms g ¯ c,Á in 1-D, for 0 0.5 1 -0.2 0 0.2 0 0.1 0.2 0.3 Á = 0 Á = 0.01 Á = 0.1 Á = 0.3 Figure 5.3: Top: examples of entropic semi-discrete ¯ c-transforms g ¯ c,Á in 1-D, for Laguerre cells “Sinkhorn” Laguerre cells Semi-discrete Optimal Transport is a multiclass SVM. Sinkhorn version is logistic regression. Decomposes space in entropic Laguerre diagrams 0 0.5 1 -0.2 0 0.2 0 0.1 0.2 0.3 Á = 0 Á = 0.01 Á = 0.1 Á = 0.3 Figure 5.3: Top: examples of entropic semi-discrete ¯ c-transforms g ¯ c,Á in 1-D, for ground cost c(x, y) = |x ≠ y| for varying Á (see colorbar). The red points are at locations (yj, ≠gj ) j . Bottom: examples of entropic semi-discrete ¯ c-transforms g ¯ c,Á in 2-D, for ground cost c(x, y) = Îx ≠ yÎ for varying Á. The black curves are the level sets of the function g ¯ c,Á, while the colors indicate the smoothed indicator function of the Laguerre cells ‰Á j . The red points are at locations yj œ R2 , and their size is proportional to gj . Semi-discrete OT and Entropy
71. ### Overview • Measures and Histograms • From Monge to Kantorovitch

Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein
72. ### Wasserstein Barycenters Barycenters of measures (µk)k: P k k =

1 µ? 2 argmin µ P k kW2 2 (µk, µ)
73. ### Wasserstein Barycenters µ µ1 µ3 W2 (µ1 , µ )

W 2 (µ 2 ,µ ) µ2 W2 (µ3 ,µ ) Barycenters of measures (µk)k: P k k = 1 If µk = xk then µ? = P k kxk Generalizes Euclidean barycenter: µ? 2 argmin µ P k kW2 2 (µk, µ)
74. ### Wasserstein Barycenters µ µ1 µ3 W2 (µ1 , µ )

W 2 (µ 2 ,µ ) µ2 W2 (µ3 ,µ ) Barycenters of measures (µk)k: P k k = 1 If µk = xk then µ? = P k kxk Generalizes Euclidean barycenter: µ? 2 argmin µ P k kW2 2 (µk, µ)
75. ### Wasserstein Barycenters µ µ1 µ3 W2 (µ1 , µ )

W 2 (µ 2 ,µ ) µ2 W2 (µ3 ,µ ) Barycenters of measures (µk)k: P k k = 1 If µk = xk then µ? = P k kxk Generalizes Euclidean barycenter: µ2 µ1 Mc Cann’s displacement interpolation. µ? 2 argmin µ P k kW2 2 (µk, µ)
76. ### Wasserstein Barycenters µ µ1 µ3 W2 (µ1 , µ )

W 2 (µ 2 ,µ ) µ2 W2 (µ3 ,µ ) Barycenters of measures (µk)k: P k k = 1 If µk = xk then µ? = P k kxk Generalizes Euclidean barycenter: µ2 µ1 µ3 µ2 µ1 Mc Cann’s displacement interpolation. µ? 2 argmin µ P k kW2 2 (µk, µ)
77. ### Wasserstein Barycenters µ µ1 µ3 W2 (µ1 , µ )

W 2 (µ 2 ,µ ) µ2 W2 (µ3 ,µ ) Barycenters of measures (µk)k: P k k = 1 If µk = xk then µ? = P k kxk Generalizes Euclidean barycenter: µ2 µ1 µ3 µ exists and is unique. Theorem: if µ1 does not vanish on small sets, [Agueh, Carlier, 2010] (for c(x, y) = ||x y||2) µ2 µ1 Mc Cann’s displacement interpolation. µ? 2 argmin µ P k kW2 2 (µk, µ)

79. ### 110 Dynamic Formulations –0 –1 / 5 –2 / 5

–3 / 5 –4 / 5 –1 Figure 7.2: Comparison of displacement interpolation (7.8) of discrete measures. Top: point clouds (empirical measures (–0, –1) with the same number of points). Bottom: same but with weight. For 0 < t < 1, the top example corresponds to an Displacement Interpolation

82. ### Regularized Barycenters min (⇡k)k,µ ( X k k (hc, ⇡k

i + "KL(⇡k |⇡0,k)) ; 8k, ⇡k 2 ⇧(µk, µ) )
83. ### Regularized Barycenters ! Need to ﬁx a discretization grid for

µ, i.e. choose (⇡0,k)k min (⇡k)k,µ ( X k k (hc, ⇡k i + "KL(⇡k |⇡0,k)) ; 8k, ⇡k 2 ⇧(µk, µ) )
84. ### Regularized Barycenters ! Need to ﬁx a discretization grid for

µ, i.e. choose (⇡0,k)k ! Sinkhorn-like algorithm [Benamou, Carlier, Cuturi, Nenna, Peyr´ e, 2015]. min (⇡k)k,µ ( X k k (hc, ⇡k i + "KL(⇡k |⇡0,k)) ; 8k, ⇡k 2 ⇧(µk, µ) )
85. ### Regularized Barycenters ! Need to ﬁx a discretization grid for

µ, i.e. choose (⇡0,k)k ! Sinkhorn-like algorithm [Benamou, Carlier, Cuturi, Nenna, Peyr´ e, 2015]. [Solomon et al, SIGGRAPH 2015] min (⇡k)k,µ ( X k k (hc, ⇡k i + "KL(⇡k |⇡0,k)) ; 8k, ⇡k 2 ⇧(µk, µ) )

µ6 µ1 µ2 0 1
90. ### Barycenter on a Surface 1 = (1, . . .

, 1)/6 µ1 µ2 µ3 µ4 µ5 µ6 µ1 µ2 0 1
91. ### Color Transfer µ ⌫ Input images: (f, g) (chrominance components)

Input measures: f g µ(A) = U(f 1(A)), ⌫(A) = U(g 1(A))
92. ### Color Transfer µ ⌫ Input images: (f, g) (chrominance components)

Input measures: f g µ(A) = U(f 1(A)), ⌫(A) = U(g 1(A))
93. ### Color Transfer µ ⌫ Input images: (f, g) (chrominance components)

Input measures: f T T f g ˜ T g µ(A) = U(f 1(A)), ⌫(A) = U(g 1(A))

95. ### Overview • Measures and Histograms • From Monge to Kantorovitch

Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein
96. ### Unbalanced Transport (⇠, µ) 2 M+(X)2, KL(⇠|µ) def. = R

X log ⇣ d⇠ dµ ⌘ dµ + R X (dµ d⇠) WFc(µ, ⌫) def. = min ⇡ hc, ⇡i + KL(P1]⇡|µ) + KL(P2]⇡|⌫) [Liereo, Mielke, Savar´ e 2015]
97. ### Unbalanced Transport [Liereo, Mielke, Savar´ e 2015] [Chizat, Schmitzer, Peyr´

e, Vialard 2015] Proposition: then WF1/2 c is a distance on M+(X). (⇠, µ) 2 M+(X)2, KL(⇠|µ) def. = R X log ⇣ d⇠ dµ ⌘ dµ + R X (dµ d⇠) If c(x, y) = log(cos(min(d(x, y), ⇡ 2 )) WFc(µ, ⌫) def. = min ⇡ hc, ⇡i + KL(P1]⇡|µ) + KL(P2]⇡|⌫) [Liereo, Mielke, Savar´ e 2015]
98. ### Unbalanced Transport ! “Dynamic” Benamou-Brenier formulation. [Liereo, Mielke, Savar´ e

2015][Kondratyev, Monsaingeon, Vorotnikov, 2015] [Liereo, Mielke, Savar´ e 2015] [Chizat, Schmitzer, Peyr´ e, Vialard 2015] Proposition: then WF1/2 c is a distance on M+(X). [Chizat, Schmitzer, P, Vialard 2015] (⇠, µ) 2 M+(X)2, KL(⇠|µ) def. = R X log ⇣ d⇠ dµ ⌘ dµ + R X (dµ d⇠) If c(x, y) = log(cos(min(d(x, y), ⇡ 2 )) WFc(µ, ⌫) def. = min ⇡ hc, ⇡i + KL(P1]⇡|µ) + KL(P2]⇡|⌫) [Liereo, Mielke, Savar´ e 2015] Balanced OT Unbalanced OT
99. ### p = + 1 p = 2 p = 1

Metric space (X, d), minimize F(x) on X. F(x) = ||x||2 on (X = R2, || · ||p) Implicit Euler step: xk+1 def. = argmin x2X d(xk, x)2 + ⌧F(x) xk+1 xk {x ; d(xk, x) ⇠ ⌧} xk+2 xk+3 Implicit Euler Stepping
100. ### xk+1 = argmin x2X d(xk, x)2 + ⌧hrF(xk), xi p

= + 1 p = 1 p = 2 p = + 1 p = 2 p = 1 xk+1 = argmin x2X d(xk, x)2 + ⌧F(x) Metric space (X, d), minimize F(x) on X. F(x) = ||x||2 on (X = R2, || · ||p) Implicit Explicit Implicit vs. Explicit Stepping
101. ### Wasserstein Gradient Flows Implicit Euler step: [Jordan, Kinderlehrer, Otto 1998]

µt+1 = ProxW ⌧f (µt) def. = argmin µ2M+(X) W2 2 (µt, µ) + ⌧f(µ)
102. ### Wasserstein Gradient Flows Implicit Euler step: [Jordan, Kinderlehrer, Otto 1998]

Formal limit ⌧ ! 0: @tµ = div (µr(f0(µ))) µt+1 = ProxW ⌧f (µt) def. = argmin µ2M+(X) W2 2 (µt, µ) + ⌧f(µ)
103. ### Wasserstein Gradient Flows Implicit Euler step: [Jordan, Kinderlehrer, Otto 1998]

Formal limit ⌧ ! 0: @tµ = div (µr(f0(µ))) (heat di↵usion) f(µ) = R log(dµ dx )dµ @tµ = µ µt+1 = ProxW ⌧f (µt) def. = argmin µ2M+(X) W2 2 (µt, µ) + ⌧f(µ)
104. ### Wasserstein Gradient Flows Implicit Euler step: [Jordan, Kinderlehrer, Otto 1998]

Formal limit ⌧ ! 0: @tµ = div (µr(f0(µ))) (advection) f(µ) = R wdµ @tµ = div(µrw) (heat di↵usion) f(µ) = R log(dµ dx )dµ @tµ = µ Evolution µt Evolution µt rw rw µt+1 = ProxW ⌧f (µt) def. = argmin µ2M+(X) W2 2 (µt, µ) + ⌧f(µ)
105. ### Wasserstein Gradient Flows Implicit Euler step: [Jordan, Kinderlehrer, Otto 1998]

Formal limit ⌧ ! 0: @tµ = div (µr(f0(µ))) (advection) f(µ) = R wdµ @tµ = div(µrw) (heat di↵usion) f(µ) = R log(dµ dx )dµ @tµ = µ (non-linear di↵usion) f(µ) = 1 m 1 R (dµ dx )m 1dµ @tµ = µm Evolution µt Evolution µt rw rw µt+1 = ProxW ⌧f (µt) def. = argmin µ2M+(X) W2 2 (µt, µ) + ⌧f(µ)

107. ### ˆ µn µ n ! +1 ˆ µn = 1

n Pn i=1 xi ˆ H(ˆ µn) def. = P i log(minj6=i ||xi xj ||) R log(dµ dx (x))dµ(x) H(µ) def. = n ! +1 Lagrangian Discretization of Entropy
108. ### time d⇢t dt = ⇢t + r(V ⇢t) min ⇢

E(⇢) def. = Z V (x)⇢(x)dx + Z ⇢(x) log(⇢(x))dx V (x) = ||x||2 Wasserstein ﬂow of E: Lagrangian Discretization of Gradient Flows
109. ### Generalized Entropic Regularization Primal: min ⇡ hdp, ⇡i + f1(P1]⇡)

+ f2(P2]⇡) + "KL(⇡|⇡0)
110. ### Generalized Entropic Regularization Primal: min ⇡ hdp, ⇡i + f1(P1]⇡)

+ f2(P2]⇡) + "KL(⇡|⇡0) Dual: (a, b) def. = (e u " , e v " ) max u,v f⇤ 1 (u) f⇤ 2 (v) "he u " , Ke v " i ⇡(x, y) = a(x)K(x, y)b(y)
111. ### Generalized Entropic Regularization Primal: min ⇡ hdp, ⇡i + f1(P1]⇡)

+ f2(P2]⇡) + "KL(⇡|⇡0) Block coordinates relaxation: max u f⇤ 1 (u) "he u " , Ke v " i max v f⇤ 2 (v) "he v " , K⇤e u " i (Iv) (Iu) Dual: (a, b) def. = (e u " , e v " ) max u,v f⇤ 1 (u) f⇤ 2 (v) "he u " , Ke v " i ⇡(x, y) = a(x)K(x, y)b(y)
112. ### Generalized Entropic Regularization Primal: min ⇡ hdp, ⇡i + f1(P1]⇡)

+ f2(P2]⇡) + "KL(⇡|⇡0) Block coordinates relaxation: max u f⇤ 1 (u) "he u " , Ke v " i max v f⇤ 2 (v) "he v " , K⇤e u " i (Iv) (Iu) Proposition: ProxKL f1/" (µ) def. = argmin ⌫f1(⌫) + "KL(⌫|µ) the solutions of (Iu) and (Iv) read: a = ProxKL f1/" (Kb) Kb b = ProxKL f2/" (K⇤a) K⇤a Dual: (a, b) def. = (e u " , e v " ) max u,v f⇤ 1 (u) f⇤ 2 (v) "he u " , Ke v " i ⇡(x, y) = a(x)K(x, y)b(y)
113. ### Generalized Entropic Regularization Primal: ! Only matrix-vector multiplications. ! Highly

parallelizable. ! On regular grids: only convolutions! Linear time iterations. min ⇡ hdp, ⇡i + f1(P1]⇡) + f2(P2]⇡) + "KL(⇡|⇡0) Block coordinates relaxation: max u f⇤ 1 (u) "he u " , Ke v " i max v f⇤ 2 (v) "he v " , K⇤e u " i (Iv) (Iu) Proposition: ProxKL f1/" (µ) def. = argmin ⌫f1(⌫) + "KL(⌫|µ) the solutions of (Iu) and (Iv) read: a = ProxKL f1/" (Kb) Kb b = ProxKL f2/" (K⇤a) K⇤a Dual: (a, b) def. = (e u " , e v " ) max u,v f⇤ 1 (u) f⇤ 2 (v) "he u " , Ke v " i ⇡(x, y) = a(x)K(x, y)b(y)
114. ### Gradient Flows: Crowd Motion µt+1 def. = argminµ W↵ ↵

(µt, µ) + ⌧f(µ) Congestion-inducing function: f(µ) = ◆[0,] (µ) + hw, µi [Maury, Roudne↵-Chupin, Santambrogio 2010]
115. ### Gradient Flows: Crowd Motion µt+1 def. = argminµ W↵ ↵

(µt, µ) + ⌧f(µ) Proposition: Prox1 " f (µ) = min(e "wµ, ) Congestion-inducing function: f(µ) = ◆[0,] (µ) + hw, µi [Maury, Roudne↵-Chupin, Santambrogio 2010]
116. ### Gradient Flows: Crowd Motion µt+1 def. = argminµ W↵ ↵

(µt, µ) + ⌧f(µ) Proposition: Prox1 " f (µ) = min(e "wµ, )  = ||µt=0 ||1  = 2||µt=0 ||1  = 4||µt=0 ||1 Congestion-inducing function: f(µ) = ◆[0,] (µ) + hw, µi [Maury, Roudne↵-Chupin, Santambrogio 2010] rw
117. ### Crowd Motion on a Surface Potential cos(w)  = ||µt=0

||1  = 6||µt=0 ||1 X = triangulated mesh.
118. ### Crowd Motion on a Surface Potential cos(w)  = ||µt=0

||1  = 6||µt=0 ||1 X = triangulated mesh.
119. ### Gradient Flows: Crowd Motion with Obstacles Potential cos(w)  =

||µt=0 ||1  = 2||µt=0 ||1  = 4||µt=0 ||1  = 6||µt=0 ||1 X = sub-domain of R2.
120. ### Gradient Flows: Crowd Motion with Obstacles Potential cos(w)  =

||µt=0 ||1  = 2||µt=0 ||1  = 4||µt=0 ||1  = 6||µt=0 ||1 X = sub-domain of R2.
121. ### Multiple-Density Gradient Flows (µ1,t+1, µ2,t+1) def. = argmin (µ1,µ2) W↵

↵ (µ1,t, µ1) + W↵ ↵ (µ2,t, µ2) + ⌧f(µ1, µ2)
122. ### Multiple-Density Gradient Flows Wasserstein attraction: rw (µ1,t+1, µ2,t+1) def. =

argmin (µ1,µ2) W↵ ↵ (µ1,t, µ1) + W↵ ↵ (µ2,t, µ2) + ⌧f(µ1, µ2) f(µ1, µ2) = W↵ ↵ (µ1, µ2) + h1(µ1) + h2(µ2) hi(µ) = hw, µi
123. ### Multiple-Density Gradient Flows Wasserstein attraction: rw (µ1,t+1, µ2,t+1) def. =

argmin (µ1,µ2) W↵ ↵ (µ1,t, µ1) + W↵ ↵ (µ2,t, µ2) + ⌧f(µ1, µ2) f(µ1, µ2) = W↵ ↵ (µ1, µ2) + h1(µ1) + h2(µ2) hi(µ) = ◆[0,] (µ). hi(µ) = hw, µi
124. ### Overview • Measures and Histograms • From Monge to Kantorovitch

Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein
125. ### Discriminative vs Generative Models Z x z X Z X

Low dimension High dimension Generative Discriminative g✓ d⇠ g✓ d⇠
126. ### Discriminative vs Generative Models Z x z X Z X

Low dimension High dimension Generative Discriminative g✓ d⇠ g✓ d⇠ classiﬁcation, z =class probability Supervised: ! Learn d⇠ from labeled data (xi, zi)i.
127. ### Discriminative vs Generative Models Z x z X Z X

Low dimension High dimension Generative Discriminative g✓ d⇠ g✓ d⇠ classiﬁcation, z =class probability Supervised: ! Learn d⇠ from labeled data (xi, zi)i. Compression: z = d⇠(x) is a representation. Generation: x = g✓(z) is a synthesis. Un-supervised: ! Learn (g✓, d⇠) from data (xi)i.
128. ### Discriminative vs Generative Models Z x z X Z X

Low dimension High dimension Generative Discriminative g✓ d⇠ g✓ d⇠ classiﬁcation, z =class probability Supervised: ! Learn d⇠ from labeled data (xi, zi)i. Compression: z = d⇠(x) is a representation. Generation: x = g✓(z) is a synthesis. Un-supervised: ! Learn (g✓, d⇠) from data (xi)i. Density ﬁtting g✓({zi }i) ⇡ {xi }i Auto-encoders g✓(d⇠(xi)) ⇡ xi
129. ### Discriminative vs Generative Models Z x z X Z X

Low dimension High dimension Generative Discriminative g✓ d⇠ g✓ d⇠ classiﬁcation, z =class probability Supervised: ! Learn d⇠ from labeled data (xi, zi)i. Compression: z = d⇠(x) is a representation. Generation: x = g✓(z) is a synthesis. Un-supervised: ! Learn (g✓, d⇠) from data (xi)i. Density ﬁtting g✓({zi }i) ⇡ {xi }i Auto-encoders g✓(d⇠(xi)) ⇡ xi Optimal transport map d⇠
130. ### Density Fitting and Generative Models Parametric model: ✓ 7! µ✓

µ✓ ✓ Observations: ⌫ = 1 n Pn i=1 xi
131. ### Density Fitting and Generative Models Parametric model: ✓ 7! µ✓

µ✓ ✓ Observations: ⌫ = 1 n Pn i=1 xi dµ✓(y) = f✓(y)dy Density ﬁtting: Maximum likelihood (MLE) min ✓ c KL(⌫|µ✓) def. = X j log(f✓(yj))
132. ### Density Fitting and Generative Models Parametric model: ✓ 7! µ✓

µ✓ ✓ Observations: ⌫ = 1 n Pn i=1 xi g✓ µ✓ X Z ⇣ Generative model ﬁt: µ✓ = g✓,]⇣ ! MLE undeﬁned. ! Need a weaker metric. c KL(µ✓ |⌫) = +1 dµ✓(y) = f✓(y)dy Density ﬁtting: Maximum likelihood (MLE) min ✓ c KL(⌫|µ✓) def. = X j log(f✓(yj))
133. ### Loss Functions for Measures Optimal Transport Distances µ = 1

N P i xi W(µ, ⌫)p def. = min T 2Cµ,⌫ P i,j Ti,j ||xi yj ||p ⌫ = 1 P P j yj Density ﬁtting: min ✓ D(µ✓, ⌫) T µ ⌫
134. ### Loss Functions for Measures Optimal Transport Distances µ = 1

N P i xi W(µ, ⌫)p def. = min T 2Cµ,⌫ P i,j Ti,j ||xi yj ||p ⌫ = 1 P P j yj Maximum Mean Discrepancy (MMD) ||µ ⌫||2 k def. = 1 N2 X i,i0 k(xi, xi0 ) + 1 P2 X j,j0 k(yj, yj0 ) 2 NP X i,j k(xi, yj) Gaussian: k(x, y) = e | |x y| |2 2 2 . Energy distance: k(x, y) = ||x y||2. Density ﬁtting: min ✓ D(µ✓, ⌫) T µ ⌫ k µ ⌫
135. ### Loss Functions for Measures Optimal Transport Distances µ = 1

N P i xi W(µ, ⌫)p def. = min T 2Cµ,⌫ P i,j Ti,j ||xi yj ||p ⌫ = 1 P P j yj Maximum Mean Discrepancy (MMD) ||µ ⌫||2 k def. = 1 N2 X i,i0 k(xi, xi0 ) + 1 P2 X j,j0 k(yj, yj0 ) 2 NP X i,j k(xi, yj) Gaussian: k(x, y) = e | |x y| |2 2 2 . Energy distance: k(x, y) = ||x y||2. Density ﬁtting: min ✓ D(µ✓, ⌫) T µ ⌫ k µ ⌫ µ ⌫ T" ¯ W"(µ, ⌫)p def. = W"(µ, ⌫)p 1 2 W"(µ, µ)p 1 2 W"(⌫, ⌫)p W"(µ, ⌫)p def. = P i,j T" i,j ||xi yj ||p Sinkhorn divergences [Cuturi 13]
136. ### Loss Functions for Measures Optimal Transport Distances µ = 1

N P i xi W(µ, ⌫)p def. = min T 2Cµ,⌫ P i,j Ti,j ||xi yj ||p ⌫ = 1 P P j yj Maximum Mean Discrepancy (MMD) ||µ ⌫||2 k def. = 1 N2 X i,i0 k(xi, xi0 ) + 1 P2 X j,j0 k(yj, yj0 ) 2 NP X i,j k(xi, yj) Gaussian: k(x, y) = e | |x y| |2 2 2 . Energy distance: k(x, y) = ||x y||2. for k(x, y) = ||x y||p Density ﬁtting: min ✓ D(µ✓, ⌫) Theorem: [Ramdas, G.Trillos, Cuturi 17] "!+1 ! ||µ ⌫||2 k ¯ W"(µ, ⌫)p "!0 ! W(µ, ⌫)p T µ ⌫ k µ ⌫ µ ⌫ T" ¯ W"(µ, ⌫)p def. = W"(µ, ⌫)p 1 2 W"(µ, µ)p 1 2 W"(⌫, ⌫)p W"(µ, ⌫)p def. = P i,j T" i,j ||xi yj ||p Sinkhorn divergences [Cuturi 13]
137. ### Loss Functions for Measures Optimal Transport Distances µ = 1

N P i xi W(µ, ⌫)p def. = min T 2Cµ,⌫ P i,j Ti,j ||xi yj ||p ⌫ = 1 P P j yj Maximum Mean Discrepancy (MMD) ||µ ⌫||2 k def. = 1 N2 X i,i0 k(xi, xi0 ) + 1 P2 X j,j0 k(yj, yj0 ) 2 NP X i,j k(xi, yj) Gaussian: k(x, y) = e | |x y| |2 2 2 . Energy distance: k(x, y) = ||x y||2. for k(x, y) = ||x y||p Density ﬁtting: min ✓ D(µ✓, ⌫) Theorem: [Ramdas, G.Trillos, Cuturi 17] "!+1 ! ||µ ⌫||2 k ¯ W"(µ, ⌫)p "!0 ! W(µ, ⌫)p T µ ⌫ k µ ⌫ – Scale free (no , no heavy tail kernel). – Non-Euclidean, arbitrary ground distance. – Less biased gradient. – No curse of dimension (low sample complexity). Best of both worlds: ! cross-validate " µ ⌫ T" ¯ W"(µ, ⌫)p def. = W"(µ, ⌫)p 1 2 W"(µ, µ)p 1 2 W"(⌫, ⌫)p W"(µ, ⌫)p def. = P i,j T" i,j ||xi yj ||p Sinkhorn divergences [Cuturi 13]
138. ### Deep Discriminative vs Generative Models ✓1 ✓2 ⇢ ⇢ z

x g✓ ⇢ ⇢ z x Discriminative Generative d⇠ ⇠1 ⇠2 g✓(z) = ⇢(✓K(. . . ⇢(✓2(⇢(✓1(z) . . .) Deep networks: d⇠(x) = ⇢(⇠K(. . . ⇢(⇠2(⇢(⇠1(x) . . .)
139. ### Deep Discriminative vs Generative Models ✓1 ✓2 ⇢ ⇢ z

x g✓ ⇢ ⇢ z x Discriminative Generative z1 z2 g✓ Z x z X d⇠ d⇠ ⇠1 ⇠2 g✓(z) = ⇢(✓K(. . . ⇢(✓2(⇢(✓1(z) . . .) Deep networks: d⇠(x) = ⇢(⇠K(. . . ⇢(⇠2(⇢(⇠1(x) . . .)

141. ### Overview • Measures and Histograms • From Monge to Kantorovitch

Formulations • Entropic Regularization and Sinkhorn • Barycenters • Unbalanced OT and Gradient Flows • Minimum Kantorovitch Estimators • Gromov-Wasserstein

143. ### unregistered spaces Gromov-Wasserstein Inputs: {(similarity/kernel matrix, histogram)} X Y [Memoli

2011] Def. Gromov-Wasserstein distance: [Sturm 2012]
144. ### unregistered spaces Gromov-Wasserstein Inputs: {(similarity/kernel matrix, histogram)} X Y !

need for a fast approximate solver. ! NP-hard in general. [Memoli 2011] Def. Gromov-Wasserstein distance: [Sturm 2012]

146. ### Y xi X Gromov-Wasserstein as a Metric yj X Y

f f X Y Def. () Isometries on M: Def.
147. ### Y xi X Gromov-Wasserstein as a Metric yj ! “bending-invariant”

objects recognition. X Y f f X Y Def. () Isometries on M: Def. Prop. [Memoli 2011]
148. ### X Y For Arbitrary Spaces Metric-measure spaces (X, Y ):

(dX , µ), (dY , ⌫)
149. ### X Y For Arbitrary Spaces d X (x,x 0 )

dY (y, y0) Def. Gromov-Wasserstein distance: [Memoli 2011] [Sturm 2012] GW2 2 (dX , µ, dY , ⌫) def. = min ⇡2⇧(µ,⌫) Z X2⇥Y 2 |dX(x, x0) dY (y, y0)|2d⇡(x, y)d⇡(x0, y0) Metric-measure spaces (X, Y ): (dX , µ), (dY , ⌫)
150. ### X Y For Arbitrary Spaces f X Y Prop. GW

is a distance on mm-spaces/isometries. ! “bending-invariant” objects recognition. d X (x,x 0 ) dY (y, y0) Def. Gromov-Wasserstein distance: [Memoli 2011] [Sturm 2012] GW2 2 (dX , µ, dY , ⌫) def. = min ⇡2⇧(µ,⌫) Z X2⇥Y 2 |dX(x, x0) dY (y, y0)|2d⇡(x, y)d⇡(x0, y0) Metric-measure spaces (X, Y ): (dX , µ), (dY , ⌫)
151. ### X Y For Arbitrary Spaces f X Y Prop. GW

is a distance on mm-spaces/isometries. ! need for a fast approximate solver. ! “bending-invariant” objects recognition. ! QAP: NP-hard in general. d X (x,x 0 ) dY (y, y0) Def. Gromov-Wasserstein distance: [Memoli 2011] [Sturm 2012] GW2 2 (dX , µ, dY , ⌫) def. = min ⇡2⇧(µ,⌫) Z X2⇥Y 2 |dX(x, x0) dY (y, y0)|2d⇡(x, y)d⇡(x0, y0) Metric-measure spaces (X, Y ): (dX , µ), (dY , ⌫)

153. ### Entropic Gromov Wasserstein Projected mirror descent: Def. Entropic Gromov-Wasserstein Projected

mirror descent: Def. Prop. for ⌧ = 1/", the iteration reads repeat: until convergence. initialization: return T func T = GW(C, ¯ C, p, q)
154. ### phine Vladimir G. Kim Adobe Research Suvrit Sra MIT Source

Targets Figure 1: Entropic GW can ﬁnd correspondences between a source surface (left) and a surface with similar structure, a surface with shared semantic structure, a noisy 3D point cloud, an icon, and a hand drawing. Each fuzzy map was computed using the same code. In this paper, we propose a new correspondence algorithm that minimizes distortion of long- and short-range distances alike. We study an entropically-regularized version of the Gromov-Wasserstein (GW) mapping objective function from [M´ emoli 2011] measuring the distortion of geodesic distances. The optimizer is a probabilistic matching expressed as a “fuzzy” correspondence matrix in the style of [Kim et al. 2012; Solomon et al. 2012]; we control sharpness of the correspondence via the weight of the entropic regularizer. 0 0.02 0.04 0 0.02 0 0.02 Teddies Humans Four-legged Armadillo Figure 15: MDS embedding of four classes from SHREC dataset. 0 0.5 1 0 0.5 1 1 5 10 15 20 25 30 35 40 45 PCA 1 PCA 2 Figure 16: Recovery of galloping horse sequence. 0 is the base shape) as a feature vector for shape i. We reproduce the result presented in the work of Rustamov et al., recovering the circular structure of meshes from a galloping horse animation sequence (Figure 16). Unlike Rustamov et al., however, our method does not require ground truth maps between shapes as input. 5.2 Supervised Matching An important feature of a matching tool is the ability to incorporate user input, e.g. ground truth matches of points or regions. In the GWα framework, one way to enforce these constraints is to provide a stencil S specifying a sparsity pattern for the map Γ. Incorporating Figure 18: Mapping a set of 185 images onto a two shapes while preserving color similarity. (Images from Flickr public domain collection.) Rn0×n0 + and D ∈ Rn×n + we are given symmetric weight matrices W0 ∈ Rn0×n0 + and W ∈ Rn×n + . We could solve a weighted version of the GWα matching problem (3) that prioritizes maps preserving distances corresponding to large W values: min Γ∈M ijkℓ (D0ij −Dkℓ )2ΓikΓjℓW0ijWjℓµ0iµ0jµkµℓ . (8) For instance, (W0, W) might contain conﬁdence values expressing the quality of the entries of (D0, D). Or, W0, W could take values in {ε, 1} reducing the weight of distances that are unknown or do not need to be preserved by Γ. Following the same simpliﬁcations as §3.1, we can optimize this objective by minimizing ⟨Γ, ΛW (Γ)⟩, where ΛW (Γ) := 1 2 [D∧2 0 ⊗ W0][[µ0 ]]Γ[[µ]]W − [D0 ⊗ W0][[µ0 ]]Γ[[µ]][D ⊗ W] + 1 2 W0[[µ0 ]]Γ[[µ]][D∧2 ⊗ W] Applications of GW: Shapes Analysis Use T to deﬁne registration between: Colors distribution Shape Shape Shape
155. ### phine Vladimir G. Kim Adobe Research Suvrit Sra MIT Source

Targets Figure 1: Entropic GW can ﬁnd correspondences between a source surface (left) and a surface with similar structure, a surface with shared semantic structure, a noisy 3D point cloud, an icon, and a hand drawing. Each fuzzy map was computed using the same code. In this paper, we propose a new correspondence algorithm that minimizes distortion of long- and short-range distances alike. We study an entropically-regularized version of the Gromov-Wasserstein (GW) mapping objective function from [M´ emoli 2011] measuring the distortion of geodesic distances. The optimizer is a probabilistic matching expressed as a “fuzzy” correspondence matrix in the style of [Kim et al. 2012; Solomon et al. 2012]; we control sharpness of the correspondence via the weight of the entropic regularizer. 0 0.02 0.04 0 0.02 0 0.02 Teddies Humans Four-legged Armadillo Figure 15: MDS embedding of four classes from SHREC dataset. 0 0.5 1 0 0.5 1 1 5 10 15 20 25 30 35 40 45 PCA 1 PCA 2 Figure 16: Recovery of galloping horse sequence. 0 is the base shape) as a feature vector for shape i. We reproduce the result presented in the work of Rustamov et al., recovering the circular structure of meshes from a galloping horse animation sequence (Figure 16). Unlike Rustamov et al., however, our method does not require ground truth maps between shapes as input. 5.2 Supervised Matching An important feature of a matching tool is the ability to incorporate user input, e.g. ground truth matches of points or regions. In the GWα framework, one way to enforce these constraints is to provide a stencil S specifying a sparsity pattern for the map Γ. Incorporating Figure 18: Mapping a set of 185 images onto a two shapes while preserving color similarity. (Images from Flickr public domain collection.) Rn0×n0 + and D ∈ Rn×n + we are given symmetric weight matrices W0 ∈ Rn0×n0 + and W ∈ Rn×n + . We could solve a weighted version of the GWα matching problem (3) that prioritizes maps preserving distances corresponding to large W values: min Γ∈M ijkℓ (D0ij −Dkℓ )2ΓikΓjℓW0ijWjℓµ0iµ0jµkµℓ . (8) For instance, (W0, W) might contain conﬁdence values expressing the quality of the entries of (D0, D). Or, W0, W could take values in {ε, 1} reducing the weight of distances that are unknown or do not need to be preserved by Γ. Following the same simpliﬁcations as §3.1, we can optimize this objective by minimizing ⟨Γ, ΛW (Γ)⟩, where ΛW (Γ) := 1 2 [D∧2 0 ⊗ W0][[µ0 ]]Γ[[µ]]W − [D0 ⊗ W0][[µ0 ]]Γ[[µ]][D ⊗ W] + 1 2 W0[[µ0 ]]Γ[[µ]][D∧2 ⊗ W] Applications of GW: Shapes Analysis 0 0.02 0.04 0 0.02 0 0.02 Te Hu Fo Ar Figure 1: The database that has been used, divid MDS in 3-D Use T to deﬁne registration between: Colors distribution Shape Shape Shape Geodesic distances GW distances MDS Vizualization Shapes (Xs)s 0 0.02 0.04 0 0.02 0 0.02 Teddies Humans Four-legged Armadillo Figure 15: MDS embedding of four classes from SHREC dataset. 0 0.5 1 1 5 10 15 20 25 30 35 40 45 PCA 2 Figure 18: Mapping a set of 185 images onto a two shapes while preserving color similarity. (Images from Flickr public domain collection.) Rn0×n0 + and D ∈ Rn×n + we are given symmetric weight matrices W0 ∈ Rn0×n0 + and W ∈ Rn×n + . We could solve a weighted version of the GWα matching problem (3) that prioritizes maps preserving distances corresponding to large W values: MDS in 2-D
156. ### Applications of GW: Quantum Chemistry Regression problem: ! f by

solving DFT approximation is too costly.
157. ### Applications of GW: Quantum Chemistry Regression problem: ! f by

solving DFT approximation is too costly. [Rupp et al 2012] 2 6 6 6 6 4 3 7 7 7 7 5
158. ### Applications of GW: Quantum Chemistry Regression problem: ! f by

solving DFT approximation is too costly. GW-interpolation: [Rupp et al 2012] 2 6 6 6 6 4 3 7 7 7 7 5

161. ### Gromov-Wasserstein Geodesics Def. Gromov-Wasserstein Geodesic ! X ⇥ Y is

not practical for most applications. (need to ﬁx the size of the geodesic embedding space) ! Extension to more than 2 input spaces? Prop. [Sturm 2012]

163. ### Gromov-Wasserstein Barycenters Input: 1 2 3 Def. GW Barycenters repeat:

until convergence. initialization: C C0 for s = 1 to S do return C Alternating minimization: On Ts
164. ### Conclusion: Toward High-dimensional OT Monge Kantorovich Dantzig Brenier Otto McCann

Villani ansport framework Sliced Wasserstein projection Applications lication to Color Transfer Source image (X) Style image (Y) Sliced Wasserstein projection of X to style image color statistics Y Source image after color transfer J. Rabin Wasserstein Regularization