
# Numerical Optimal Transport

Slides for a course.

## Gabriel Peyré

December 07, 2017

## Transcript

1. Numerical Optimal Transport
Gabriel Peyré
École Normale Supérieure
… with some demos
www.numerical-tours.com
http://optimaltransport.github.io

2. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

3. Comparing Measures and Spaces
[Figure: color transfer. Source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of Y, source image after color transfer. J. Rabin, Wasserstein Regularization]
→ images, vision, graphics and machine learning, …
• Probability distributions and histograms

4. Comparing Measures and Spaces
(Same as slide 3, adding:)
• Optimal transport
[Figure: optimal transport mean vs. L2 mean.]

5. Probability Measures
Positive Radon measure µ on a set X.
Density: dµ(x) = m(x)dx
Measure of sets A ⊂ X: µ(A) = ∫_A dµ(x) ≥ 0
Discrete measure: µ = Σ_i µ_i δ_{x_i}
[Figure 2.1 of the course notes: discrete distributions α = Σ_{i=1}^n a_i δ_{x_i} (red: empirical uniform distribution a_i = 1/n, blue: arbitrary distributions) and densities dα(x) = ρ_α(x)dx w.r.t. the Lebesgue measure, often denoted ρ_α = dα/dx (in violet), in both 1-D and 2-D. Discrete distributions are drawn as vertical segments of length a_i in 1-D and as point clouds of radius a_i in 2-D.]

6. Probability Measures
(Same as slide 5, adding:)
Integration against continuous functions: ∫_X g(x)dµ(x) ≥ 0 (for g ≥ 0)
Density case: ∫_X g dµ = ∫_X g(x) m(x) dx
Discrete case: ∫_X g dµ = Σ_i µ_i g(x_i)

7. Probability Measures
(Same as slide 6, adding:)
Probability (normalized) measure: µ(X) = ∫_X dµ(x) = 1

8. Measures and Random Variables
P(X ∈ A) = ∫_A dµ(x)
Convergence in law: P(X_n ∈ A) → P(X ∈ A) as n → +∞, ∀ set A
Weak* convergence: ∫ f dµ_n → ∫ f dµ as n → +∞, ∀ continuous function f

9. Measures and Random Variables
(Same as slide 8, adding:)
Weak convergence: …

10. Discretization: Histogram vs. Empirical
Discrete measure: µ = Σ_{i=1}^N µ_i δ_{x_i}, x_i ∈ X, Σ_i µ_i = 1
Lagrangian (point clouds): constant weights µ_i = 1/N, free positions; quotient space X^N/Σ_N.
Eulerian (histograms): fixed positions x_i (e.g. a grid), weights in the convex polytope (simplex) {(µ_i)_i ≥ 0 ; Σ_i µ_i = 1}.
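
As a concrete illustration (a minimal numpy sketch of my own, not from the slides): the same underlying measure represented as a Lagrangian point cloud with constant weights, and as an Eulerian histogram with fixed grid positions and free weights on the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.random((1000, 2))          # draws from an underlying measure on [0,1]^2

# Lagrangian (point cloud): free positions x_i, constant weights 1/N.
N = 100
lagrangian_x = samples[:N]
lagrangian_w = np.full(N, 1.0 / N)

# Eulerian (histogram): fixed grid positions, free weights on the simplex.
bins = 16
hist, _, _ = np.histogram2d(samples[:, 0], samples[:, 1],
                            bins=bins, range=[[0, 1], [0, 1]])
eulerian_w = (hist / hist.sum()).ravel() # (mu_i)_i >= 0, sum_i mu_i = 1
```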

11. Push Forward
Radon measures (µ, ν) on (X, Y).
Transfer of measure by f : X → Y: push-forward ν = f♯µ, defined by
ν(A) := µ(f⁻¹(A))  ⟺  ∫_Y g(y)dν(y) := ∫_X g(f(x))dµ(x)
[Figure: y = f(x); f maps X (carrying µ) to Y (carrying ν); sets A and f⁻¹(A).]

12. Push Forward
(Same as slide 11, adding:)
Smooth densities dµ = ρ(x)dx, dν = ξ(x)dx:
f♯µ = ν  ⟺  ξ(f(x)) |det(∂f(x))| = ρ(x)

13. Push-forward vs. Pull-back
f : X → Y
Measures, push-forward f♯ : M(X) → M(Y): f♯(Σ_i µ_i δ_{x_i}) := Σ_i µ_i δ_{f(x_i)}
Functions, pull-back f^♯ : C(Y) → C(X): f^♯φ := φ ∘ f
Remark: f♯ and f^♯ are adjoint: ∫_Y φ d(f♯µ) = ∫_X (f^♯φ) dµ
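
A numerical check of this adjunction (a sketch; the toy choices of f and φ are my own, the identity is the one on the slide):

```python
import numpy as np

f   = lambda x: x ** 2                  # map f : X -> Y
phi = lambda y: np.cos(y)               # test function phi in C(Y)

x  = np.linspace(0.0, 1.0, 200)         # support of mu = sum_i mu_i delta_{x_i}
mu = np.full(x.size, 1.0 / x.size)

# Push-forward: f#mu = sum_i mu_i delta_{f(x_i)} (points move, weights stay).
push_points = f(x)
lhs = np.sum(mu * phi(push_points))     # integral of phi against f#mu
rhs = np.sum(mu * phi(f(x)))            # integral of the pull-back phi∘f against mu
assert np.isclose(lhs, rhs)             # the two operators are adjoint
```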

14. Convergence of Random Variables
Almost sure: P(lim_{n→+∞} X_n = X) = 1
In mean: lim_{n→+∞} E(|X_n − X|^p) = 0
In probability: ∀ ε > 0, P(|X_n − X| > ε) → 0 as n → +∞
In law: P(X_n ∈ A) → P(X ∈ A) (the X_n can be defined on different spaces)
Implications: almost sure ⟹ in probability ⟹ in law; in mean ⟹ in probability.

15. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

16. Gaspard Monge (1746-1818)
(1784)

17. Monge Transport
min_{ν = f♯µ} ∫_X c(x, f(x)) dµ(x)
[Figure: map f from X (µ) to Y (ν).]

18. Monge Transport
(Same as slide 17, adding:)
Theorem [Brenier]: for c(x, y) = ‖x − y‖², (µ, ν) with density, there exists a unique optimal f. One has f = ∇φ where φ is the unique convex function such that (∇φ)♯µ = ν.

19. Monge Transport
(Same as slide 18, adding:)
Monge-Ampère equation: ξ(∇φ) det(∂²φ) = ρ

20. Monge Transport
(Same as slide 19, adding:)
Non-uniqueness / non-existence:
[Figure: examples of non-unique optimal maps f, f′ and of measures that no map can transport.]

21. Leonid Kantorovich (1912-1986)
Леонид Витальевич Канторович

22. Before Kantorovitch
Optimal transport was formulated in 1930 by A.N. Tolstoi, 12 years before Kantorovich. He even solved a "large scale" 10×68 instance!

23. Kantorovitch's Formulation
Points (x_i)_i, (y_j)_j with input distributions p, q.
Def. Couplings: C_{p,q} := {T ≥ 0 ; Σ_j T_{i,j} = p_i, Σ_i T_{i,j} = q_j}

24. Kantorovitch's Formulation
(Same as slide 23, adding:)
Def. Wasserstein Distance / EMD [Kantorovich 1942]: min_{T ∈ C_{p,q}} Σ_{i,j} T_{i,j} c(x_i, y_j)

25. OT Between General Measures
Marginals: P1♯π(S) := π(S, X), P2♯π(S) := π(X, S)
Couplings: Π(µ, ν) := {π ∈ M₊(X × X) ; P1♯π = µ, P2♯π = ν}
Optimal transport [Kantorovitch 1942]:
W_p^p(µ, ν) := min { ⟨dᵖ, π⟩ = ∫_{X×X} d(x, y)^p dπ(x, y) ; π ∈ Π(µ, ν) }

26. Couplings: the 3 Settings
[Figure: couplings between (α, β) in the discrete, continuous, and semi-discrete settings.]

27. Couplings

28. 1-D Optimal Transport
Remark. If Ω = ℝ, c(x, y) = c(|x − y|) with c convex, and F_µ⁻¹, F_ν⁻¹ are the quantile functions, then
W(µ, ν) = ∫₀¹ c(|F_µ⁻¹(x) − F_ν⁻¹(x)|) dx
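
On the line this formula turns optimal transport into a sort: for uniform empirical measures with the same number of points, the quantile functions evaluated at the levels k/n are the sorted samples. A minimal sketch (the toy data is assumed):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p^p between mu = mean of delta_{x_i} and nu = mean of delta_{y_j}
    (same number of points): sorting evaluates F_mu^{-1}, F_nu^{-1} at k/n."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)

rng = np.random.default_rng(0)
mu_pts = rng.normal(0.0, 1.0, size=1000)
nu_pts = rng.normal(2.0, 0.5, size=1000)
print(wasserstein_1d(mu_pts, nu_pts))   # ≈ (0-2)^2 + (1-0.5)^2 = 4.25 for Gaussians
```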

29. OT Between Gaussians
Remark. If Ω = ℝᵈ, c(x, y) = ‖x − y‖², and µ = N(m_µ, Σ_µ), ν = N(m_ν, Σ_ν), then
W₂²(µ, ν) = ‖m_µ − m_ν‖² + B(Σ_µ, Σ_ν)²
where B is the Bures metric
B(Σ_µ, Σ_ν)² = trace(Σ_µ + Σ_ν − 2(Σ_µ^{1/2} Σ_ν Σ_µ^{1/2})^{1/2}).
The map T : x ↦ m_ν + A(x − m_µ) is optimal, where
A = Σ_µ^{−1/2} (Σ_µ^{1/2} Σ_ν Σ_µ^{1/2})^{1/2} Σ_µ^{−1/2}.
[Figure: 1-D Gaussian densities ρ_α, ρ_β, with density (1/√(2πσ²)) e^{−(x−m)²/(2σ²)}.]
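
The closed form is easy to evaluate; a sketch using scipy's matrix square root (the inputs are a toy example of mine):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussians(m1, S1, m2, S2):
    """W2^2(N(m1,S1), N(m2,S2)) = ||m1 - m2||^2 + Bures(S1, S2)^2."""
    S1h = sqrtm(S1)
    cross = np.real(sqrtm(S1h @ S2 @ S1h))   # (S1^{1/2} S2 S1^{1/2})^{1/2}
    bures2 = np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sum((m1 - m2) ** 2) + bures2)

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(w2_gaussians(m1, S1, m2, S2))
```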

30. OT on Gaussians and Bures' Distance

31. W1 OT and Min-cost Flows

32. Metrics on the Space of Measures
Bins-to-bins metrics, for dµ(x) = ρ(x)dx and dµ̃(x) = ρ̃(x)dx:
Kullback-Leibler divergence: D_KL(µ, µ̃) = ∫ ρ(x) log(ρ(x)/ρ̃(x)) dx
Hellinger distance: D_H(µ, µ̃)² = ∫ (√ρ(x) − √ρ̃(x))² dx

33. Metrics on the Space of Measures
(Same as slide 32, adding:)
Effect of a translation µ_τ of µ: D(µ, µ_τ) ≈ cst, while W₂(µ, µ_τ) = τ.

34. Csiszár Divergence vs. Dual Norms
Csiszár divergences: D_φ(α|β) := ∫_X φ(dα/dβ) dβ
Dual norms: ‖α − β‖_B := max_{f ∈ B} ∫_X f(x)(dα(x) − dβ(x))
D_φ → strong topology: KL, TV, χ², Hellinger, …
‖·‖_B → weak topology: W₁, flat norm, RKHS, energy distance, …

35. Csiszár Divergence
Csiszár divergences: a unifying way to define losses between arbitrary positive measures (discrete & densities). https://en.wikipedia.org/wiki/F-divergence

36. Dual Norms
Dual norms (aka Integral Probability Metrics):
‖α − β‖_B := max { ∫_X f(x)(dα(x) − dβ(x)) ; f ∈ B }
Wasserstein-1: B = {f ; ‖∇f‖_∞ ≤ 1}.
Flat norm: B = {f ; ‖f‖_∞ ≤ 1, ‖∇f‖_∞ ≤ 1}.
RKHS: B = {f ; ‖f‖_k ≤ 1}, for which
‖α − β‖_B² = ∫ k(x, x′)dα(x)dα(x′) + ∫ k(y, y′)dβ(y)dβ(y′) − 2 ∫ k(x, y)dα(x)dβ(y)
Energy distance: k(x, y) = −‖x − y‖. Gaussian: k(x, y) = e^{−‖x−y‖²/(2σ²)}.
[Figure: the four norms (Energy, Gauss, W₁, Flat) compared on translated measures (shifts t and t/2).]
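
The RKHS case has a closed form; below is a minimal sketch of ‖α − β‖²_k for uniform point clouds, with the two kernels from the slide (the data is a toy assumption):

```python
import numpy as np

def mmd2(x, y, k):
    """||alpha - beta||_k^2 for alpha = mean of delta_{x_i}, beta = mean of delta_{y_j}."""
    kxx = k(x[:, None, :], x[None, :, :]).mean()
    kyy = k(y[:, None, :], y[None, :, :]).mean()
    kxy = k(x[:, None, :], y[None, :, :]).mean()
    return kxx + kyy - 2.0 * kxy

energy = lambda a, b: -np.linalg.norm(a - b, axis=-1)                # energy distance
gauss  = lambda a, b: np.exp(-np.sum((a - b) ** 2, -1) / 2.0)        # Gaussian, sigma = 1

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(1.0, 1.0, size=(300, 2))
print(mmd2(x, y, energy), mmd2(x, y, gauss))
```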

37. RKHS Norms aka Maximum Mean Discrepancy
38. The Earth Mover's Distance
dist(I₁, I₂) = W₁(µ, ν)   [Rubner '98]
[Figure: color histograms µ, ν of two images I₁, I₂.]

39. 29
The Word Mover’s Distance
[Kusner’15] dist(D1, D2) = W2(µ, ⌫)
µ

40. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Linear Programming and Semi-discrete
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

41. Algorithms
Linear programming: µ = Σ_{i=1}^{N₁} p_i δ_{x_i}, ν = Σ_{j=1}^{N₂} q_j δ_{y_j}

42. Algorithms
(Same as slide 41, adding:)
Cost ∼ O(N³).
Hungarian/Auction: µ = (1/N) Σ_{i=1}^N δ_{x_i}, ν = (1/N) Σ_{j=1}^N δ_{y_j}

43. Algorithms
(Same as slide 42, with an illustration of the assignment.)

44. Algorithms
(Same as slide 43, adding:)
Monge-Ampère / Benamou-Brenier, d = ‖·‖₂².
[Excerpt from a performance study of solvers: the network simplex behaves in O(n²) in this context, a major gain at the scale of thousands of particles; fixed-point arithmetic further speeds up the computation (the final transport cost is then less accurate, but the particle pairing driving the interpolation is unchanged). Current EMD image-retrieval and color-transfer techniques rely on slower solvers (Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003). Figure 6: log-log plot of running times of Network Simplex (fixed point), Network Simplex (double precision) and Transport Simplex vs. number of bins per histogram, against reference curves y = αx² and y = βx³. Figure 7: synthetic 2-D examples on a Euclidean domain, with interpolations for α = 1/4, 1/2, 3/4 between the input distributions.]

45. Algorithms
(Same as slide 44, adding:)
Semi-discrete: Laguerre cells, d = ‖·‖₂². [Lévy '15]

46. Algorithms
(Same as slide 45.)

47. Algorithms
(Same as slide 46, adding:)
→ Need for fast approximate algorithms for generic c.
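
For small problems the Kantorovitch LP can be handed to a generic solver; a sketch with scipy (dedicated network-simplex codes are far faster at scale, as the excerpt above notes):

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(p, q, C):
    """min_{T >= 0, T1 = p, T^T 1 = q} <C, T>; returns the optimal coupling T."""
    n1, n2 = C.shape
    A_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):
        A_eq[i, i * n2:(i + 1) * n2] = 1.0   # row sums:    sum_j T_ij = p_i
    for j in range(n2):
        A_eq[n1 + j, j::n2] = 1.0            # column sums: sum_i T_ij = q_j
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.x.reshape(n1, n2)

x = np.linspace(0, 1, 5)[:, None]
y = np.linspace(0, 1, 4)[:, None]
C = (x - y.T) ** 2                           # squared-distance cost
p, q = np.full(5, 1 / 5), np.full(4, 1 / 4)
T = ot_lp(p, q, C)
print(T.sum(axis=1), (C * T).sum())          # marginals ≈ p; transport cost
```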

48. C-Transform
Alternating maximization of the dual E(f, g) over f or g leads to the notion of c-transform:
∀ y ∈ Y, f^c(y) := inf_{x ∈ X} c(x, y) − f(x)
∀ x ∈ X, g^c̄(x) := inf_{y ∈ Y} c(x, y) − g(y),  where c̄(y, x) := c(x, y).
Indeed, one can check that f^c ∈ argmax_g E(f, g) and g^c̄ ∈ argmax_f E(f, g). These partial maximizations extend the dual potentials from the supports of α and β to the whole spaces X and Y. When X = ℝᵈ and c(x, y) = ‖x − y‖^p, the c-transform f^c is the inf-convolution between −f and ‖·‖^p, often called a "Hopf-Lax formula".
The map (f, g) ∈ C(X) × C(Y) ↦ (g^c̄, f^c) replaces potentials by "better" ones (improving the dual objective E). Functions of the form f^c and g^c̄ are called c-concave and c̄-concave; in the special case c(x, y) = ⟨x, y⟩ on X = Y = ℝᵈ this coincides with the usual notion of concave functions. Iterating the map, the dual can be reformulated as an unconstrained program over a single potential:
L_c(α, β) = max_{f ∈ C(X)} ∫_X f(x)dα(x) + ∫_Y f^c(y)dβ(y) = max_{g ∈ C(Y)} ∫_X g^c̄(x)dα(x) + ∫_Y g(y)dβ(y).
Semi-discrete formulation: when β = Σ_j b_j δ_{y_j} is discrete, the c̄-transform restricts to the support (y_j)_j of β:
∀ g ∈ ℝᵐ, ∀ x ∈ X, g^c̄(x) := min_j c(x, y_j) − g_j,
which maps a vector g to a continuous function g^c̄ ∈ C(X).
[Figure 5.1 of the course notes: semi-discrete c̄-transforms g^c̄ in 1-D, for ground costs |x − y|^p with p = 1/2, 1, 3/2, 2.]

49. Semi-discrete Descent Algorithm
[Figure 5.2 of the course notes: iterations ℓ = 1, 3, 50, 100 of the semi-discrete OT algorithm (a simple gradient descent on (5.8)) and the final matching. Red points: support (y_j)_j of the discrete measure β; the continuous measure α is uniform on a square; blue cells: Laguerre partition (L_{g^{(ℓ)}}(y_j))_j for the dual potential g^{(ℓ)} at iteration ℓ.]

50. Semi-discrete Stochastic Descent
Stochastic gradient descent for semi-discrete optimal transport: illustration of convergence and the corresponding Laguerre cells.
[Figure: log(E(g*) − E(g^{(ℓ)})) over iterations ℓ up to 10000.]
https://arxiv.org/abs/1605.08527
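
A sketch of such a scheme, reconstructed as a standard averaged SGD on the semi-discrete dual E(g) = E_{x∼α}[min_j c(x, y_j) − g_j] + Σ_j g_j b_j (the step sizes and the toy setup are my assumptions, not the paper's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
y = rng.random((m, 2))                         # support of the discrete measure beta
b = np.full(m, 1 / m)                          # weights of beta
g = np.zeros(m)                                # dual potential
g_avg = np.zeros(m)

for ell in range(1, 10001):
    x = rng.random(2)                          # sample x ~ alpha (uniform square)
    j = np.argmin(np.sum((y - x) ** 2, 1) - g) # Laguerre cell containing x
    grad = b.copy()                            # stochastic gradient: b - e_j
    grad[j] -= 1.0
    g += (1.0 / np.sqrt(ell)) * grad           # ascent step on the concave dual
    g_avg += (g - g_avg) / ell                 # Polyak-Ruppert averaging
```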

51. Entropic Regularization
The simplex of histograms with N bins is Σ_N := {p ∈ ℝ₊^N ; Σ_i p_i = 1}.
Entropy: the entropy of T ∈ ℝ₊^{N×N} is defined as H(T) := −Σ_{i,j=1}^N T_{i,j}(log(T_{i,j}) − 1).
The set of couplings between histograms p ∈ Σ_{N₁} and q ∈ Σ_{N₂} is
C_{p,q} := {T ∈ (ℝ₊)^{N₁×N₂} ; T 1_{N₂} = p, Tᵀ 1_{N₁} = q},  where 1_N := (1, …, 1)ᵀ ∈ ℝ^N.
Def. Regularized OT [Cuturi NIPS'13]: min_{T ∈ C_{p,q}} ⟨C, T⟩ − ε H(T)

52. Entropic Regularization
(Same as slide 51, adding:)
Regularization impact on the solution:
[Figure: cost c and couplings T_ε for increasing ε.]

53. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

54. Sinkhorn's Algorithm

55. Sinkhorn's Algorithm
(Same as slide 54.)

56. Sinkhorn's Algorithm
Only matrix/vector multiplications. → Parallelizable. → Streams well on GPU.

57. Sinkhorn's Algorithm
(Same as slide 56.)

58. Optimal Transport on Surfaces
Triangulated mesh M. Geodesic distance d_M.

59. Optimal Transport on Surfaces
(Same as slide 58, adding:)
Ground cost: c(x, y) = d_M(x, y)^α.
[Figure: level sets of d(x_i, ·) around a point x_i.]

60. Optimal Transport on Surfaces
(Same as slide 59, adding:)
Computing c (Fast Marching): N² log(N) → too costly.

61. Entropic Transport on Surfaces
Heat equation on M: ∂_t u_t(x, ·) = Δ_M u_t(x, ·), u_{t=0}(x, ·) = δ_x, run until time t = ε.

62. Entropic Transport on Surfaces
(Same as slide 61, adding:)
→ −ε log(u_ε) ≈ d_M² for small ε.

63. Entropic Transport on Surfaces
(Same as slide 62, adding:)
Sinkhorn kernel: K := e^{−d_M²/ε} ≈ u_ε ≈ (Id − (ε/ℓ) Δ_M)^{−ℓ} (ℓ implicit heat steps).

64. MRI Data Processing [with A. Gramfort]
Ground cost c = d_M: geodesic distance on the cortical surface M.
[Figure: L2 barycenter vs. W₂² barycenter.]

65. Regularization for General Measures
Schrödinger's problem:
π_ε := argmin_π { ⟨dᵖ, π⟩ + ε KL(π|π₀) ; π ∈ Π(µ, ν) }
     = argmin_{π ∈ Π(µ,ν)} KL(π|K),  where K(x, y) := e^{−d^p(x,y)/ε} π₀(x, y)
Landmark computational paper: [Cuturi 2013].

66. Regularization for General Measures
(Same as slide 65, adding:)
Proposition [Carlier, Duval, Peyré, Schmitzer 2015]:
π_ε → argmin_{π ∈ Π(µ,ν)} ⟨dᵖ, π⟩ as ε → 0,  and  π_ε → µ ⊗ ν as ε → +∞.
[Figure: π_ε for increasing ε.]

67. Back to Sinkhorn's Algorithm
Optimal transport problem: f₁ = ι_µ, f₂ = ι_ν, so that
Prox^{KL}_{f₁/ε}(µ̃) = µ and Prox^{KL}_{f₂/ε}(ν̃) = ν.

68. Back to Sinkhorn's Algorithm
(Same as slide 67, adding:)
Sinkhorn/IPFP algorithm [Sinkhorn 1967] [Deming, Stephan 1940]:
a^{(ℓ+1)} := µ / (K b^{(ℓ)})   and   b^{(ℓ+1)} := ν / (K* a^{(ℓ+1)})
[Figure: iterates π^{(ℓ)} for varying ε.]

69. Back to Sinkhorn's Algorithm
(Same as slide 68, adding:)
Proposition [Franklin, Lorenz 1989]: with π^{(ℓ)} := diag(a^{(ℓ)}) K diag(b^{(ℓ)}),
‖log(π^{(ℓ)}) − log(π*)‖_∞ = O((1 − λ)^ℓ)  (the rate λ depends on ε and degrades as ε → 0).
Local rate: [Knight 2008].
[Figure: log(‖log(π^{(ℓ)}) − log(π*)‖_∞) over iterations ℓ.]

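Each Sinkhorn pass is two componentwise divisions and two matrix-vector products; a minimal numpy sketch (the grid, cost, and histograms are a toy setup of mine):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iter=2000):
    """Entropic coupling pi = diag(a) K diag(b), with K = exp(-C/eps)."""
    K = np.exp(-C / eps)
    b = np.ones_like(nu)
    for _ in range(n_iter):
        a = mu / (K @ b)          # a^{(l+1)} = mu / (K b^{(l)})
        b = nu / (K.T @ a)        # b^{(l+1)} = nu / (K* a^{(l+1)})
    return a[:, None] * K * b[None, :]

x = np.linspace(0, 1, 50)
C = (x[:, None] - x[None, :]) ** 2
mu = np.exp(-((x - 0.25) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((x - 0.75) ** 2) / 0.02); nu /= nu.sum()
pi = sinkhorn(mu, nu, C, eps=1e-3)
print((C * pi).sum())             # regularized transport cost
```
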
70. Semi-discrete OT and Entropy
Semi-discrete optimal transport is a multiclass SVM; its Sinkhorn version is a logistic regression. It decomposes the space into entropic Laguerre diagrams.
[Figure 5.3 of the course notes: entropic semi-discrete c̄-transforms g^{c̄,ε} for ε = 0, 0.01, 0.1, 0.3. Top: 1-D, ground cost c(x, y) = |x − y|, with red points at locations (y_j, −g_j). Bottom: 2-D, ground cost c(x, y) = ‖x − y‖; black curves are the level sets of g^{c̄,ε}, colors indicate the smoothed indicator functions χ_j^ε of the Laguerre cells, red points at locations y_j ∈ ℝ² with size proportional to g_j. Laguerre cells vs. "Sinkhorn" Laguerre cells.]

71. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

72. Wasserstein Barycenters
Barycenters of measures (µ_k)_k, with weights λ_k ≥ 0, Σ_k λ_k = 1:
µ* ∈ argmin_µ Σ_k λ_k W₂²(µ_k, µ)

73. Wasserstein Barycenters
(Same as slide 72, adding:)
Generalizes the Euclidean barycenter: if µ_k = δ_{x_k} then µ* = δ_{Σ_k λ_k x_k}.
[Figure: barycenter µ of µ₁, µ₂, µ₃, with distances W₂(µ_k, µ).]

74. Wasserstein Barycenters
(Same as slide 73.)

75. Wasserstein Barycenters
(Same as slide 74, adding:)
Two measures (µ₁, µ₂): McCann's displacement interpolation.

76. Wasserstein Barycenters
(Same as slide 75, with a three-measure example (µ₁, µ₂, µ₃).)

77. Wasserstein Barycenters
(Same as slide 76, adding:)
Theorem [Agueh, Carlier 2010] (for c(x, y) = ‖x − y‖²): if µ₁ does not vanish on small sets, µ* exists and is unique.

78. Displacement Interpolation

79. Displacement Interpolation
[Figure 7.2 of the course notes: displacement interpolation α_t for t = 0, 1/5, 2/5, 3/5, 4/5, 1 between discrete measures. Top: point clouds (empirical measures (α₀, α₁) with the same number of points); bottom: same but with weights.]

80. Wasserstein Barycenters
[Figure: barycenters for weights λ ∈ Σ₃, Wasserstein mean vs. L2 mean.]

81. Wasserstein Barycenters
(Same as slide 80.)

82. Regularized Barycenters
min_{(π_k)_k, µ} { Σ_k λ_k (⟨c, π_k⟩ + ε KL(π_k|π_{0,k})) ; ∀k, π_k ∈ Π(µ_k, µ) }

83. Regularized Barycenters
(Same as slide 82, adding:)
→ Need to fix a discretization grid for µ, i.e. choose (π_{0,k})_k.

84. Regularized Barycenters
(Same as slide 83, adding:)
→ Sinkhorn-like algorithm [Benamou, Carlier, Cuturi, Nenna, Peyré 2015].

85. Regularized Barycenters
(Same as slide 84, adding:)
[Solomon et al., SIGGRAPH 2015]
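
A sketch of such a Sinkhorn-like scheme via iterative Bregman projections, in the spirit of [Benamou et al. 2015] (a simplified variant of mine on a common grid, not the paper's exact pseudocode):

```python
import numpy as np

def entropic_barycenter(mus, C, eps, lam, n_iter=500):
    """mus: (K, N) histograms on a shared grid; lam: weights with sum(lam) = 1."""
    Kmat = np.exp(-C / eps)
    b = np.ones_like(mus)                    # one scaling b_k per input measure
    for _ in range(n_iter):
        a = mus / (b @ Kmat.T)               # enforce pi_k 1 = mu_k
        Kta = a @ Kmat                       # K^T a_k, for each k
        mu = np.exp(lam @ np.log(b * Kta))   # barycenter: weighted geometric mean
        b = mu[None, :] / Kta                # enforce pi_k^T 1 = mu
    return mu

# Toy usage: barycenter of two 1-D Gaussian-like histograms.
x = np.linspace(0, 1, 100)
C = (x[:, None] - x[None, :]) ** 2
m1 = np.exp(-((x - 0.2) ** 2) / 0.005); m1 /= m1.sum()
m2 = np.exp(-((x - 0.8) ** 2) / 0.005); m2 /= m2.sum()
bar = entropic_barycenter(np.stack([m1, m2]), C, eps=5e-3, lam=np.array([0.5, 0.5]))
```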

86. Barycenters of 2D Shapes

87. Barycenters of 3D Shapes

88. Barycenter on a Surface
[Figure: interpolation between µ₁ and µ₂ on a surface, weight t ∈ [0, 1].]

89. Barycenter on a Surface
(Same as slide 88, with six inputs µ₁, …, µ₆.)

90. Barycenter on a Surface
(Same as slide 89, for weights λ = (1, …, 1)/6.)

91. Color Transfer
Input images: (f, g) (chrominance components).
Input measures: µ(A) = U(f⁻¹(A)), ν(A) = U(g⁻¹(A))  (U: uniform measure on the pixel domain)

92. Color Transfer
(Same as slide 91.)

93. Color Transfer
(Same as slide 92, adding:)
[Figure: transport map T and the recolored images T ∘ f and T̃ ∘ g.]

94. Topic Models
[Rolet’16]

95. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

96. Unbalanced Transport
For (ξ, µ) ∈ M₊(X)², KL(ξ|µ) := ∫_X log(dξ/dµ) dξ + ∫_X (dµ − dξ)
WF_c(µ, ν) := min_π ⟨c, π⟩ + KL(P1♯π|µ) + KL(P2♯π|ν)   [Liero, Mielke, Savaré 2015]

97. Unbalanced Transport
(Same as slide 96, adding:)
Proposition [Liero, Mielke, Savaré 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]:
if c(x, y) = −log(cos²(min(d(x, y), π/2))) then WF_c^{1/2} is a distance on M₊(X).

98. Unbalanced Transport
(Same as slide 97, adding:)
→ "Dynamic" Benamou-Brenier formulation. [Liero, Mielke, Savaré 2015] [Kondratyev, Monsaingeon, Vorotnikov 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]
[Figure: balanced OT vs. unbalanced OT.]

99. Implicit Euler Stepping
Metric space (X, d); minimize F(x) over X.
Implicit Euler step: x_{k+1} := argmin_{x ∈ X} d(x_k, x)² + τ F(x)
[Figure: iterates x_k, x_{k+1}, x_{k+2}, x_{k+3} with balls {x ; d(x_k, x) ∼ τ}; example F(x) = ‖x‖² on (X = ℝ², ‖·‖_p) for p = 1, 2, +∞.]

100. Implicit vs. Explicit Stepping
Implicit: x_{k+1} = argmin_{x ∈ X} d(x_k, x)² + τ F(x)
Explicit: x_{k+1} = argmin_{x ∈ X} d(x_k, x)² + τ ⟨∇F(x_k), x⟩
[Figure: trajectories for p = 1, 2, +∞ in both cases.]

101. Implicit Euler step in Wasserstein space [Jordan, Kinderlehrer, Otto 1998]:
µ_{t+1} = Prox^W_{τf}(µ_t) := argmin_{µ ∈ M₊(X)} W₂²(µ_t, µ) + τ f(µ)

102. (Same as slide 101, adding:)
Formal limit τ → 0: ∂_t µ = div(µ ∇(f′(µ)))

103. (Same as slide 102, adding:)
f(µ) = ∫ log(dµ/dx) dµ  ⟹  ∂_t µ = Δµ  (heat diffusion)

104. (Same as slide 103, adding:)
f(µ) = ∫ w dµ  ⟹  ∂_t µ = div(µ ∇w)
[Figure: evolutions µ_t along ∇w.]

105. (Same as slide 104, adding:)
f(µ) = 1/(m−1) ∫ (dµ/dx)^{m−1} dµ  ⟹  ∂_t µ = Δ(µ^m)  (non-linear diffusion)

106. Eulerian vs. Lagrangian Discretization

107. Lagrangian Discretization of Entropy
µ̂_n = (1/n) Σ_{i=1}^n δ_{x_i},  with µ̂_n → µ as n → +∞.
Ĥ(µ̂_n) := Σ_i log(min_{j≠i} ‖x_i − x_j‖)  →  H(µ) := ∫ log(dµ/dx (x)) dµ(x)  as n → +∞ (up to normalizing constants)
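
A direct numpy sketch of this nearest-neighbor estimator (the toy data is assumed, and the usual normalizing constants of such estimators are omitted, as on the slide):

```python
import numpy as np

def entropy_estimate(x):
    """H_hat(mu_n) = sum_i log(min_{j != i} ||x_i - x_j||) for a point cloud x."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)             # exclude j = i
    return np.sum(0.5 * np.log(d2.min(axis=1)))   # 0.5*log(d^2) = log(d)

rng = np.random.default_rng(0)
print(entropy_estimate(rng.normal(size=(1000, 2))))
```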

108. Wasserstein flow of E:
min_ρ E(ρ) := ∫ V(x)ρ(x)dx + ∫ ρ(x) log(ρ(x))dx,  with V(x) = ‖x‖²
dρ_t/dt = Δρ_t + div(ρ_t ∇V)
[Figure: evolution of ρ_t over time.]

109. Generalized Entropic Regularization
Primal: min_π ⟨dᵖ, π⟩ + f₁(P1♯π) + f₂(P2♯π) + ε KL(π|π₀)

110. Generalized Entropic Regularization
(Same as slide 109, adding:)
Dual: max_{u,v} −f₁*(−u) − f₂*(−v) − ε ⟨e^{u/ε}, K e^{v/ε}⟩
with (a, b) := (e^{u/ε}, e^{v/ε}) and π(x, y) = a(x) K(x, y) b(y).

111. Generalized Entropic Regularization
(Same as slide 110, adding:)
Block coordinate relaxation:
(I_u): max_u −f₁*(−u) − ε ⟨e^{u/ε}, K e^{v/ε}⟩
(I_v): max_v −f₂*(−v) − ε ⟨e^{v/ε}, K* e^{u/ε}⟩

112. Generalized Entropic Regularization
(Same as slide 111, adding:)
Proposition: with Prox^{KL}_{f₁/ε}(µ) := argmin_ν f₁(ν) + ε KL(ν|µ), the solutions of (I_u) and (I_v) read
a = Prox^{KL}_{f₁/ε}(Kb) / (Kb)   and   b = Prox^{KL}_{f₂/ε}(K*a) / (K*a)

113. Generalized Entropic Regularization
(Same as slide 112, adding:)
→ Only matrix-vector multiplications. → Highly parallelizable.
→ On regular grids: only convolutions! Linear-time iterations.
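
As one concrete instance (a sketch of mine, not the slides' code): for the unbalanced case f_i = ρ KL(·|µ_i), the KL prox has a closed form and the scaling updates reduce to power iterations a = (µ/(Kb))^{ρ/(ρ+ε)}, a standard derivation stated here without proof.

```python
import numpy as np

def unbalanced_sinkhorn(mu, nu, C, eps, rho, n_iter=2000):
    """Generalized Sinkhorn for f_i = rho*KL(.|mu_i); returns the coupling."""
    K = np.exp(-C / eps)
    lam = rho / (rho + eps)
    a, b = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iter):
        a = (mu / (K @ b)) ** lam      # prox step for f1 = rho*KL(.|mu)
        b = (nu / (K.T @ a)) ** lam    # prox step for f2 = rho*KL(.|nu)
    return a[:, None] * K * b[None, :]

x = np.linspace(0, 1, 60)
C = (x[:, None] - x[None, :]) ** 2
mu = np.exp(-((x - 0.3) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((x - 0.7) ** 2) / 0.01); nu = 0.5 * nu / nu.sum()  # unequal masses
pi = unbalanced_sinkhorn(mu, nu, C, eps=1e-2, rho=1.0)
print(pi.sum(axis=1).sum(), pi.sum(axis=0).sum())  # marginals need not match mu, nu
```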

114. µ_{t+1} := argmin_µ W_ε^α(µ_t, µ) + τ f(µ)
Congestion-inducing function: f(µ) = ι_{[0,κ]}(µ) + ⟨w, µ⟩   [Maury, Roudneff-Chupin, Santambrogio 2010]

115. (Same as slide 114, adding:)
Proposition: Prox^{KL}_{f/ε}(µ) = min(e^{−w/ε} µ, κ)

116. (Same as slide 115, adding:)
[Figure: evolutions along ∇w for κ = ‖µ_{t=0}‖_∞, 2‖µ_{t=0}‖_∞, 4‖µ_{t=0}‖_∞.]

117. Crowd Motion on a Surface
X = triangulated mesh.
[Figure: potential w (cos(w) displayed); evolutions for κ = ‖µ_{t=0}‖_∞ and κ = 6‖µ_{t=0}‖_∞.]

118. Crowd Motion on a Surface
(Same as slide 117.)

119. Gradient Flows: Crowd Motion with Obstacles
X = sub-domain of ℝ².
[Figure: potential w (cos(w) displayed); evolutions for κ = ‖µ_{t=0}‖_∞, 2‖µ_{t=0}‖_∞, 4‖µ_{t=0}‖_∞, 6‖µ_{t=0}‖_∞.]

120. Gradient Flows: Crowd Motion with Obstacles
(Same as slide 119.)

121. (µ_{1,t+1}, µ_{2,t+1}) := argmin_{(µ₁,µ₂)} W_ε^α(µ_{1,t}, µ₁) + W_ε^α(µ_{2,t}, µ₂) + τ f(µ₁, µ₂)

122. Wasserstein attraction:
(Same as slide 121, adding:)
f(µ₁, µ₂) = W_ε^α(µ₁, µ₂) + h₁(µ₁) + h₂(µ₂), with h_i(µ) = ⟨w, µ⟩
[Figure: evolution along ∇w.]

123. Wasserstein attraction:
(Same as slide 122, adding:)
or h_i(µ) = ι_{[0,κ]}(µ).

124. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

125. Discriminative vs. Generative Models
Generative: g_θ : Z → X, from low dimension (z) to high dimension (x).
Discriminative: d_ξ : X → Z.

126. Discriminative vs. Generative Models
(Same as slide 125, adding:)
Supervised: classification, z = class probability.
→ Learn d_ξ from labeled data (x_i, z_i)_i.

127. Discriminative vs. Generative Models
(Same as slide 126, adding:)
Un-supervised: compression, z = d_ξ(x) is a representation; generation, x = g_θ(z) is a synthesis.
→ Learn (g_θ, d_ξ) from data (x_i)_i.

128. Discriminative vs. Generative Models
(Same as slide 127, adding:)
Density fitting: g_θ({z_i}_i) ≈ {x_i}_i.   Auto-encoders: g_θ(d_ξ(x_i)) ≈ x_i.

129. Discriminative vs. Generative Models
(Same as slide 128, adding:)
Optimal transport map d_ξ.

130. Density Fitting and Generative Models
Parametric model: θ ↦ µ_θ. Observations: ν = (1/n) Σ_{i=1}^n δ_{x_i}.

131. Density Fitting and Generative Models
(Same as slide 130, adding:)
Density fitting when dµ_θ(y) = f_θ(y)dy: maximum likelihood (MLE)
min_θ KL̂(ν|µ_θ) := −Σ_i log(f_θ(x_i))

132. Density Fitting and Generative Models
(Same as slide 131, adding:)
Generative model fit: µ_θ = g_{θ,♯}ζ.
→ KL̂(µ_θ|ν) = +∞: MLE undefined. → Need a weaker metric.

133. Loss Functions for Measures
Density fitting: min_θ D(µ_θ, ν)
Optimal transport distances, for µ = (1/N) Σ_i δ_{x_i}, ν = (1/P) Σ_j δ_{y_j}:
W(µ, ν)^p := min_{T ∈ C_{µ,ν}} Σ_{i,j} T_{i,j} ‖x_i − y_j‖^p

134. Loss Functions for Measures
(Same as slide 133, adding:)
Maximum Mean Discrepancy (MMD):
‖µ − ν‖_k² := (1/N²) Σ_{i,i′} k(x_i, x_{i′}) + (1/P²) Σ_{j,j′} k(y_j, y_{j′}) − (2/(NP)) Σ_{i,j} k(x_i, y_j)
Gaussian: k(x, y) = e^{−‖x−y‖²/(2σ²)}.  Energy distance: k(x, y) = −‖x − y‖₂.

135. Loss Functions for Measures
(Same as slide 134, adding:)
Sinkhorn divergences [Cuturi 13]:
W_ε(µ, ν)^p := Σ_{i,j} T^ε_{i,j} ‖x_i − y_j‖^p
W̄_ε(µ, ν)^p := W_ε(µ, ν)^p − ½ W_ε(µ, µ)^p − ½ W_ε(ν, ν)^p

136. Loss Functions for Measures
(Same as slide 135, adding:)
Theorem [Ramdas, G. Trillos, Cuturi 17]:
W̄_ε(µ, ν)^p → W(µ, ν)^p as ε → 0, and W̄_ε(µ, ν)^p → ‖µ − ν‖_k² as ε → +∞, for k(x, y) = −‖x − y‖^p.

137. Loss Functions for Measures
(Same as slide 136, adding:)
Best of both worlds → cross-validate ε:
– Scale free (no σ, no heavy-tail kernel).
– Non-Euclidean, arbitrary ground distance.
– No curse of dimension (low sample complexity).
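
A sketch of the Sinkhorn divergence on point clouds, reusing plain Sinkhorn iterations (the uniform weights and toy inputs are my assumptions):

```python
import numpy as np

def w_eps(x, y, eps, p=2, n_iter=500):
    """W_eps(mu, nu)^p = <C, T_eps> for uniform empirical measures on x, y."""
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p
    mu, nu = np.full(len(x), 1 / len(x)), np.full(len(y), 1 / len(y))
    K = np.exp(-C / eps)
    b = np.ones_like(nu)
    for _ in range(n_iter):
        a = mu / (K @ b)
        b = nu / (K.T @ a)
    return np.sum((a[:, None] * K * b[None, :]) * C)

def sinkhorn_divergence(x, y, eps, p=2):
    """bar-W_eps = W_eps(mu,nu)^p - W_eps(mu,mu)^p/2 - W_eps(nu,nu)^p/2."""
    return (w_eps(x, y, eps, p)
            - 0.5 * w_eps(x, x, eps, p) - 0.5 * w_eps(y, y, eps, p))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(100, 2)), rng.normal(0.5, 1.0, size=(120, 2))
print(sinkhorn_divergence(x, y, eps=0.1))
```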

138. Deep Discriminative vs. Generative Models
Deep networks:
g_θ(z) = ρ(θ_K(… ρ(θ₂(ρ(θ₁(z)))) …))
d_ξ(x) = ρ(ξ_K(… ρ(ξ₂(ρ(ξ₁(x)))) …))
[Diagram: generative g_θ : z ↦ x, discriminative d_ξ : x ↦ z.]

139. Deep Discriminative vs. Generative Models
(Same as slide 138, with the diagram Z ⇄ X through g_θ and d_ξ.)

140. Examples of Image Generation
g_θ : Z → X
[Credit: arXiv:1511.06434]

141. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

142. Gromov-Wasserstein
Unregistered spaces X, Y.
Inputs: {(similarity/kernel matrix, histogram)}

143. Gromov-Wasserstein
(Same as slide 142, adding:)
Def. Gromov-Wasserstein distance [Mémoli 2011] [Sturm 2012]: …

144. Gromov-Wasserstein
(Same as slide 143, adding:)
→ NP-hard in general. → Need for a fast approximate solver.

145. Gromov-Wasserstein as a Metric
[Figure: points x_i ∈ X matched to points y_j ∈ Y.]
Def. …

146. Gromov-Wasserstein as a Metric
(Same as slide 145, adding:)
Def. Isometries f on M: … ⟺ …
[Figure: isometric deformations f between X and Y.]

147. Gromov-Wasserstein as a Metric
(Same as slide 146, adding:)
Prop. [Mémoli 2011]: …
→ "bending-invariant" object recognition.

148. For Arbitrary Spaces
Metric-measure spaces (X, Y): (d_X, µ), (d_Y, ν).

149. For Arbitrary Spaces
(Same as slide 148, adding:)
Def. Gromov-Wasserstein distance [Mémoli 2011] [Sturm 2012]:
GW₂²(d_X, µ, d_Y, ν) := min_{π ∈ Π(µ,ν)} ∫_{X²×Y²} |d_X(x, x′) − d_Y(y, y′)|² dπ(x, y) dπ(x′, y′)

150. For Arbitrary Spaces
(Same as slide 149, adding:)
Prop. GW is a distance on mm-spaces up to isometries.
→ "bending-invariant" object recognition.

151. For Arbitrary Spaces
(Same as slide 150, adding:)
→ QAP: NP-hard in general. → Need for a fast approximate solver.

152. Entropic Gromov-Wasserstein
Def. Entropic Gromov-Wasserstein: …
Projected mirror descent.

153. Entropic Gromov-Wasserstein
(Same as slide 152, adding:)
Prop. For τ = 1/ε, the iteration reads:
func T = GW(C, C̄, p, q)
  initialization: …
  repeat:
    …
  until convergence.
  return T
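
A sketch filling in the elided steps with the standard square-loss iteration (each outer step builds the local cost L(C, C̄)⊗T and re-solves an entropic OT with Sinkhorn; this reconstruction is mine, valid under the assumption that T stays in Π(p, q)):

```python
import numpy as np

def sinkhorn_coupling(p, q, M, eps, n_iter=200):
    K = np.exp(-M / eps)
    b = np.ones_like(q)
    for _ in range(n_iter):
        a = p / (K @ b)
        b = q / (K.T @ a)
    return a[:, None] * K * b[None, :]

def entropic_gw(C, Cb, p, q, eps, n_outer=50):
    """Entropic GW for the square loss L(a, b) = (a - b)^2."""
    T = np.outer(p, q)                        # initialization
    const = (C**2 @ p)[:, None] + (Cb**2 @ q)[None, :]
    for _ in range(n_outer):
        M = const - 2.0 * C @ T @ Cb.T        # local cost (L(C, Cb) ⊗ T)_{ij}
        T = sinkhorn_coupling(p, q, M, eps)   # mirror-descent / Sinkhorn step
    return T

rng = np.random.default_rng(0)
x, y = rng.random((30, 2)), rng.random((40, 3))   # two unregistered spaces
C  = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
Cb = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
T = entropic_gw(C, Cb, np.full(30, 1/30), np.full(40, 1/40), eps=0.05)
```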

154. Applications of GW: Shape Analysis
Use T to define a registration between: shape ↔ shape, shape ↔ colors distribution.
[Figure 1 of the accompanying paper: entropic GW finds correspondences between a source surface and targets (a surface with similar structure, a surface with shared semantic structure, a noisy 3D point cloud, an icon, and a hand drawing); each fuzzy map computed with the same code. Excerpts: an entropically-regularized version of the GW objective from [Mémoli 2011] measures distortion of geodesic distances; the optimizer is a "fuzzy" correspondence matrix whose sharpness is controlled by the weight of the entropic regularizer; supervised matching can incorporate ground-truth matches via a sparsity stencil or symmetric weight matrices (W₀, W) that prioritize preserving selected distances. Figure 15: MDS embedding of four SHREC classes (teddies, humans, four-legged). Figure 16: recovery of a galloping horse sequence, without the ground-truth maps required by prior work. Figure 18: mapping a set of 185 images onto two shapes while preserving color similarity.]

155. Applications of GW: Shape Analysis
(Same as slide 154, adding:)
Visualization of a shape collection (X_s)_s: geodesic distances → GW distances → MDS embedding in 2-D/3-D.

156. Applications of GW: Quantum Chemistry
Regression problem.
→ Obtaining f by solving a DFT approximation is too costly.

157. Applications of GW: Quantum Chemistry
(Same as slide 156, adding:)
[Rupp et al. 2012] [Figure: matrix representation of a molecule.]

158. Applications of GW: Quantum Chemistry
(Same as slide 157, adding:)
GW-interpolation.

159. Gromov-Wasserstein Geodesics
Def. Gromov-Wasserstein geodesic: …

160. Gromov-Wasserstein Geodesics
(Same as slide 159, adding:)
Prop. [Sturm 2012]: …

161. Gromov-Wasserstein Geodesics
(Same as slide 160, adding:)
→ X × Y is not practical for most applications (need to fix the size of the geodesic embedding space).
→ Extension to more than 2 input spaces?

162. Gromov-Wasserstein Barycenters
Input: three mm-spaces (s = 1, 2, 3).
Def. GW barycenters: …

163. Gromov-Wasserstein Barycenters
(Same as slide 162, adding:)
Alternating minimization:
  initialization: C ← C₀
  repeat:
    for s = 1 to S do: minimize on T_s
    update C
  until convergence.
  return C

164. Conclusion: Toward High-dimensional OT
Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani.
[Figure: transport framework, sliced Wasserstein projection, applications; application to color transfer: source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of Y, source image after color transfer. J. Rabin, Wasserstein Regularization]