
# Numerical Optimal Transport

Slides for a course.

## Gabriel Peyré

December 07, 2017

## Transcript

1. Numerical Optimal Transport
Gabriel Peyré
École Normale Supérieure
… with some demos
www.numerical-tours.com
http://optimaltransport.github.io

2. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

3. Comparing Measures and Spaces
[Figure: color transfer. Source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of Y, source image after color transfer. J. Rabin, Wasserstein Regularization]
→ images, vision, graphics and machine learning, …
• Probability distributions and histograms

4. Comparing Measures and Spaces
(Same as slide 3, adding:)
• Optimal transport
[Figure: optimal transport mean vs. L2 mean.]

5. Probability Measures
Positive Radon measure µ on a set X.
Density: dµ(x) = m(x)dx
Measure of sets A ⊂ X: µ(A) = ∫_A dµ(x) ≥ 0
Discrete measure: µ = Σ_i µ_i δ_{x_i}
[Figure 2.1 of the course notes: discrete distributions α = Σ_{i=1}^n a_i δ_{x_i} (red: empirical uniform distribution a_i = 1/n, blue: arbitrary distributions) and densities dα(x) = ρ_α(x)dx w.r.t. the Lebesgue measure, often denoted ρ_α = dα/dx (in violet), in both 1-D and 2-D. Discrete distributions are drawn as vertical segments of length a_i in 1-D and as point clouds of radius a_i in 2-D.]

6. Probability Measures
(Same as slide 5, adding:)
Integration against continuous functions: ∫_X g(x)dµ(x) ≥ 0 (for g ≥ 0)
Density case: ∫_X g dµ = ∫_X g(x) m(x) dx
Discrete case: ∫_X g dµ = Σ_i µ_i g(x_i)

7. Probability Measures
(Same as slide 6, adding:)
Probability (normalized) measure: µ(X) = ∫_X dµ(x) = 1

8. Measures and Random Variables
P(X ∈ A) = ∫_A dµ(x)
Convergence in law: P(X_n ∈ A) → P(X ∈ A) as n → +∞, ∀ set A
Weak* convergence: ∫ f dµ_n → ∫ f dµ as n → +∞, ∀ continuous function f

9. Measures and Random Variables
(Same as slide 8, adding:)
Weak convergence: …

10. Discretization: Histogram vs. Empirical
Discrete measure: µ = Σ_{i=1}^N µ_i δ_{x_i}, x_i ∈ X, Σ_i µ_i = 1
Lagrangian (point clouds): constant weights µ_i = 1/N, free positions; quotient space X^N/Σ_N.
Eulerian (histograms): fixed positions x_i (e.g. a grid), weights in the convex polytope (simplex) {(µ_i)_i ≥ 0 ; Σ_i µ_i = 1}.
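
As a concrete illustration (a minimal numpy sketch of my own, not from the slides): the same underlying measure represented as a Lagrangian point cloud with constant weights, and as an Eulerian histogram with fixed grid positions and free weights on the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.random((1000, 2))          # draws from an underlying measure on [0,1]^2

# Lagrangian (point cloud): free positions x_i, constant weights 1/N.
N = 100
lagrangian_x = samples[:N]
lagrangian_w = np.full(N, 1.0 / N)

# Eulerian (histogram): fixed grid positions, free weights on the simplex.
bins = 16
hist, _, _ = np.histogram2d(samples[:, 0], samples[:, 1],
                            bins=bins, range=[[0, 1], [0, 1]])
eulerian_w = (hist / hist.sum()).ravel() # (mu_i)_i >= 0, sum_i mu_i = 1
```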

11. Push Forward
Radon measures (µ, ν) on (X, Y).
Transfer of measure by f : X → Y: push-forward ν = f♯µ, defined by
ν(A) := µ(f⁻¹(A))  ⟺  ∫_Y g(y)dν(y) := ∫_X g(f(x))dµ(x)
[Figure: y = f(x); f maps X (carrying µ) to Y (carrying ν); sets A and f⁻¹(A).]

12. Push Forward
(Same as slide 11, adding:)
Smooth densities dµ = ρ(x)dx, dν = ξ(x)dx:
f♯µ = ν  ⟺  ξ(f(x)) |det(∂f(x))| = ρ(x)

13. Push-forward vs. Pull-back
f : X → Y
Measures, push-forward f♯ : M(X) → M(Y): f♯(Σ_i µ_i δ_{x_i}) := Σ_i µ_i δ_{f(x_i)}
Functions, pull-back f^♯ : C(Y) → C(X): f^♯φ := φ ∘ f
Remark: f♯ and f^♯ are adjoint: ∫_Y φ d(f♯µ) = ∫_X (f^♯φ) dµ
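
A numerical check of this adjunction (a sketch; the toy choices of f and φ are my own, the identity is the one on the slide):

```python
import numpy as np

f   = lambda x: x ** 2                  # map f : X -> Y
phi = lambda y: np.cos(y)               # test function phi in C(Y)

x  = np.linspace(0.0, 1.0, 200)         # support of mu = sum_i mu_i delta_{x_i}
mu = np.full(x.size, 1.0 / x.size)

# Push-forward: f#mu = sum_i mu_i delta_{f(x_i)} (points move, weights stay).
push_points = f(x)
lhs = np.sum(mu * phi(push_points))     # integral of phi against f#mu
rhs = np.sum(mu * phi(f(x)))            # integral of the pull-back phi∘f against mu
assert np.isclose(lhs, rhs)             # the two operators are adjoint
```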

14. Convergence of Random Variables
Almost sure: P(lim_{n→+∞} X_n = X) = 1
In mean: lim_{n→+∞} E(|X_n − X|^p) = 0
In probability: ∀ ε > 0, P(|X_n − X| > ε) → 0 as n → +∞
In law: P(X_n ∈ A) → P(X ∈ A) (the X_n can be defined on different spaces)
Implications: almost sure ⟹ in probability ⟹ in law; in mean ⟹ in probability.

15. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

16. Gaspard Monge (1746-1818)
(1784)

17. Monge Transport
min_{ν = f♯µ} ∫_X c(x, f(x)) dµ(x)
[Figure: map f from X (µ) to Y (ν).]

18. Monge Transport
(Same as slide 17, adding:)
Theorem [Brenier]: for c(x, y) = ‖x − y‖², (µ, ν) with density, there exists a unique optimal f. One has f = ∇φ where φ is the unique convex function such that (∇φ)♯µ = ν.

19. Monge Transport
(Same as slide 18, adding:)
Monge-Ampère equation: ξ(∇φ) det(∂²φ) = ρ

20. Monge Transport
(Same as slide 19, adding:)
Non-uniqueness / non-existence:
[Figure: examples of non-unique optimal maps f, f′ and of measures that no map can transport.]

21. Leonid Kantorovich (1912-1986)
Леонид Витальевич Канторович

22. Before Kantorovitch
Optimal transport was formulated in 1930 by A.N. Tolstoi, 12 years before Kantorovich. He even solved a "large scale" 10×68 instance!

23. Kantorovitch's Formulation
Points (x_i)_i, (y_j)_j with input distributions p, q.
Def. Couplings: C_{p,q} := {T ≥ 0 ; Σ_j T_{i,j} = p_i, Σ_i T_{i,j} = q_j}

24. Kantorovitch's Formulation
(Same as slide 23, adding:)
Def. Wasserstein Distance / EMD [Kantorovich 1942]: min_{T ∈ C_{p,q}} Σ_{i,j} T_{i,j} c(x_i, y_j)

25. OT Between General Measures
Marginals: P1♯π(S) := π(S, X), P2♯π(S) := π(X, S)
Couplings: Π(µ, ν) := {π ∈ M₊(X × X) ; P1♯π = µ, P2♯π = ν}
Optimal transport [Kantorovitch 1942]:
W_p^p(µ, ν) := min { ⟨dᵖ, π⟩ = ∫_{X×X} d(x, y)^p dπ(x, y) ; π ∈ Π(µ, ν) }

26. Couplings: the 3 Settings
[Figure: couplings between (α, β) in the discrete, continuous, and semi-discrete settings.]

27. Couplings

28. 1-D Optimal Transport
Remark. If Ω = ℝ, c(x, y) = c(|x − y|) with c convex, and F_µ⁻¹, F_ν⁻¹ are the quantile functions, then
W(µ, ν) = ∫₀¹ c(|F_µ⁻¹(x) − F_ν⁻¹(x)|) dx
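
On the line this formula turns optimal transport into a sort: for uniform empirical measures with the same number of points, the quantile functions evaluated at the levels k/n are the sorted samples. A minimal sketch (the toy data is assumed):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p^p between mu = mean of delta_{x_i} and nu = mean of delta_{y_j}
    (same number of points): sorting evaluates F_mu^{-1}, F_nu^{-1} at k/n."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)

rng = np.random.default_rng(0)
mu_pts = rng.normal(0.0, 1.0, size=1000)
nu_pts = rng.normal(2.0, 0.5, size=1000)
print(wasserstein_1d(mu_pts, nu_pts))   # ≈ (0-2)^2 + (1-0.5)^2 = 4.25 for Gaussians
```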

29. OT Between Gaussians
Remark. If Ω = ℝᵈ, c(x, y) = ‖x − y‖², and µ = N(m_µ, Σ_µ), ν = N(m_ν, Σ_ν), then
W₂²(µ, ν) = ‖m_µ − m_ν‖² + B(Σ_µ, Σ_ν)²
where B is the Bures metric
B(Σ_µ, Σ_ν)² = trace(Σ_µ + Σ_ν − 2(Σ_µ^{1/2} Σ_ν Σ_µ^{1/2})^{1/2}).
The map T : x ↦ m_ν + A(x − m_µ) is optimal, where
A = Σ_µ^{−1/2} (Σ_µ^{1/2} Σ_ν Σ_µ^{1/2})^{1/2} Σ_µ^{−1/2}.
[Figure: 1-D Gaussian densities ρ_α, ρ_β, with density (1/√(2πσ²)) e^{−(x−m)²/(2σ²)}.]
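
The closed form is easy to evaluate; a sketch using scipy's matrix square root (the inputs are a toy example of mine):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussians(m1, S1, m2, S2):
    """W2^2(N(m1,S1), N(m2,S2)) = ||m1 - m2||^2 + Bures(S1, S2)^2."""
    S1h = sqrtm(S1)
    cross = np.real(sqrtm(S1h @ S2 @ S1h))   # (S1^{1/2} S2 S1^{1/2})^{1/2}
    bures2 = np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sum((m1 - m2) ** 2) + bures2)

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(w2_gaussians(m1, S1, m2, S2))
```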

30. OT on Gaussians and Bures' Distance

31. W1 OT and Min-cost Flows

32. Metrics on the Space of Measures
Bins-to-bins metrics, for dµ(x) = ρ(x)dx and dµ̃(x) = ρ̃(x)dx:
Kullback-Leibler divergence: D_KL(µ, µ̃) = ∫ ρ(x) log(ρ(x)/ρ̃(x)) dx
Hellinger distance: D_H(µ, µ̃)² = ∫ (√ρ(x) − √ρ̃(x))² dx

33. Metrics on the Space of Measures
(Same as slide 32, adding:)
Effect of a translation µ_τ of µ: D(µ, µ_τ) ≈ cst, while W₂(µ, µ_τ) = τ.

34. Csiszár Divergence vs. Dual Norms
Csiszár divergences: D_φ(α|β) := ∫_X φ(dα/dβ) dβ
Dual norms: ‖α − β‖_B := max_{f ∈ B} ∫_X f(x)(dα(x) − dβ(x))
D_φ → strong topology: KL, TV, χ², Hellinger, …
‖·‖_B → weak topology: W₁, flat norm, RKHS, energy distance, …

35. Csiszár Divergence
Csiszár divergences: a unifying way to define losses between arbitrary positive measures (discrete & densities). https://en.wikipedia.org/wiki/F-divergence

36. Dual Norms
Dual norms (aka Integral Probability Metrics):
‖α − β‖_B := max { ∫_X f(x)(dα(x) − dβ(x)) ; f ∈ B }
Wasserstein-1: B = {f ; ‖∇f‖_∞ ≤ 1}.
Flat norm: B = {f ; ‖f‖_∞ ≤ 1, ‖∇f‖_∞ ≤ 1}.
RKHS: B = {f ; ‖f‖_k ≤ 1}, for which
‖α − β‖_B² = ∫ k(x, x′)dα(x)dα(x′) + ∫ k(y, y′)dβ(y)dβ(y′) − 2 ∫ k(x, y)dα(x)dβ(y)
Energy distance: k(x, y) = −‖x − y‖. Gaussian: k(x, y) = e^{−‖x−y‖²/(2σ²)}.
[Figure: the four norms (Energy, Gauss, W₁, Flat) compared on translated measures (shifts t and t/2).]
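
The RKHS case has a closed form; below is a minimal sketch of ‖α − β‖²_k for uniform point clouds, with the two kernels from the slide (the data is a toy assumption):

```python
import numpy as np

def mmd2(x, y, k):
    """||alpha - beta||_k^2 for alpha = mean of delta_{x_i}, beta = mean of delta_{y_j}."""
    kxx = k(x[:, None, :], x[None, :, :]).mean()
    kyy = k(y[:, None, :], y[None, :, :]).mean()
    kxy = k(x[:, None, :], y[None, :, :]).mean()
    return kxx + kyy - 2.0 * kxy

energy = lambda a, b: -np.linalg.norm(a - b, axis=-1)                # energy distance
gauss  = lambda a, b: np.exp(-np.sum((a - b) ** 2, -1) / 2.0)        # Gaussian, sigma = 1

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(1.0, 1.0, size=(300, 2))
print(mmd2(x, y, energy), mmd2(x, y, gauss))
```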

37. RKHS Norms aka Maximum Mean Discrepancy
38. The Earth Mover's Distance
dist(I₁, I₂) = W₁(µ, ν)   [Rubner '98]
[Figure: color histograms µ, ν of two images I₁, I₂.]

39. 29
The Word Mover’s Distance
[Kusner’15] dist(D1, D2) = W2(µ, ⌫)
µ

40. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Linear Programming and Semi-discrete
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

41. Algorithms
Linear programming: µ = Σ_{i=1}^{N₁} p_i δ_{x_i}, ν = Σ_{j=1}^{N₂} q_j δ_{y_j}

42. Algorithms
(Same as slide 41, adding:)
Cost ∼ O(N³).
Hungarian/Auction: µ = (1/N) Σ_{i=1}^N δ_{x_i}, ν = (1/N) Σ_{j=1}^N δ_{y_j}

43. Algorithms
(Same as slide 42, with an illustration of the assignment.)

44. Algorithms
(Same as slide 43, adding:)
Monge-Ampère / Benamou-Brenier, d = ‖·‖₂².
[Excerpt from a performance study of solvers: the network simplex behaves in O(n²) in this context, a major gain at the scale of thousands of particles; fixed-point arithmetic further speeds up the computation (the final transport cost is then less accurate, but the particle pairing driving the interpolation is unchanged). Current EMD image-retrieval and color-transfer techniques rely on slower solvers (Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003). Figure 6: log-log plot of running times of Network Simplex (fixed point), Network Simplex (double precision) and Transport Simplex vs. number of bins per histogram, against reference curves y = αx² and y = βx³. Figure 7: synthetic 2-D examples on a Euclidean domain, with interpolations for α = 1/4, 1/2, 3/4 between the input distributions.]

45. Algorithms
(Same as slide 44, adding:)
Semi-discrete: Laguerre cells, d = ‖·‖₂². [Lévy '15]

46. Algorithms
(Same as slide 45.)

47. Algorithms
(Same as slide 46, adding:)
→ Need for fast approximate algorithms for generic c.
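
For small problems the Kantorovitch LP can be handed to a generic solver; a sketch with scipy (dedicated network-simplex codes are far faster at scale, as the excerpt above notes):

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(p, q, C):
    """min_{T >= 0, T1 = p, T^T 1 = q} <C, T>; returns the optimal coupling T."""
    n1, n2 = C.shape
    A_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):
        A_eq[i, i * n2:(i + 1) * n2] = 1.0   # row sums:    sum_j T_ij = p_i
    for j in range(n2):
        A_eq[n1 + j, j::n2] = 1.0            # column sums: sum_i T_ij = q_j
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.x.reshape(n1, n2)

x = np.linspace(0, 1, 5)[:, None]
y = np.linspace(0, 1, 4)[:, None]
C = (x - y.T) ** 2                           # squared-distance cost
p, q = np.full(5, 1 / 5), np.full(4, 1 / 4)
T = ot_lp(p, q, C)
print(T.sum(axis=1), (C * T).sum())          # marginals ≈ p; transport cost
```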

48. C-Transform
Alternating maximization of the dual E(f, g) over f or g leads to the notion of c-transform:
∀ y ∈ Y, f^c(y) := inf_{x ∈ X} c(x, y) − f(x)
∀ x ∈ X, g^c̄(x) := inf_{y ∈ Y} c(x, y) − g(y),  where c̄(y, x) := c(x, y).
Indeed, one can check that f^c ∈ argmax_g E(f, g) and g^c̄ ∈ argmax_f E(f, g). These partial maximizations extend the dual potentials from the supports of α and β to the whole spaces X and Y. When X = ℝᵈ and c(x, y) = ‖x − y‖^p, the c-transform f^c is the inf-convolution between −f and ‖·‖^p, often called a "Hopf-Lax formula".
The map (f, g) ∈ C(X) × C(Y) ↦ (g^c̄, f^c) replaces potentials by "better" ones (improving the dual objective E). Functions of the form f^c and g^c̄ are called c-concave and c̄-concave; in the special case c(x, y) = ⟨x, y⟩ on X = Y = ℝᵈ this coincides with the usual notion of concave functions. Iterating the map, the dual can be reformulated as an unconstrained program over a single potential:
L_c(α, β) = max_{f ∈ C(X)} ∫_X f(x)dα(x) + ∫_Y f^c(y)dβ(y) = max_{g ∈ C(Y)} ∫_X g^c̄(x)dα(x) + ∫_Y g(y)dβ(y).
Semi-discrete formulation: when β = Σ_j b_j δ_{y_j} is discrete, the c̄-transform restricts to the support (y_j)_j of β:
∀ g ∈ ℝᵐ, ∀ x ∈ X, g^c̄(x) := min_j c(x, y_j) − g_j,
which maps a vector g to a continuous function g^c̄ ∈ C(X).
[Figure 5.1 of the course notes: semi-discrete c̄-transforms g^c̄ in 1-D, for ground costs |x − y|^p with p = 1/2, 1, 3/2, 2.]

49. Semi-discrete Descent Algorithm
[Figure 5.2 of the course notes: iterations ℓ = 1, 3, 50, 100 of the semi-discrete OT algorithm (a simple gradient descent on (5.8)) and the final matching. Red points: support (y_j)_j of the discrete measure β; the continuous measure α is uniform on a square; blue cells: Laguerre partition (L_{g^{(ℓ)}}(y_j))_j for the dual potential g^{(ℓ)} at iteration ℓ.]

50. Semi-discrete Stochastic Descent
Stochastic gradient descent for semi-discrete optimal transport: illustration of convergence and the corresponding Laguerre cells.
[Figure: log(E(g*) − E(g^{(ℓ)})) over iterations ℓ up to 10000.]
https://arxiv.org/abs/1605.08527
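
A sketch of such a scheme, reconstructed as a standard averaged SGD on the semi-discrete dual E(g) = E_{x∼α}[min_j c(x, y_j) − g_j] + Σ_j g_j b_j (the step sizes and the toy setup are my assumptions, not the paper's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
y = rng.random((m, 2))                         # support of the discrete measure beta
b = np.full(m, 1 / m)                          # weights of beta
g = np.zeros(m)                                # dual potential
g_avg = np.zeros(m)

for ell in range(1, 10001):
    x = rng.random(2)                          # sample x ~ alpha (uniform square)
    j = np.argmin(np.sum((y - x) ** 2, 1) - g) # Laguerre cell containing x
    grad = b.copy()                            # stochastic gradient: b - e_j
    grad[j] -= 1.0
    g += (1.0 / np.sqrt(ell)) * grad           # ascent step on the concave dual
    g_avg += (g - g_avg) / ell                 # Polyak-Ruppert averaging
```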

51. Entropic Regularization
The simplex of histograms with N bins is Σ_N := {p ∈ ℝ₊^N ; Σ_i p_i = 1}.
Entropy: the entropy of T ∈ ℝ₊^{N×N} is defined as H(T) := −Σ_{i,j=1}^N T_{i,j}(log(T_{i,j}) − 1).
The set of couplings between histograms p ∈ Σ_{N₁} and q ∈ Σ_{N₂} is
C_{p,q} := {T ∈ (ℝ₊)^{N₁×N₂} ; T 1_{N₂} = p, Tᵀ 1_{N₁} = q},  where 1_N := (1, …, 1)ᵀ ∈ ℝ^N.
Def. Regularized OT [Cuturi NIPS'13]: min_{T ∈ C_{p,q}} ⟨C, T⟩ − ε H(T)

52. Entropic Regularization
(Same as slide 51, adding:)
Regularization impact on the solution:
[Figure: cost c and couplings T_ε for increasing ε.]

53. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

54. Sinkhorn's Algorithm

55. Sinkhorn's Algorithm
(Same as slide 54.)

56. Sinkhorn's Algorithm
Only matrix/vector multiplications. → Parallelizable. → Streams well on GPU.

57. Sinkhorn's Algorithm
(Same as slide 56.)

58. Optimal Transport on Surfaces
Triangulated mesh M. Geodesic distance d_M.

59. Optimal Transport on Surfaces
(Same as slide 58, adding:)
Ground cost: c(x, y) = d_M(x, y)^α.
[Figure: level sets of d(x_i, ·) around a point x_i.]

60. Optimal Transport on Surfaces
(Same as slide 59, adding:)
Computing c (Fast Marching): N² log(N) → too costly.

61. Entropic Transport on Surfaces
Heat equation on M: ∂_t u_t(x, ·) = Δ_M u_t(x, ·), u_{t=0}(x, ·) = δ_x, run until time t = ε.

62. Entropic Transport on Surfaces
(Same as slide 61, adding:)
→ −ε log(u_ε) ≈ d_M² for small ε.

63. Entropic Transport on Surfaces
(Same as slide 62, adding:)
Sinkhorn kernel: K := e^{−d_M²/ε} ≈ u_ε ≈ (Id − (ε/ℓ) Δ_M)^{−ℓ} (ℓ implicit heat steps).

64. MRI Data Processing [with A. Gramfort]
Ground cost c = d_M: geodesic distance on the cortical surface M.
[Figure: L2 barycenter vs. W₂² barycenter.]

65. Regularization for General Measures
Schrödinger's problem:
π_ε := argmin_π { ⟨dᵖ, π⟩ + ε KL(π|π₀) ; π ∈ Π(µ, ν) }
     = argmin_{π ∈ Π(µ,ν)} KL(π|K),  where K(x, y) := e^{−d^p(x,y)/ε} π₀(x, y)
Landmark computational paper: [Cuturi 2013].

66. Regularization for General Measures
(Same as slide 65, adding:)
Proposition [Carlier, Duval, Peyré, Schmitzer 2015]:
π_ε → argmin_{π ∈ Π(µ,ν)} ⟨dᵖ, π⟩ as ε → 0,  and  π_ε → µ ⊗ ν as ε → +∞.
[Figure: π_ε for increasing ε.]

67. Back to Sinkhorn's Algorithm
Optimal transport problem: f₁ = ι_µ, f₂ = ι_ν, so that
Prox^{KL}_{f₁/ε}(µ̃) = µ and Prox^{KL}_{f₂/ε}(ν̃) = ν.

68. Back to Sinkhorn's Algorithm
(Same as slide 67, adding:)
Sinkhorn/IPFP algorithm [Sinkhorn 1967] [Deming, Stephan 1940]:
a^{(ℓ+1)} := µ / (K b^{(ℓ)})   and   b^{(ℓ+1)} := ν / (K* a^{(ℓ+1)})
[Figure: iterates π^{(ℓ)} for varying ε.]

69. Back to Sinkhorn's Algorithm
(Same as slide 68, adding:)
Proposition [Franklin, Lorenz 1989]: with π^{(ℓ)} := diag(a^{(ℓ)}) K diag(b^{(ℓ)}),
‖log(π^{(ℓ)}) − log(π*)‖_∞ = O((1 − λ)^ℓ)  (the rate λ depends on ε and degrades as ε → 0).
Local rate: [Knight 2008].
[Figure: log(‖log(π^{(ℓ)}) − log(π*)‖_∞) over iterations ℓ.]

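Each Sinkhorn pass is two componentwise divisions and two matrix-vector products; a minimal numpy sketch (the grid, cost, and histograms are a toy setup of mine):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iter=2000):
    """Entropic coupling pi = diag(a) K diag(b), with K = exp(-C/eps)."""
    K = np.exp(-C / eps)
    b = np.ones_like(nu)
    for _ in range(n_iter):
        a = mu / (K @ b)          # a^{(l+1)} = mu / (K b^{(l)})
        b = nu / (K.T @ a)        # b^{(l+1)} = nu / (K* a^{(l+1)})
    return a[:, None] * K * b[None, :]

x = np.linspace(0, 1, 50)
C = (x[:, None] - x[None, :]) ** 2
mu = np.exp(-((x - 0.25) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((x - 0.75) ** 2) / 0.02); nu /= nu.sum()
pi = sinkhorn(mu, nu, C, eps=1e-3)
print((C * pi).sum())             # regularized transport cost
```
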
70. Semi-discrete OT and Entropy
Semi-discrete optimal transport is a multiclass SVM; its Sinkhorn version is a logistic regression. It decomposes the space into entropic Laguerre diagrams.
[Figure 5.3 of the course notes: entropic semi-discrete c̄-transforms g^{c̄,ε} for ε = 0, 0.01, 0.1, 0.3. Top: 1-D, ground cost c(x, y) = |x − y|, with red points at locations (y_j, −g_j). Bottom: 2-D, ground cost c(x, y) = ‖x − y‖; black curves are the level sets of g^{c̄,ε}, colors indicate the smoothed indicator functions χ_j^ε of the Laguerre cells, red points at locations y_j ∈ ℝ² with size proportional to g_j. Laguerre cells vs. "Sinkhorn" Laguerre cells.]

71. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

72. Wasserstein Barycenters
Barycenters of measures (µ_k)_k, with weights λ_k ≥ 0, Σ_k λ_k = 1:
µ* ∈ argmin_µ Σ_k λ_k W₂²(µ_k, µ)

73. Wasserstein Barycenters
(Same as slide 72, adding:)
Generalizes the Euclidean barycenter: if µ_k = δ_{x_k} then µ* = δ_{Σ_k λ_k x_k}.
[Figure: barycenter µ of µ₁, µ₂, µ₃, with distances W₂(µ_k, µ).]

74. Wasserstein Barycenters
(Same as slide 73.)

75. Wasserstein Barycenters
(Same as slide 74, adding:)
Two measures (µ₁, µ₂): McCann's displacement interpolation.

76. Wasserstein Barycenters
(Same as slide 75, with a three-measure example (µ₁, µ₂, µ₃).)

77. Wasserstein Barycenters
(Same as slide 76, adding:)
Theorem [Agueh, Carlier 2010] (for c(x, y) = ‖x − y‖²): if µ₁ does not vanish on small sets, µ* exists and is unique.

78. Displacement Interpolation

79. Displacement Interpolation
[Figure 7.2 of the course notes: displacement interpolation α_t for t = 0, 1/5, 2/5, 3/5, 4/5, 1 between discrete measures. Top: point clouds (empirical measures (α₀, α₁) with the same number of points); bottom: same but with weights.]

80. Wasserstein Barycenters
[Figure: barycenters for weights λ ∈ Σ₃, Wasserstein mean vs. L2 mean.]

81. Wasserstein Barycenters
(Same as slide 80.)

82. Regularized Barycenters
min_{(π_k)_k, µ} { Σ_k λ_k (⟨c, π_k⟩ + ε KL(π_k|π_{0,k})) ; ∀k, π_k ∈ Π(µ_k, µ) }

83. Regularized Barycenters
(Same as slide 82, adding:)
→ Need to fix a discretization grid for µ, i.e. choose (π_{0,k})_k.

84. Regularized Barycenters
(Same as slide 83, adding:)
→ Sinkhorn-like algorithm [Benamou, Carlier, Cuturi, Nenna, Peyré 2015].

85. Regularized Barycenters
(Same as slide 84, adding:)
[Solomon et al., SIGGRAPH 2015]
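
A sketch of such a Sinkhorn-like scheme via iterative Bregman projections, in the spirit of [Benamou et al. 2015] (a simplified variant of mine on a common grid, not the paper's exact pseudocode):

```python
import numpy as np

def entropic_barycenter(mus, C, eps, lam, n_iter=500):
    """mus: (K, N) histograms on a shared grid; lam: weights with sum(lam) = 1."""
    Kmat = np.exp(-C / eps)
    b = np.ones_like(mus)                    # one scaling b_k per input measure
    for _ in range(n_iter):
        a = mus / (b @ Kmat.T)               # enforce pi_k 1 = mu_k
        Kta = a @ Kmat                       # K^T a_k, for each k
        mu = np.exp(lam @ np.log(b * Kta))   # barycenter: weighted geometric mean
        b = mu[None, :] / Kta                # enforce pi_k^T 1 = mu
    return mu

# Toy usage: barycenter of two 1-D Gaussian-like histograms.
x = np.linspace(0, 1, 100)
C = (x[:, None] - x[None, :]) ** 2
m1 = np.exp(-((x - 0.2) ** 2) / 0.005); m1 /= m1.sum()
m2 = np.exp(-((x - 0.8) ** 2) / 0.005); m2 /= m2.sum()
bar = entropic_barycenter(np.stack([m1, m2]), C, eps=5e-3, lam=np.array([0.5, 0.5]))
```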

86. Barycenters of 2D Shapes

87. Barycenters of 3D Shapes

88. Barycenter on a Surface
[Figure: interpolation between µ₁ and µ₂ on a surface, weight t ∈ [0, 1].]

89. Barycenter on a Surface
(Same as slide 88, with six inputs µ₁, …, µ₆.)

90. Barycenter on a Surface
(Same as slide 89, for weights λ = (1, …, 1)/6.)

91. Color Transfer
Input images: (f, g) (chrominance components).
Input measures: µ(A) = U(f⁻¹(A)), ν(A) = U(g⁻¹(A))  (U: uniform measure on the pixel domain)

92. Color Transfer
(Same as slide 91.)

93. Color Transfer
(Same as slide 92, adding:)
[Figure: transport map T and the recolored images T ∘ f and T̃ ∘ g.]

94. Topic Models
[Rolet’16]

95. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

96. Unbalanced Transport
For (ξ, µ) ∈ M₊(X)², KL(ξ|µ) := ∫_X log(dξ/dµ) dξ + ∫_X (dµ − dξ)
WF_c(µ, ν) := min_π ⟨c, π⟩ + KL(P1♯π|µ) + KL(P2♯π|ν)   [Liero, Mielke, Savaré 2015]

97. Unbalanced Transport
(Same as slide 96, adding:)
Proposition [Liero, Mielke, Savaré 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]:
if c(x, y) = −log(cos²(min(d(x, y), π/2))) then WF_c^{1/2} is a distance on M₊(X).

98. Unbalanced Transport
(Same as slide 97, adding:)
→ "Dynamic" Benamou-Brenier formulation. [Liero, Mielke, Savaré 2015] [Kondratyev, Monsaingeon, Vorotnikov 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]
[Figure: balanced OT vs. unbalanced OT.]

99. Implicit Euler Stepping
Metric space (X, d); minimize F(x) over X.
Implicit Euler step: x_{k+1} := argmin_{x ∈ X} d(x_k, x)² + τ F(x)
[Figure: iterates x_k, x_{k+1}, x_{k+2}, x_{k+3} with balls {x ; d(x_k, x) ∼ τ}; example F(x) = ‖x‖² on (X = ℝ², ‖·‖_p) for p = 1, 2, +∞.]

100. Implicit vs. Explicit Stepping
Implicit: x_{k+1} = argmin_{x ∈ X} d(x_k, x)² + τ F(x)
Explicit: x_{k+1} = argmin_{x ∈ X} d(x_k, x)² + τ ⟨∇F(x_k), x⟩
[Figure: trajectories for p = 1, 2, +∞ in both cases.]

101. Implicit Euler step in Wasserstein space [Jordan, Kinderlehrer, Otto 1998]:
µ_{t+1} = Prox^W_{τf}(µ_t) := argmin_{µ ∈ M₊(X)} W₂²(µ_t, µ) + τ f(µ)

102. (Same as slide 101, adding:)
Formal limit τ → 0: ∂_t µ = div(µ ∇(f′(µ)))

103. (Same as slide 102, adding:)
f(µ) = ∫ log(dµ/dx) dµ  ⟹  ∂_t µ = Δµ  (heat diffusion)

104. (Same as slide 103, adding:)
f(µ) = ∫ w dµ  ⟹  ∂_t µ = div(µ ∇w)
[Figure: evolutions µ_t along ∇w.]

105. (Same as slide 104, adding:)
f(µ) = 1/(m−1) ∫ (dµ/dx)^{m−1} dµ  ⟹  ∂_t µ = Δ(µ^m)  (non-linear diffusion)

106. Eulerian vs. Lagrangian Discretization

107. Lagrangian Discretization of Entropy
µ̂_n = (1/n) Σ_{i=1}^n δ_{x_i},  with µ̂_n → µ as n → +∞.
Ĥ(µ̂_n) := Σ_i log(min_{j≠i} ‖x_i − x_j‖)  →  H(µ) := ∫ log(dµ/dx (x)) dµ(x)  as n → +∞ (up to normalizing constants)
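
A direct numpy sketch of this nearest-neighbor estimator (the toy data is assumed, and the usual normalizing constants of such estimators are omitted, as on the slide):

```python
import numpy as np

def entropy_estimate(x):
    """H_hat(mu_n) = sum_i log(min_{j != i} ||x_i - x_j||) for a point cloud x."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)             # exclude j = i
    return np.sum(0.5 * np.log(d2.min(axis=1)))   # 0.5*log(d^2) = log(d)

rng = np.random.default_rng(0)
print(entropy_estimate(rng.normal(size=(1000, 2))))
```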

108. Wasserstein flow of E:
min_ρ E(ρ) := ∫ V(x)ρ(x)dx + ∫ ρ(x) log(ρ(x))dx,  with V(x) = ‖x‖²
dρ_t/dt = Δρ_t + div(ρ_t ∇V)
[Figure: evolution of ρ_t over time.]

109. Generalized Entropic Regularization
Primal: min_π ⟨dᵖ, π⟩ + f₁(P1♯π) + f₂(P2♯π) + ε KL(π|π₀)

110. Generalized Entropic Regularization
(Same as slide 109, adding:)
Dual: max_{u,v} −f₁*(−u) − f₂*(−v) − ε ⟨e^{u/ε}, K e^{v/ε}⟩
with (a, b) := (e^{u/ε}, e^{v/ε}) and π(x, y) = a(x) K(x, y) b(y).

111. Generalized Entropic Regularization
(Same as slide 110, adding:)
Block coordinate relaxation:
(I_u): max_u −f₁*(−u) − ε ⟨e^{u/ε}, K e^{v/ε}⟩
(I_v): max_v −f₂*(−v) − ε ⟨e^{v/ε}, K* e^{u/ε}⟩

112. Generalized Entropic Regularization
(Same as slide 111, adding:)
Proposition: with Prox^{KL}_{f₁/ε}(µ) := argmin_ν f₁(ν) + ε KL(ν|µ), the solutions of (I_u) and (I_v) read
a = Prox^{KL}_{f₁/ε}(Kb) / (Kb)   and   b = Prox^{KL}_{f₂/ε}(K*a) / (K*a)

113. Generalized Entropic Regularization
(Same as slide 112, adding:)
→ Only matrix-vector multiplications. → Highly parallelizable.
→ On regular grids: only convolutions! Linear-time iterations.
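
As one concrete instance (a sketch of mine, not the slides' code): for the unbalanced case f_i = ρ KL(·|µ_i), the KL prox has a closed form and the scaling updates reduce to power iterations a = (µ/(Kb))^{ρ/(ρ+ε)}, a standard derivation stated here without proof.

```python
import numpy as np

def unbalanced_sinkhorn(mu, nu, C, eps, rho, n_iter=2000):
    """Generalized Sinkhorn for f_i = rho*KL(.|mu_i); returns the coupling."""
    K = np.exp(-C / eps)
    lam = rho / (rho + eps)
    a, b = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iter):
        a = (mu / (K @ b)) ** lam      # prox step for f1 = rho*KL(.|mu)
        b = (nu / (K.T @ a)) ** lam    # prox step for f2 = rho*KL(.|nu)
    return a[:, None] * K * b[None, :]

x = np.linspace(0, 1, 60)
C = (x[:, None] - x[None, :]) ** 2
mu = np.exp(-((x - 0.3) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((x - 0.7) ** 2) / 0.01); nu = 0.5 * nu / nu.sum()  # unequal masses
pi = unbalanced_sinkhorn(mu, nu, C, eps=1e-2, rho=1.0)
print(pi.sum(axis=1).sum(), pi.sum(axis=0).sum())  # marginals need not match mu, nu
```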

114. µ_{t+1} := argmin_µ W_ε^α(µ_t, µ) + τ f(µ)
Congestion-inducing function: f(µ) = ι_{[0,κ]}(µ) + ⟨w, µ⟩   [Maury, Roudneff-Chupin, Santambrogio 2010]

115. (Same as slide 114, adding:)
Proposition: Prox^{KL}_{f/ε}(µ) = min(e^{−w/ε} µ, κ)

116. (Same as slide 115, adding:)
[Figure: evolutions along ∇w for κ = ‖µ_{t=0}‖_∞, 2‖µ_{t=0}‖_∞, 4‖µ_{t=0}‖_∞.]

117. Crowd Motion on a Surface
X = triangulated mesh.
[Figure: potential w (cos(w) displayed); evolutions for κ = ‖µ_{t=0}‖_∞ and κ = 6‖µ_{t=0}‖_∞.]

118. Crowd Motion on a Surface
(Same as slide 117.)

119. Gradient Flows: Crowd Motion with Obstacles
X = sub-domain of ℝ².
[Figure: potential w (cos(w) displayed); evolutions for κ = ‖µ_{t=0}‖_∞, 2‖µ_{t=0}‖_∞, 4‖µ_{t=0}‖_∞, 6‖µ_{t=0}‖_∞.]

120. Gradient Flows: Crowd Motion with Obstacles
(Same as slide 119.)

121. (µ_{1,t+1}, µ_{2,t+1}) := argmin_{(µ₁,µ₂)} W_ε^α(µ_{1,t}, µ₁) + W_ε^α(µ_{2,t}, µ₂) + τ f(µ₁, µ₂)

122. Wasserstein attraction:
(Same as slide 121, adding:)
f(µ₁, µ₂) = W_ε^α(µ₁, µ₂) + h₁(µ₁) + h₂(µ₂), with h_i(µ) = ⟨w, µ⟩
[Figure: evolution along ∇w.]

123. Wasserstein attraction:
(Same as slide 122, adding:)
or h_i(µ) = ι_{[0,κ]}(µ).

124. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

125. Discriminative vs. Generative Models
Generative: g_θ : Z → X, from low dimension (z) to high dimension (x).
Discriminative: d_ξ : X → Z.

126. Discriminative vs. Generative Models
(Same as slide 125, adding:)
Supervised: classification, z = class probability.
→ Learn d_ξ from labeled data (x_i, z_i)_i.

127. Discriminative vs. Generative Models
(Same as slide 126, adding:)
Un-supervised: compression, z = d_ξ(x) is a representation; generation, x = g_θ(z) is a synthesis.
→ Learn (g_θ, d_ξ) from data (x_i)_i.

128. Discriminative vs. Generative Models
(Same as slide 127, adding:)
Density fitting: g_θ({z_i}_i) ≈ {x_i}_i.   Auto-encoders: g_θ(d_ξ(x_i)) ≈ x_i.

129. Discriminative vs. Generative Models
(Same as slide 128, adding:)
Optimal transport map d_ξ.

130. Density Fitting and Generative Models
Parametric model: θ ↦ µ_θ. Observations: ν = (1/n) Σ_{i=1}^n δ_{x_i}.

131. Density Fitting and Generative Models
(Same as slide 130, adding:)
Density fitting when dµ_θ(y) = f_θ(y)dy: maximum likelihood (MLE)
min_θ KL̂(ν|µ_θ) := −Σ_i log(f_θ(x_i))

132. Density Fitting and Generative Models
(Same as slide 131, adding:)
Generative model fit: µ_θ = g_{θ,♯}ζ.
→ KL̂(µ_θ|ν) = +∞: MLE undefined. → Need a weaker metric.

133. Loss Functions for Measures
Density fitting: min_θ D(µ_θ, ν)
Optimal transport distances, for µ = (1/N) Σ_i δ_{x_i}, ν = (1/P) Σ_j δ_{y_j}:
W(µ, ν)^p := min_{T ∈ C_{µ,ν}} Σ_{i,j} T_{i,j} ‖x_i − y_j‖^p

134. Loss Functions for Measures
(Same as slide 133, adding:)
Maximum Mean Discrepancy (MMD):
‖µ − ν‖_k² := (1/N²) Σ_{i,i′} k(x_i, x_{i′}) + (1/P²) Σ_{j,j′} k(y_j, y_{j′}) − (2/(NP)) Σ_{i,j} k(x_i, y_j)
Gaussian: k(x, y) = e^{−‖x−y‖²/(2σ²)}.  Energy distance: k(x, y) = −‖x − y‖₂.

135. Loss Functions for Measures
(Same as slide 134, adding:)
Sinkhorn divergences [Cuturi 13]:
W_ε(µ, ν)^p := Σ_{i,j} T^ε_{i,j} ‖x_i − y_j‖^p
W̄_ε(µ, ν)^p := W_ε(µ, ν)^p − ½ W_ε(µ, µ)^p − ½ W_ε(ν, ν)^p

136. Loss Functions for Measures
(Same as slide 135, adding:)
Theorem [Ramdas, G. Trillos, Cuturi 17]:
W̄_ε(µ, ν)^p → W(µ, ν)^p as ε → 0, and W̄_ε(µ, ν)^p → ‖µ − ν‖_k² as ε → +∞, for k(x, y) = −‖x − y‖^p.

137. Loss Functions for Measures
(Same as slide 136, adding:)
Best of both worlds → cross-validate ε:
– Scale free (no σ, no heavy-tail kernel).
– Non-Euclidean, arbitrary ground distance.
– No curse of dimension (low sample complexity).
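
A sketch of the Sinkhorn divergence on point clouds, reusing plain Sinkhorn iterations (the uniform weights and toy inputs are my assumptions):

```python
import numpy as np

def w_eps(x, y, eps, p=2, n_iter=500):
    """W_eps(mu, nu)^p = <C, T_eps> for uniform empirical measures on x, y."""
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p
    mu, nu = np.full(len(x), 1 / len(x)), np.full(len(y), 1 / len(y))
    K = np.exp(-C / eps)
    b = np.ones_like(nu)
    for _ in range(n_iter):
        a = mu / (K @ b)
        b = nu / (K.T @ a)
    return np.sum((a[:, None] * K * b[None, :]) * C)

def sinkhorn_divergence(x, y, eps, p=2):
    """bar-W_eps = W_eps(mu,nu)^p - W_eps(mu,mu)^p/2 - W_eps(nu,nu)^p/2."""
    return (w_eps(x, y, eps, p)
            - 0.5 * w_eps(x, x, eps, p) - 0.5 * w_eps(y, y, eps, p))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(100, 2)), rng.normal(0.5, 1.0, size=(120, 2))
print(sinkhorn_divergence(x, y, eps=0.1))
```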

138. Deep Discriminative vs. Generative Models
Deep networks:
g_θ(z) = ρ(θ_K(… ρ(θ₂(ρ(θ₁(z)))) …))
d_ξ(x) = ρ(ξ_K(… ρ(ξ₂(ρ(ξ₁(x)))) …))
[Diagram: generative g_θ : z ↦ x, discriminative d_ξ : x ↦ z.]

139. Deep Discriminative vs. Generative Models
(Same as slide 138, with the diagram Z ⇄ X through g_θ and d_ξ.)

140. Examples of Image Generation
g_θ : Z → X
[Credit: arXiv:1511.06434]

141. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

142. Gromov-Wasserstein
Unregistered spaces X, Y.
Inputs: {(similarity/kernel matrix, histogram)}

143. Gromov-Wasserstein
(Same as slide 142, adding:)
Def. Gromov-Wasserstein distance [Mémoli 2011] [Sturm 2012]: …

144. Gromov-Wasserstein
(Same as slide 143, adding:)
→ NP-hard in general. → Need for a fast approximate solver.

145. Gromov-Wasserstein as a Metric
[Figure: points x_i ∈ X matched to points y_j ∈ Y.]
Def. …

146. Gromov-Wasserstein as a Metric
(Same as slide 145, adding:)
Def. Isometries f on M: … ⟺ …
[Figure: isometric deformations f between X and Y.]

147. Gromov-Wasserstein as a Metric
(Same as slide 146, adding:)
Prop. [Mémoli 2011]: …
→ "bending-invariant" object recognition.

148. For Arbitrary Spaces
Metric-measure spaces (X, Y): (d_X, µ), (d_Y, ν).

149. For Arbitrary Spaces
(Same as slide 148, adding:)
Def. Gromov-Wasserstein distance [Mémoli 2011] [Sturm 2012]:
GW₂²(d_X, µ, d_Y, ν) := min_{π ∈ Π(µ,ν)} ∫_{X²×Y²} |d_X(x, x′) − d_Y(y, y′)|² dπ(x, y) dπ(x′, y′)

150. For Arbitrary Spaces
(Same as slide 149, adding:)
Prop. GW is a distance on mm-spaces up to isometries.
→ "bending-invariant" object recognition.

151. For Arbitrary Spaces
(Same as slide 150, adding:)
→ QAP: NP-hard in general. → Need for a fast approximate solver.

152. Entropic Gromov-Wasserstein
Def. Entropic Gromov-Wasserstein: …
Projected mirror descent.

153. Entropic Gromov-Wasserstein
(Same as slide 152, adding:)
Prop. For τ = 1/ε, the iteration reads:
func T = GW(C, C̄, p, q)
  initialization: …
  repeat:
    …
  until convergence.
  return T
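
A sketch filling in the elided steps with the standard square-loss iteration (each outer step builds the local cost L(C, C̄)⊗T and re-solves an entropic OT with Sinkhorn; this reconstruction is mine, valid under the assumption that T stays in Π(p, q)):

```python
import numpy as np

def sinkhorn_coupling(p, q, M, eps, n_iter=200):
    K = np.exp(-M / eps)
    b = np.ones_like(q)
    for _ in range(n_iter):
        a = p / (K @ b)
        b = q / (K.T @ a)
    return a[:, None] * K * b[None, :]

def entropic_gw(C, Cb, p, q, eps, n_outer=50):
    """Entropic GW for the square loss L(a, b) = (a - b)^2."""
    T = np.outer(p, q)                        # initialization
    const = (C**2 @ p)[:, None] + (Cb**2 @ q)[None, :]
    for _ in range(n_outer):
        M = const - 2.0 * C @ T @ Cb.T        # local cost (L(C, Cb) ⊗ T)_{ij}
        T = sinkhorn_coupling(p, q, M, eps)   # mirror-descent / Sinkhorn step
    return T

rng = np.random.default_rng(0)
x, y = rng.random((30, 2)), rng.random((40, 3))   # two unregistered spaces
C  = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
Cb = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
T = entropic_gw(C, Cb, np.full(30, 1/30), np.full(40, 1/40), eps=0.05)
```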

154. Applications of GW: Shape Analysis
Use T to define a registration between: shape ↔ shape, shape ↔ colors distribution.
[Figure 1 of the accompanying paper: entropic GW finds correspondences between a source surface and targets (a surface with similar structure, a surface with shared semantic structure, a noisy 3D point cloud, an icon, and a hand drawing); each fuzzy map computed with the same code. Excerpts: an entropically-regularized version of the GW objective from [Mémoli 2011] measures distortion of geodesic distances; the optimizer is a "fuzzy" correspondence matrix whose sharpness is controlled by the weight of the entropic regularizer; supervised matching can incorporate ground-truth matches via a sparsity stencil or symmetric weight matrices (W₀, W) that prioritize preserving selected distances. Figure 15: MDS embedding of four SHREC classes (teddies, humans, four-legged). Figure 16: recovery of a galloping horse sequence, without the ground-truth maps required by prior work. Figure 18: mapping a set of 185 images onto two shapes while preserving color similarity.]

155. Applications of GW: Shape Analysis
(Same as slide 154, adding:)
Visualization of a shape collection (X_s)_s: geodesic distances → GW distances → MDS embedding in 2-D/3-D.

156. Applications of GW: Quantum Chemistry
Regression problem.
→ Obtaining f by solving a DFT approximation is too costly.

157. Applications of GW: Quantum Chemistry
(Same as slide 156, adding:)
[Rupp et al. 2012] [Figure: matrix representation of a molecule.]

158. Applications of GW: Quantum Chemistry
(Same as slide 157, adding:)
GW-interpolation.

159. Gromov-Wasserstein Geodesics
Def. Gromov-Wasserstein geodesic: …

160. Gromov-Wasserstein Geodesics
(Same as slide 159, adding:)
Prop. [Sturm 2012]: …

161. Gromov-Wasserstein Geodesics
(Same as slide 160, adding:)
→ X × Y is not practical for most applications (need to fix the size of the geodesic embedding space).
→ Extension to more than 2 input spaces?

162. Gromov-Wasserstein Barycenters
Input: three mm-spaces (s = 1, 2, 3).
Def. GW barycenters: …

163. Gromov-Wasserstein Barycenters
(Same as slide 162, adding:)
Alternating minimization:
  initialization: C ← C₀
  repeat:
    for s = 1 to S do: minimize on T_s
    update C
  until convergence.
  return C

164. Conclusion: Toward High-dimensional OT
Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani.
[Figure: transport framework, sliced Wasserstein projection, applications; application to color transfer: source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of Y, source image after color transfer. J. Rabin, Wasserstein Regularization]