École Normale Supérieure

Outline
• What is the Lasso
• Lasso with an orthogonal design
• From projected gradient to proximal gradient
• Optimality conditions and subgradients (LARS)
• Coordinate descent algorithm
… with some demos: www.numerical-tours.com and http://optimaltransport.github.io
Wasserstein Regularization (J. Rabin)
Figure: sliced Wasserstein projection of a source image X onto the color statistics Y of a style image; source image after color transfer.
→ images, vision, graphics and machine learning, …
• Probability distributions and histograms
• Optimal transport
Figure: optimal transport mean vs. L2 mean.
Measures

• Discrete measure: $\mu = \sum_i \mu_i \delta_{x_i}$, with $\int_X g\,d\mu = \sum_i \mu_i\, g(x_i)$.
• Measure with density: $d\mu(x) = m(x)\,dx$, with $\int_X g\,d\mu = \int_X g(x)\,m(x)\,dx$; the density w.r.t. the Lebesgue measure is also written $\rho_\mu = \frac{d\mu}{dx}$.
• Measure of a set $A \subset X$: $\mu(A) = \int_A d\mu(x) \geq 0$; integration against continuous functions: $\int_X g(x)\,d\mu(x)$.
• Probability (normalized) measure: $\mu(X) = \int_X d\mu(x) = 1$.

Figure: schematic display of discrete distributions $\alpha = \sum_{i=1}^n a_i \delta_{x_i}$ (red: empirical uniform weights $a_i = 1/n$; blue: arbitrary weights) and densities $d\alpha(x) = \rho_\alpha(x)\,dx$ (violet), in 1-D and 2-D. Discrete distributions are displayed in 1-D as vertical segments (length $a_i$) and in 2-D as point clouds (radius $a_i$).
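To make the two representations concrete, here is a minimal NumPy sketch (not from the slides; the arrays, the test function and the density are illustrative) that integrates a function against a discrete measure and against a density discretized on a grid.

```python
import numpy as np

# Discrete measure mu = sum_i mu_i * delta_{x_i} on X = [0, 1]
x = np.array([0.1, 0.4, 0.7])        # support points x_i
w = np.array([0.2, 0.5, 0.3])        # weights mu_i, summing to 1

g = lambda t: np.cos(2 * np.pi * t)  # test function

int_discrete = np.sum(w * g(x))      # int g dmu = sum_i mu_i g(x_i)

# Measure with density dmu(x) = m(x) dx, discretized on a grid of [0, 1]
t = np.linspace(0, 1, 1000)
m = 2 * t                            # density m(x) = 2x (integrates to 1 on [0, 1])
int_density = np.trapz(g(t) * m, t)  # int g(x) m(x) dx by quadrature

print(int_discrete, int_density)
```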
Measures and Random Variables

A random variable $X$ has law $\mu$ when $\mu(A) = P(X \in A)$ for every set $A$.
• Convergence in law: $P(X_n \in A) \xrightarrow{n\to+\infty} P(X \in A)$ for all sets $A$.
• Weak* convergence of the laws: $\int f\,d\mu_n \xrightarrow{n\to+\infty} \int f\,d\mu$ for all continuous functions $f$.
Discretization: Lagrangian (point clouds) vs. Eulerian (histograms)

• Lagrangian: $\mu = \sum_{i=1}^N \mu_i \delta_{x_i}$, $x_i \in X$, $\sum_i \mu_i = 1$; typically constant weights $\mu_i = 1/N$ and free positions $x_i$ (parameterized by $X^N/\Sigma_N$).
• Eulerian: fixed positions $x_i$ (e.g. a grid); the weights live in the convex polytope (simplex) $\{(\mu_i)_i \geq 0 \ ; \ \sum_i \mu_i = 1\}$.
Push Forward

Radon measures $(\mu, \nu)$ on $(X, Y)$. Transfer of measure by $f : X \to Y$: the push forward $\nu = f_\sharp\mu$ is defined by
$\nu(A) \stackrel{\text{def.}}{=} \mu(f^{-1}(A)) \iff \int_Y g(y)\,d\nu(y) \stackrel{\text{def.}}{=} \int_X g(f(x))\,d\mu(x).$
Smooth densities $d\mu = \rho(x)\,dx$, $d\nu = \xi(x)\,dx$:
$f_\sharp\mu = \nu \iff \xi(f(x))\,|\det(\partial f(x))| = \rho(x).$
Push-forward vs. Pull-back

For $f : X \to Y$:
• push-forward $f_\sharp : \mathcal{M}(X) \to \mathcal{M}(Y)$, with $f_\sharp\big(\sum_i \mu_i \delta_{x_i}\big) \stackrel{\text{def.}}{=} \sum_i \mu_i \delta_{f(x_i)}$;
• pull-back $f^\sharp : \mathcal{C}(Y) \to \mathcal{C}(X)$, with $f^\sharp\varphi \stackrel{\text{def.}}{=} \varphi \circ f$.
Remark: $f_\sharp$ and $f^\sharp$ are adjoint: $\int_Y \varphi\,d(f_\sharp\mu) = \int_X (f^\sharp\varphi)\,d\mu$.
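A minimal sketch of the discrete case (illustrative values, not from the slides): the push-forward of a discrete measure keeps the weights and transports the support points, and the adjunction identity can be checked numerically.

```python
import numpy as np

# Discrete measure mu = sum_i w_i delta_{x_i} on R
x = np.array([0.0, 1.0, 2.0])
w = np.array([0.5, 0.3, 0.2])

f = lambda t: t**2 + 1.0      # map f : X -> Y
phi = lambda s: np.sin(s)     # test function on Y

# Push-forward f_sharp(mu): same weights, transported support f(x_i)
y = f(x)

# Adjunction: int phi d(f_sharp mu) == int (phi o f) dmu
lhs = np.sum(w * phi(y))
rhs = np.sum(w * phi(f(x)))
assert np.isclose(lhs, rhs)
```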
Convergence of Random Variables

• In mean: $\lim_{n\to+\infty} \mathbb{E}(|X_n - X|^p) = 0$.
• In probability: $\forall \varepsilon > 0,\ P(|X_n - X| > \varepsilon) \xrightarrow{n\to+\infty} 0$.
• In law: $P(X_n \in A) \xrightarrow{n\to+\infty} P(X \in A)$ (the $X_n$ can be defined on different spaces).
Hierarchy: almost sure $\Rightarrow$ in probability $\Rightarrow$ in law, and in mean $\Rightarrow$ in probability.
Optimal Maps: Brenier's Theorem

Theorem [Brenier]: for $c(x, y) = \|x - y\|^2$ and $(\mu, \nu)$ with densities, there exists a unique optimal map $f$. One has $f = \nabla\varphi$, where $\varphi$ is the unique (up to an additive constant) convex function such that $(\nabla\varphi)_\sharp\mu = \nu$.
Monge-Ampère equation (with $d\mu = \rho\,dx$, $d\nu = \xi\,dx$): $\xi(\nabla\varphi(x))\,\det(\partial^2\varphi(x)) = \rho(x)$.
Without densities, non-uniqueness / non-existence of an optimal map can occur (see the figure: examples of discrete configurations with several optimal maps $f, f'$, or with no transport map at all).
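In 1-D with the quadratic cost, the Brenier map is the monotone rearrangement $T = F_\nu^{-1} \circ F_\mu$. The sketch below (illustrative distributions and sample sizes) approximates it from equal-size samples simply by matching sorted points with sorted points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(0.0, 1.0, n)        # samples of mu
y = rng.exponential(1.0, n)        # samples of nu

xs, ys = np.sort(x), np.sort(y)    # sorted supports

def T(t):
    """Monotone rearrangement: empirical quantile of nu at the empirical CDF of mu."""
    rank = np.clip(np.searchsorted(xs, t), 0, n - 1)
    return ys[rank]

# The push-forward of the mu-samples by T has (approximately) the law of nu
push = T(x)
print(push.mean(), y.mean())       # both close to 1 (mean of Exp(1))
```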
Metrics on the Space of Measures

• Kullback-Leibler divergence: $D_{\mathrm{KL}}(\mu, \tilde\mu) = \int \rho(x)\,\log\frac{\rho(x)}{\tilde\rho(x)}\,dx$.
• Hellinger distance: $D_{\mathrm{H}}(\mu, \tilde\mu)^2 = \int \big(\sqrt{\rho(x)} - \sqrt{\tilde\rho(x)}\big)^2\,dx$.
Csiszár Divergences vs. Dual Norms

• Csiszár divergences $D_\varphi$: strong topology → KL, TV, $\chi^2$, Hellinger, …
• Dual norms $\|\cdot\|_B$: weak topology → $W_1$, flat norm, RKHS, energy distance, …
$\|\alpha - \beta\|_B \stackrel{\text{def.}}{=} \max_{f \in B} \int_X f(x)\,(d\alpha(x) - d\beta(x)).$
Dual Norms (aka Integral Probability Metrics)

$\|\alpha - \beta\|_B \stackrel{\text{def.}}{=} \max\Big\{ \int_X f(x)\,(d\alpha(x) - d\beta(x)) \ ; \ f \in B \Big\}$
• Wasserstein-1: $B = \{ f \ ; \ \|\nabla f\|_\infty \leq 1 \}$.
• Flat norm: $B = \{ f \ ; \ \|f\|_\infty \leq 1,\ \|\nabla f\|_\infty \leq 1 \}$.
• RKHS: $B = \{ f \ ; \ \|f\|_k \leq 1 \}$, for which
$\|\alpha - \beta\|_B^2 = \int k(x, x')\,d\alpha(x)\,d\alpha(x') + \int k(y, y')\,d\beta(y)\,d\beta(y') - 2\int k(x, y)\,d\alpha(x)\,d\beta(y).$
Energy distance: $k(x, y) = -\|x - y\|$; Gaussian kernel: $k(x, y) = e^{-\|x - y\|^2 / (2\sigma^2)}$.
Figure: comparison of the energy-distance, Gaussian-RKHS, $W_1$ and flat norms between two measures as a function of their separation $t$.
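The RKHS dual norm (MMD) between two empirical measures only requires kernel evaluations. A minimal sketch (point clouds, kernels and bandwidth are illustrative), for both the Gaussian kernel and the energy-distance kernel:

```python
import numpy as np

def mmd2(x, y, k):
    """Squared kernel dual norm ||alpha - beta||_B^2 between empirical measures
    alpha = mean of Diracs at rows of x, beta = mean of Diracs at rows of y."""
    kxx = k(x[:, None, :], x[None, :, :]).mean()
    kyy = k(y[:, None, :], y[None, :, :]).mean()
    kxy = k(x[:, None, :], y[None, :, :]).mean()
    return kxx + kyy - 2 * kxy

gauss  = lambda a, b, s=1.0: np.exp(-np.sum((a - b)**2, axis=-1) / (2 * s**2))
energy = lambda a, b: -np.sqrt(np.sum((a - b)**2, axis=-1) + 1e-12)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, (300, 2))
y = rng.normal(0.5, 1.0, (300, 2))   # translated cloud

print(mmd2(x, y, gauss), mmd2(x, y, energy))
```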
Formulations
• Linear Programming and Semi-discrete
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein
Algorithms

Discrete measures: $\mu = \sum_{i=1}^{N_1} p_i \delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} q_j \delta_{y_j}$.
• Assignment case (uniform weights, $N_1 = N_2 = N$): $\mu = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$, $\nu = \frac{1}{N}\sum_{j=1}^N \delta_{y_j}$; Hungarian/Auction algorithms, $\sim O(N^3)$.
• Network simplex: the excerpted benchmark reports an empirical $O(N^2)$ behaviour at the scale of thousands of particles, a major gain over the transport simplex used by earlier EMD-based retrieval and color-transfer techniques [Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003]; fixed-point precision further speeds up the computation without changing the particle pairing.
• PDE formulations: Monge-Ampère / Benamou-Brenier, for $d = \|\cdot\|_2^2$.
• Semi-discrete: Laguerre cells, $d = \|\cdot\|_2^2$ [Lévy '15].
Figures: log-log running times of network simplex (fixed point / double precision) vs. transport simplex; synthetic 2-D displacement interpolations for $\alpha = 1/4, 1/2, 3/4$.
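For uniform point clouds of the same size, OT reduces to an optimal assignment. A minimal sketch using SciPy's Hungarian-type solver (clouds and sizes are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N = 200
X = rng.random((N, 2))             # support of mu (uniform weights 1/N)
Y = rng.random((N, 2)) + 0.5       # support of nu (uniform weights 1/N)

# Cost matrix C_ij = ||x_i - y_j||^2
C = np.sum((X[:, None, :] - Y[None, :, :])**2, axis=-1)

# Optimal assignment (a permutation sigma); the optimal coupling is
# T_{i, sigma(i)} = 1/N and W_2^2(mu, nu) = (1/N) sum_i C[i, sigma(i)]
row, col = linear_sum_assignment(C)
W22 = C[row, col].mean()
print("W_2^2 =", W22)
```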
c-Transforms and the Semi-discrete Formulation

• $c$-transform: $\forall y \in \mathcal{Y},\ f^c(y) \stackrel{\text{def.}}{=} \inf_{x \in \mathcal{X}} c(x, y) - f(x)$; $\bar c$-transform: $\forall x \in \mathcal{X},\ g^{\bar c}(x) \stackrel{\text{def.}}{=} \inf_{y \in \mathcal{Y}} c(x, y) - g(y)$, where $\bar c(y, x) \stackrel{\text{def.}}{=} c(x, y)$.
• Alternate maximization of the dual $\mathcal{E}(f, g)$ over $f$ or $g$ yields $f^c \in \operatorname{argmax}_g \mathcal{E}(f, g)$ and $g^{\bar c} \in \operatorname{argmax}_f \mathcal{E}(f, g)$; these transforms extend solutions defined on the supports of $(\alpha, \beta)$ to the whole spaces. Functions of the form $f^c$ and $g^{\bar c}$ are called $c$-concave and $\bar c$-concave; for $c(x, y) = \langle x, y\rangle$ on $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$ this coincides with usual concavity, and for $c(x, y) = \|x - y\|^p$ the $c$-transform is an inf-convolution ("Hopf-Lax formula").
• The dual can thus be written over a single potential:
$\mathcal{L}_c(\alpha, \beta) = \max_{f \in \mathcal{C}(\mathcal{X})} \int_{\mathcal{X}} f\,d\alpha + \int_{\mathcal{Y}} f^c\,d\beta = \max_{g \in \mathcal{C}(\mathcal{Y})} \int_{\mathcal{X}} g^{\bar c}\,d\alpha + \int_{\mathcal{Y}} g\,d\beta.$
• Semi-discrete case: for $\beta = \sum_j b_j \delta_{y_j}$ discrete, restrict the minimization to the support of $\beta$:
$\forall g \in \mathbb{R}^m,\ \forall x \in \mathcal{X},\ g^{\bar c}(x) \stackrel{\text{def.}}{=} \min_{j} c(x, y_j) - g_j,$
which maps a vector $g$ to a continuous function $g^{\bar c} \in \mathcal{C}(\mathcal{X})$, turning the dual into a finite-dimensional optimization problem.
Figure 5.1: examples of semi-discrete $\bar c$-transforms $g^{\bar c}$ in 1-D for ground costs $|x - y|^p$, $p \in \{1/2, 1, 3/2, 2\}$.
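The discrete $\bar c$-transform is a pointwise minimum over the support of $\beta$ and is trivial to evaluate on a grid. A minimal sketch (support points, potential values and cost exponent are illustrative):

```python
import numpy as np

# Discrete measure beta = sum_j b_j delta_{y_j} on [0, 1]
y = np.array([0.2, 0.5, 0.9])         # support y_j
g = np.array([0.0, 0.1, -0.05])       # dual potential g_j (illustrative values)

def cbar_transform(x, y, g, p=2.0):
    """g^{bar c}(x) = min_j c(x, y_j) - g_j for the cost c(x, y) = |x - y|^p."""
    c = np.abs(x[:, None] - y[None, :]) ** p   # c(x_i, y_j)
    return np.min(c - g[None, :], axis=1)

xs = np.linspace(0, 1, 400)
print(cbar_transform(xs, y, g)[:5])
```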
Figure 5.2: iterations $\ell = 3, 50, 100$ and final matching of the semi-discrete OT algorithm minimizing the dual energy by simple gradient descent. The support $(y_j)_j$ of the discrete measure $\beta$ is shown as red points, the continuous measure $\alpha$ is uniform on a square, and the blue cells display the Laguerre partition $(L_{g^{(\ell)}}(y_j))_j$ for the dual potential $g^{(\ell)}$ at iteration $\ell$.
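A minimal Monte-Carlo sketch of this gradient scheme, assuming $\alpha$ uniform on the unit square and quadratic cost (sample size, step size and iteration count are illustrative): the gradient of the concave dual energy in $g_j$ is $b_j - \alpha(L_j(g))$, estimated here by sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
Y = rng.random((m, 2))              # support y_j of the discrete measure beta
b = np.full(m, 1.0 / m)             # weights b_j
g = np.zeros(m)                     # dual potential, to be optimized

n_samples, tau = 5000, 0.5
for it in range(200):
    # Monte-Carlo estimate of the alpha-mass of each Laguerre cell
    X = rng.random((n_samples, 2))                       # alpha uniform on the square
    C = np.sum((X[:, None, :] - Y[None, :, :])**2, -1)   # c(x, y_j) = ||x - y_j||^2
    cell = np.argmin(C - g[None, :], axis=1)             # Laguerre cell containing x
    mass = np.bincount(cell, minlength=m) / n_samples    # approx. alpha(L_j(g))
    # Gradient ascent on the concave dual: d/dg_j = b_j - alpha(L_j(g))
    g += tau * (b - mass)

print("cell masses:", np.round(mass, 3))                 # should approach b = 1/m
```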
Entropic Regularization

• Simplex: $\Sigma_N \stackrel{\text{def.}}{=} \{ p \in \mathbb{R}_+^N \ ; \ \sum_i p_i = 1 \}$.
• Entropy of $T \in \mathbb{R}_+^{N \times N}$: $H(T) \stackrel{\text{def.}}{=} -\sum_{i,j} T_{i,j}\,(\log(T_{i,j}) - 1)$.
• Couplings between histograms $p \in \Sigma_{N_1}$ and $q \in \Sigma_{N_2}$: $\mathcal{C}_{p,q} \stackrel{\text{def.}}{=} \{ T \in (\mathbb{R}_+)^{N_1 \times N_2} \ ; \ T\mathbb{1}_{N_2} = p,\ T^\top\mathbb{1}_{N_1} = q \}$, with $\mathbb{1}_N \stackrel{\text{def.}}{=} (1, \dots, 1)^\top \in \mathbb{R}^N$.
• Tensor-matrix multiplication: $\mathcal{L} \otimes T \stackrel{\text{def.}}{=} \big( \sum_{k,\ell} \mathcal{L}_{i,j,k,\ell}\,T_{k,\ell} \big)_{i,j}$.
Def. Regularized OT [Cuturi NIPS'13]: $\min_{T \in \mathcal{C}_{p,q}} \langle c, T\rangle - \varepsilon H(T)$.
Figure: impact of the regularization strength $\varepsilon$ on the solution $T_\varepsilon$.
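The regularized problem is solved by Sinkhorn's matrix-scaling iterations. A minimal NumPy sketch (histograms, cost and $\varepsilon$ are illustrative; no log-domain stabilization):

```python
import numpy as np

def sinkhorn(p, q, C, eps, n_iter=500):
    """Entropic OT: returns the coupling T = diag(a) K diag(b) with K = exp(-C/eps),
    whose marginals converge to (p, q) as the iterations proceed."""
    K = np.exp(-C / eps)
    a = np.ones_like(p)
    for _ in range(n_iter):
        b = q / (K.T @ a)
        a = p / (K @ b)
    return a[:, None] * K * b[None, :]

# Small example: two histograms on a 1-D grid
n = 50
x = np.linspace(0, 1, n)
p = np.exp(-(x - 0.25)**2 / 0.01); p /= p.sum()
q = np.exp(-(x - 0.75)**2 / 0.02); q /= q.sum()
C = (x[:, None] - x[None, :])**2

T = sinkhorn(p, q, C, eps=0.01)
print(np.abs(T.sum(1) - p).max(), (T * C).sum())   # marginal error, transport cost
```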
Semi-discrete OT and Entropy

• Semi-discrete OT decomposes the space into Laguerre cells; its entropic (Sinkhorn) version decomposes it into smoothed ("entropic") Laguerre diagrams.
• Analogy: semi-discrete optimal transport is a multiclass SVM; its Sinkhorn version is logistic regression.
Figure 5.3: entropic semi-discrete $\bar c$-transforms $g^{\bar c,\varepsilon}$ for $\varepsilon \in \{0, 0.01, 0.1, 0.3\}$. Top: 1-D, ground cost $c(x, y) = |x - y|$; the red points are at locations $(y_j, -g_j)_j$. Bottom: 2-D, ground cost $c(x, y) = \|x - y\|$; the black curves are level sets of $g^{\bar c,\varepsilon}$, the colors indicate the smoothed indicator functions $\chi_j^\varepsilon$ of the Laguerre cells, and the red points are at $y_j \in \mathbb{R}^2$ with size proportional to $g_j$.
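The entropic transform replaces the hard minimum by a soft-min (log-sum-exp). A minimal sketch, assuming the weighted form $g^{\bar c,\varepsilon}(x) = -\varepsilon \log \sum_j b_j\, e^{(g_j - c(x, y_j))/\varepsilon}$ (weights, potentials and cost are illustrative); it recovers the hard $\bar c$-transform as $\varepsilon \to 0$.

```python
import numpy as np
from scipy.special import logsumexp

def soft_cbar_transform(x, y, g, b, eps):
    """Entropic semi-discrete transform (soft-min over the support of beta)."""
    c = np.abs(x[:, None] - y[None, :])            # cost c(x, y) = |x - y|
    return -eps * logsumexp((g[None, :] - c) / eps, b=b[None, :], axis=1)

y = np.array([0.2, 0.5, 0.9])
g = np.array([0.0, 0.1, -0.05])
b = np.full(3, 1 / 3)
xs = np.linspace(0, 1, 5)
for eps in [0.3, 0.1, 0.01]:
    print(eps, np.round(soft_cbar_transform(xs, y, g, b, eps), 3))
```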
Wasserstein Barycenters

Barycenter of measures $(\mu_k)_k$ with weights $\lambda_k \geq 0$, $\sum_k \lambda_k = 1$:
$\mu^\star \in \operatorname{argmin}_{\mu} \sum_k \lambda_k W_2^2(\mu_k, \mu).$
• Generalizes the Euclidean barycenter: if $\mu_k = \delta_{x_k}$ then $\mu^\star = \delta_{\sum_k \lambda_k x_k}$.
• For two measures, this recovers McCann's displacement interpolation.
Theorem [Agueh, Carlier 2010] (for $c(x, y) = \|x - y\|^2$): if $\mu_1$ does not vanish on small sets, then $\mu^\star$ exists and is unique.
Displacement Interpolation

Figure 7.2: displacement interpolation between discrete measures, shown at $\alpha_{3/5}$, $\alpha_{4/5}$, $\alpha_1$. Top: point clouds (empirical measures $(\alpha_0, \alpha_1)$ with the same number of points). Bottom: same but with general weights.
Entropic Regularization of Barycenters

$\min_{(\pi_k)_k, \mu} \Big\{ \sum_k \lambda_k \big( \langle c, \pi_k\rangle + \varepsilon\,\mathrm{KL}(\pi_k | \pi_{0,k}) \big) \ ; \ \forall k,\ \pi_k \in \Pi(\mu_k, \mu) \Big\}$
The regularization requires choosing reference couplings $(\pi_{0,k})_k$.
→ Sinkhorn-like algorithm [Benamou, Carlier, Cuturi, Nenna, Peyré 2015]; graphics applications [Solomon et al., SIGGRAPH 2015].
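A minimal sketch of such Sinkhorn-like iterations for histograms on a common grid, in the spirit of the iterative Bregman projections of Benamou et al. (the particular update order, grid, $\varepsilon$ and iteration count are illustrative, not the exact published scheme):

```python
import numpy as np

def sinkhorn_barycenter(P, lam, C, eps, n_iter=300):
    """Entropic Wasserstein barycenter of the columns of P (histograms on a common
    grid with cost matrix C), by alternating KL (Bregman) projections."""
    K = np.exp(-C / eps)
    V = np.ones_like(P)                       # one scaling v_k per input histogram
    for _ in range(n_iter):
        U = P / (K @ V)                       # u_k = p_k / (K v_k): fix first marginals
        col = V * (K.T @ U)                   # current second marginals of the pi_k
        mu = np.exp(np.log(col) @ lam)        # weighted geometric mean = common marginal
        V = V * (mu[:, None] / col)           # rescale so each pi_k has marginal mu
    return mu

n = 100
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :])**2
p1 = np.exp(-(x - 0.2)**2 / 0.002); p1 /= p1.sum()
p2 = np.exp(-(x - 0.8)**2 / 0.002); p2 /= p2.sum()
P = np.stack([p1, p2], axis=1)                # histograms as columns
bar = sinkhorn_barycenter(P, lam=np.array([0.5, 0.5]), C=C, eps=0.01)
print(bar.sum(), x[np.argmax(bar)])           # mass ~1, mode near 0.5 (entropic blur)
```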
Unbalanced Optimal Transport

[Kondratyev, Monsaingeon, Vorotnikov 2015] [Liero, Mielke, Savaré 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]
For $(\xi, \mu) \in \mathcal{M}_+(X)^2$: $\mathrm{KL}(\xi|\mu) \stackrel{\text{def.}}{=} \int_X \log\big(\frac{d\xi}{d\mu}\big)\,d\xi + \int_X (d\mu - d\xi)$.
$\mathrm{WF}_c(\mu, \nu) \stackrel{\text{def.}}{=} \min_\pi \langle c, \pi\rangle + \mathrm{KL}(P_{1\sharp}\pi | \mu) + \mathrm{KL}(P_{2\sharp}\pi | \nu)$
Proposition: if $c(x, y) = -\log\big(\cos(\min(d(x, y), \tfrac{\pi}{2}))\big)$ then $\mathrm{WF}_c^{1/2}$ is a distance on $\mathcal{M}_+(X)$.
Figure: balanced OT vs. unbalanced OT between measures with mismatched masses.
Scaling Algorithms

Generic formulation: $\min_\pi \langle d^p, \pi\rangle + f_1(P_{1\sharp}\pi) + f_2(P_{2\sharp}\pi) + \varepsilon\,\mathrm{KL}(\pi|\pi_0)$.
Dual: $\max_{u,v} -f_1^*(-u) - f_2^*(-v) - \varepsilon\,\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$, with $(a, b) \stackrel{\text{def.}}{=} (e^{u/\varepsilon}, e^{v/\varepsilon})$ and $\pi(x, y) = a(x)\,K(x, y)\,b(y)$.
Block coordinate relaxation: alternately maximize over $u$ (problem $(I_u)$) and over $v$ (problem $(I_v)$).
Proposition: defining $\mathrm{Prox}^{\mathrm{KL}}_{f/\varepsilon}(\mu) \stackrel{\text{def.}}{=} \operatorname{argmin}_\nu f(\nu) + \varepsilon\,\mathrm{KL}(\nu|\mu)$, the solutions of $(I_u)$ and $(I_v)$ read
$a = \frac{\mathrm{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(K b)}{K b}, \qquad b = \frac{\mathrm{Prox}^{\mathrm{KL}}_{f_2/\varepsilon}(K^* a)}{K^* a}.$
→ The iterations are parallelizable; on regular grids they reduce to convolutions, giving linear-time iterations.
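For the unbalanced case $f_k = \rho\,\mathrm{KL}(\cdot|p_k)$, the KL prox has a closed form and the scaling update becomes $a = (p/(Kb))^{\rho/(\rho+\varepsilon)}$ (and symmetrically for $b$). A minimal sketch under this assumption (histograms, masses, $\varepsilon$ and $\rho$ are illustrative):

```python
import numpy as np

def unbalanced_scaling(p, q, C, eps, rho, n_iter=500):
    """Scaling iterations for entropic OT with KL marginal penalties of weight rho:
    the KL prox of the Proposition gives a = (p / (K b))^(rho/(rho+eps))."""
    K = np.exp(-C / eps)
    a, b = np.ones_like(p), np.ones_like(q)
    w = rho / (rho + eps)
    for _ in range(n_iter):
        a = (p / (K @ b)) ** w
        b = (q / (K.T @ a)) ** w
    return a[:, None] * K * b[None, :]

n = 60
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :])**2
p = np.exp(-(x - 0.3)**2 / 0.005); p /= p.sum()
q = np.exp(-(x - 0.7)**2 / 0.005); q = 0.5 * q / q.sum()   # nu carries total mass 0.5
T = unbalanced_scaling(p, q, C, eps=0.01, rho=1.0)
print(T.sum(), np.abs(T.sum(1) - p).max())   # total mass adapts; marginals only approximate p, q
```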
Generative vs. Discriminative Models

• Generative model $g_\theta$: low-dimensional code $z \mapsto$ high-dimensional sample $x = g_\theta(z)$.
• Discriminative model $d_\xi$: high-dimensional $x \mapsto$ low-dimensional $z = d_\xi(x)$ (e.g. classification, $z$ = class probability).
Supervised: → learn $d_\xi$ from labeled data $(x_i, z_i)_i$.
Unsupervised: → learn $(g_\theta, d_\xi)$ from data $(x_i)_i$.
• Compression: $z = d_\xi(x)$ is a representation. Generation: $x = g_\theta(z)$ is a synthesis.
• Density fitting: $g_\theta(\{z_i\}_i) \approx \{x_i\}_i$. Auto-encoders: $g_\theta(d_\xi(x_i)) \approx x_i$.
• The optimal transport map can play the role of $d_\xi$.
Density Fitting: Maximum Likelihood

Observations: $\nu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$.
If $\mu_\theta$ has a density $d\mu_\theta(y) = f_\theta(y)\,dy$, maximum likelihood (MLE) minimizes
$\min_\theta \widehat{\mathrm{KL}}(\nu|\mu_\theta) \stackrel{\text{def.}}{=} -\sum_j \log(f_\theta(x_j)).$
Generative model fit: $\mu_\theta = g_{\theta,\sharp}\zeta$ has no density in general, so $\widehat{\mathrm{KL}}(\mu_\theta|\nu) = +\infty$ and the MLE is undefined.
→ Need a weaker metric.
Density Fitting with Weak Metrics

Density fitting: $\min_\theta D(\mu_\theta, \nu)$, with $\mu = \frac{1}{N}\sum_i \delta_{x_i}$ and $\nu = \frac{1}{P}\sum_j \delta_{y_j}$.
• Wasserstein: $W(\mu, \nu)^p \stackrel{\text{def.}}{=} \min_{T \in \mathcal{C}_{\mu,\nu}} \sum_{i,j} T_{i,j}\,\|x_i - y_j\|^p$.
• Maximum Mean Discrepancy (MMD):
$\|\mu - \nu\|_k^2 \stackrel{\text{def.}}{=} \frac{1}{N^2}\sum_{i,i'} k(x_i, x_{i'}) + \frac{1}{P^2}\sum_{j,j'} k(y_j, y_{j'}) - \frac{2}{NP}\sum_{i,j} k(x_i, y_j).$
Gaussian kernel: $k(x, y) = e^{-\|x - y\|^2/(2\sigma^2)}$; energy distance: $k(x, y) = -\|x - y\|_2$.
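In 1-D, the Wasserstein loss between two equal-size sample sets reduces to sorting, which makes the fitting criterion easy to evaluate. A toy sketch (the affine generator and parameter values are illustrative, not from the slides):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p^p between two 1-D empirical measures with the same number of samples:
    the optimal coupling matches sorted samples with sorted samples."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)

rng = np.random.default_rng(0)
data = rng.normal(1.0, 2.0, 5000)              # observations nu

# Toy generative model: g_theta(z) = theta1 + theta2 * z, z ~ N(0, 1)
def model_samples(theta, n=5000):
    return theta[0] + theta[1] * rng.normal(0.0, 1.0, n)

for theta in [(0.0, 1.0), (1.0, 1.0), (1.0, 2.0)]:
    print(theta, wasserstein_1d(model_samples(np.array(theta)), data))
    # the loss decreases as theta approaches the data parameters (1, 2)
```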
Deep Networks

• Generative: $g_\theta(z) = \rho(\theta_K(\dots \rho(\theta_2(\rho(\theta_1(z)))) \dots))$.
• Discriminative: $d_\xi(x) = \rho(\xi_K(\dots \rho(\xi_2(\rho(\xi_1(x)))) \dots))$,
where each $\theta_k$ (resp. $\xi_k$) is an affine map and $\rho$ is a pointwise nonlinearity.
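A plain-NumPy sketch of this composition (random untrained weights, layer sizes purely illustrative), just to make the notation $g_\theta(z) = \rho(\theta_K(\dots\rho(\theta_1(z))\dots))$ concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(n_in, n_out):
    """Random affine layer theta_k : z -> W z + c (illustrative initialization)."""
    return rng.normal(0, 1 / np.sqrt(n_in), (n_out, n_in)), np.zeros(n_out)

relu = lambda u: np.maximum(u, 0.0)          # pointwise nonlinearity rho

def g_theta(z, layers):
    """g_theta(z) = rho(theta_K(... rho(theta_1(z)) ...)), applied row-wise."""
    for W, c in layers:
        z = relu(z @ W.T + c)
    return z

layers = [make_layer(2, 32), make_layer(32, 32), make_layer(32, 3)]  # latent R^2 -> data R^3
z = rng.normal(0, 1, (5, 2))
print(g_theta(z, layers).shape)              # (5, 3)
```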
Gromov-Wasserstein Distance

Metric-measure spaces $(X, d_X, \mu)$ and $(Y, d_Y, \nu)$.
Def. [Mémoli 2011] [Sturm 2012]:
$GW_2^2(d_X, \mu, d_Y, \nu) \stackrel{\text{def.}}{=} \min_{\pi \in \Pi(\mu, \nu)} \int_{X^2 \times Y^2} |d_X(x, x') - d_Y(y, y')|^2\,d\pi(x, y)\,d\pi(x', y')$
→ a distance on mm-spaces up to isometries; "bending-invariant" object recognition.
→ a quadratic assignment problem (QAP): NP-hard in general → need for a fast approximate solver.
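For discrete mm-spaces the GW objective is a quadratic form in the coupling, and it can be evaluated with a few matrix products by expanding the square. A minimal sketch (symmetric distance matrices assumed; point clouds illustrative); it only evaluates the energy, it is not the approximate solver mentioned above, which typically alternates Sinkhorn projections on a linearized cost.

```python
import numpy as np

def gw_energy(C1, C2, T):
    """GW distortion sum_{i,j,k,l} (C1[i,k] - C2[j,l])^2 T[i,j] T[k,l],
    expanded into three matrix products (valid for any coupling T)."""
    p, q = T.sum(axis=1), T.sum(axis=0)       # marginals of T
    t1 = p @ (C1**2) @ p                      # sum_{i,k} C1_ik^2 p_i p_k
    t2 = q @ (C2**2) @ q                      # sum_{j,l} C2_jl^2 q_j q_l
    cross = np.sum(T * (C1 @ T @ C2.T))       # sum_{i,j} T_ij (C1 T C2^T)_ij
    return t1 + t2 - 2 * cross

rng = np.random.default_rng(0)
X = rng.random((30, 2))                               # point cloud with metric d_X
Y = X @ np.array([[0.0, -1.0], [1.0, 0.0]]).T         # rotated copy: same pairwise distances
C1 = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
C2 = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)

T_id = np.eye(30) / 30                                # coupling matching i <-> i
print(gw_energy(C1, C2, T_id))                        # ~0: a rotation is an isometry
```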
Applications of GW: Shape Analysis

• Use the optimal coupling $T$ to define registrations: shape ↔ shape, shape ↔ color distribution.
• Pipeline: shapes $(X_s)_s$ → geodesic distance matrices → pairwise GW distances → MDS visualization (in 2-D or 3-D).
• Entropic GW finds fuzzy correspondences between a source surface and a surface with similar structure, a surface with shared semantic structure, a noisy 3-D point cloud, an icon, or a hand drawing; the sharpness of the correspondence is controlled by the weight of the entropic regularizer.
Figures: MDS embedding of four classes from the SHREC dataset (teddies, humans, four-legged, armadillo); recovery of the circular structure of a galloping-horse animation sequence; mapping a set of 185 images onto two shapes while preserving color similarity.
Application to Color Transfer (J. Rabin, Wasserstein Regularization)

Sliced Wasserstein projection: project the color statistics of a source image X onto those of a style image Y.
Figure: source image X, style image Y, sliced Wasserstein projection of X onto the color statistics of Y, and the source image after color transfer.
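A minimal sketch of one sliced-Wasserstein projection scheme in this spirit (not the exact published algorithm; step size, number of directions and iteration count are illustrative, and the two color clouds are assumed to have the same number of points): project both clouds on random directions, match sorted projections, and move the source points accordingly.

```python
import numpy as np

def sliced_wasserstein_projection(X, Y, n_iter=50, n_dir=30, step=1.0, seed=0):
    """Move the point cloud X (e.g. RGB colors of the source image, one row per pixel)
    toward the distribution of Y (colors of the style image) using random 1-D projections:
    on each direction, the optimal 1-D transport matches sorted projections."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    for _ in range(n_iter):
        disp = np.zeros_like(X)
        for _ in range(n_dir):
            d = rng.normal(size=X.shape[1]); d /= np.linalg.norm(d)
            px, py = X @ d, Y @ d
            ranks = np.argsort(np.argsort(px))       # rank of each source projection
            target = np.sort(py)[ranks]              # matched target projection
            disp += np.outer(target - px, d)         # displacement along this direction
        X += step * disp / n_dir
    return X

rng = np.random.default_rng(1)
X = rng.normal(0.3, 0.05, (2000, 3))      # "source image" colors
Y = rng.normal(0.7, 0.10, (2000, 3))      # "style image" colors
Xt = sliced_wasserstein_projection(X, Y)
print(X.mean(0), Xt.mean(0), Y.mean(0))   # Xt's color statistics move toward Y's
```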