From Monge-Kantorovich to Gromov-Wasserstein - Optimal Transport and Barycenters Between Several Metric Spaces

Talk @ICML 2016, updated for the Mokatao days.

Gabriel Peyré

June 21, 2016

Transcript

  1. From Monge to Gromov-Wasserstein: Optimal Transport and Barycenters Between Several

    Metric Spaces. Gabriel Peyré, Marco Cuturi, Justin Solomon. École Normale Supérieure, Research University Paris.
  2. Comparing Measures and Spaces

    • Probability distributions and histograms → images, vision, graphics and machine learning, . . . [Figure (J. Rabin, Wasserstein Regularization): sliced Wasserstein projection of a source image (X) onto the color statistics of a style image (Y), giving the source image after color transfer.]
  3. Comparing Measures and Spaces

    • Probability distributions and histograms → images, vision, graphics and machine learning, . . . • Optimal transport [Figure: optimal transport mean vs. L2 mean of distributions.]
  4. Comparing Measures and Spaces

    • Probability distributions and histograms → images, vision, graphics and machine learning, . . . • Optimal transport for Correspondence Problems [Figure: source and target shapes, from the paper with Vladimir G. Kim (Adobe Research) and Suvrit Sra (MIT).]
  5. 0. Entropy Regularized Optimal Transport; 1. Entropy Regularized Gromov-Wasserstein;

    2. Gromov-Wasserstein Barycenters
  6. Couplings and Optimal Transport (EMD)

    Input distributions: points (x_i)_i and (y_j)_j with weights p and q. Def. Couplings: C_{p,q} = { T ∈ (R_+)^{N1×N2} ; T 1_{N2} = p, Tᵀ 1_{N1} = q }.
  7. Couplings and Optimal Transport (EMD)

    Input distributions: points (x_i)_i and (y_j)_j with weights p and q; couplings T ∈ C_{p,q}. Def. [Kantorovich 1942] Wasserstein Distance / EMD: W(p, q) = min_{T ∈ C_{p,q}} Σ_{i,j} C_{i,j} T_{i,j}, with ground cost e.g. C_{i,j} = ‖x_i − y_j‖².
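The Kantorovich problem on the slide is a finite linear program, so for small point sets it can be solved exactly with a generic LP solver. A minimal sketch in NumPy/SciPy (my own illustration, not the speaker's code; the function name `emd` and the use of `scipy.optimize.linprog` are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def emd(p, q, C):
    """Kantorovich OT / EMD: min over T in C_{p,q} of <C, T>, as a linear program.

    p (size N1) and q (size N2) are histograms, C is the N1 x N2 ground cost,
    e.g. C[i, j] = ||x_i - y_j||^2. Returns the optimal coupling T.
    """
    N1, N2 = C.shape
    # Marginal constraints: T 1_{N2} = p (row sums) and T^T 1_{N1} = q (column sums).
    A_rows = np.kron(np.eye(N1), np.ones((1, N2)))
    A_cols = np.kron(np.ones((1, N1)), np.eye(N2))
    res = linprog(C.ravel(), A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([p, q]), bounds=(0, None), method="highs")
    return res.x.reshape(N1, N2)
```

This scales poorly (the LP has N1·N2 variables), which is exactly what motivates the entropic regularization of the next slides.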
  8. Entropy and Sinkhorn Regularization

    Notation. The simplex of histograms with N bins is Σ_N := { p ∈ R_+^N ; Σ_i p_i = 1 }. The entropy of T ∈ R_+^{N×N} is H(T) := −Σ_{i,j} T_{i,j} (log(T_{i,j}) − 1). The set of couplings between histograms p ∈ Σ_{N1} and q ∈ Σ_{N2} is C_{p,q} := { T ∈ (R_+)^{N1×N2} ; T 1_{N2} = p, Tᵀ 1_{N1} = q }, where 1_N := (1, . . . , 1)ᵀ ∈ R^N. For any tensor L = (L_{i,j,k,ℓ})_{i,j,k,ℓ} and matrix T = (T_{i,j})_{i,j}, the tensor-matrix multiplication is L ⊗ T := ( Σ_{k,ℓ} L_{i,j,k,ℓ} T_{k,ℓ} )_{i,j} (eq. 1). The Gromov-Wasserstein discrepancy between two measured similarity matrices (C, p) and (C̄, q) is GW(C, C̄, p, q) := min_{T ∈ C_{p,q}} E_{C,C̄}(T), with E_{C,C̄}(T) := Σ_{i,j,k,ℓ} L(C_{i,k}, C̄_{j,ℓ}) T_{i,j} T_{k,ℓ}, where L is a loss comparing entries of the similarity matrices, e.g. the quadratic loss L(a, b) = (a − b)² or the Kullback-Leibler divergence a log(a/b) − a + b; this extends [Mémoli 2011] since C need not satisfy the triangle inequality. Def. Regularized OT [Cuturi NIPS'13]: min_{T ∈ C_{p,q}} ⟨C, T⟩ − ε H(T). [Figure: impact of ε on the solution T_ε.]
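The tensor-matrix multiplication L ⊗ T of eq. (1) is a plain contraction, one `einsum` in NumPy. A small sketch (my own illustration; function names are assumptions), also evaluating the GW energy E_{C,C̄}(T) for the quadratic loss:

```python
import numpy as np

def tensor_matrix_mult(L, T):
    """Eq. (1): (L ⊗ T)_{i,j} = sum_{k,l} L[i,j,k,l] * T[k,l]."""
    return np.einsum("ijkl,kl->ij", L, T)

def gw_energy(C, Cb, T):
    """GW energy E_{C,C̄}(T) = <L(C, C̄) ⊗ T, T> for the quadratic loss,
    where L[i,j,k,l] = (C[i,k] - C̄[j,l])^2."""
    L = (C[:, None, :, None] - Cb[None, :, None, :]) ** 2
    return float(np.sum(tensor_matrix_mult(L, T) * T))
```

Storing the full 4-way tensor is O(N1²N2²) and only viable for tiny problems; it is meant to make the indices of eq. (1) concrete.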
  9. Entropy and Sinkhorn Regularization

    Sinkhorn's fixed point algorithm [Cuturi NIPS'13], with Gibbs kernel K = e^{−C/ε}: initialization: (a, b) ← (1_{N1}, 1_{N2}); repeat: a ← p ⊘ (K b), b ← q ⊘ (Kᵀ a); until convergence; return T = diag(a) K diag(b). Only matrix/vector multiplications → parallelizable → streams well on GPU.
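The fixed point iteration on the slide can be sketched in a few lines of NumPy (a minimal version of [Cuturi NIPS'13], not the speaker's code; no log-domain stabilization, so it assumes ε is not too small relative to the cost range):

```python
import numpy as np

def sinkhorn(p, q, C, eps, n_iter=500):
    """Entropic OT: min over T in C_{p,q} of <C, T> - eps * H(T), via Sinkhorn."""
    K = np.exp(-C / eps)                     # Gibbs kernel
    a, b = np.ones_like(p), np.ones_like(q)  # initialization: (a, b) <- (1_N1, 1_N2)
    for _ in range(n_iter):                  # only matrix/vector products: GPU-friendly
        a = p / (K @ b)                      # enforce the row marginal p
        b = q / (K.T @ a)                    # enforce the column marginal q
    return a[:, None] * K * b[None, :]       # T = diag(a) K diag(b)
```

Each iteration is two matrix-vector products, which is why the method parallelizes and streams well on GPU, as the slide notes.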
  10. Generalizations [Solomon et al., SIGGRAPH 2015]

  11. Generalizations [Solomon et al., SIGGRAPH 2015] [Liero, Mielke, Savaré 2015]

    [Chizat, Schmitzer, Peyré, Vialard 2015]
  12. Generalizations [Solomon et al., SIGGRAPH 2015] [Liero, Mielke, Savaré 2015]

    [Chizat, Schmitzer, Peyré, Vialard 2015]
  13. 0. Entropy Regularized Optimal Transport; 1. Entropy Regularized Gromov-Wasserstein;

    2. Gromov-Wasserstein Barycenters
  14. Gromov-Wasserstein

    Un-registered spaces X and Y. Inputs: { (similarity/kernel matrix, histogram) } for each space.
  15. Gromov-Wasserstein

    Un-registered spaces X and Y; inputs: { (similarity/kernel matrix, histogram) }. Def. [Mémoli 2011] Gromov-Wasserstein distance: GW(C, C̄, p, q) = min_{T ∈ C_{p,q}} Σ_{i,j,k,ℓ} L(C_{i,k}, C̄_{j,ℓ}) T_{i,j} T_{k,ℓ}.
  16. Gromov-Wasserstein

    Un-registered spaces X and Y; inputs: { (similarity/kernel matrix, histogram) }. Def. [Mémoli 2011] Gromov-Wasserstein distance. → NP-hard in general. → Need for a fast approximate solver.
  17. Gromov-Wasserstein as a Metric. Def.

  18. Gromov-Wasserstein as a Metric

    Def. Isometries f : X → Y on M. Def. Equality of metric-measure spaces (⇔ existence of an isometry).
  19. Gromov-Wasserstein as a Metric

    Prop. [Mémoli 2011]: GW defines a distance on metric-measure spaces up to isometry. → "bending-invariant" object recognition.
  20. Entropic Gromov-Wasserstein. Def. Entropic Gromov-Wasserstein: min_{T ∈ C_{p,q}} E_{C,C̄}(T) − ε H(T).

  21. Entropic Gromov-Wasserstein

    Def. Entropic Gromov-Wasserstein: min_{T ∈ C_{p,q}} E_{C,C̄}(T) − ε H(T). Def. Projected mirror descent (KL geometry): T^{(ℓ+1)} ← Proj^{KL}_{C_{p,q}}( T^{(ℓ)} ⊙ e^{−τ ∇E_{C,C̄}(T^{(ℓ)})} ).
  22. Entropic Gromov-Wasserstein

    Projected mirror descent. Prop. T converges to a stationary point for τ small enough. [Figure: objective along iterations.]
  23. Sinkhorn and Entropic Gromov-Wasserstein. Projected mirror descent:

  24. Sinkhorn and Entropic Gromov-Wasserstein. Projected mirror descent: Prop.

  25. Sinkhorn and Entropic Gromov-Wasserstein

    Projected mirror descent. Prop. For τ = 1/ε, the iteration reads T ← Proj^{KL}_{C_{p,q}}( e^{−(L(C,C̄) ⊗ T)/ε} ), i.e. each step is a Sinkhorn projection (softassign [Gold, Rangarajan 1996]). Algorithm: func T = GW(C, C̄, p, q): initialization (e.g. T ← p qᵀ); repeat: run Sinkhorn on the cost L(C, C̄) ⊗ T to update T; until convergence; return T.
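The τ = 1/ε scheme above alternates a linearization of the GW energy with a Sinkhorn projection. A small dense sketch for the quadratic loss (my own illustration with assumed names; it stores the full 4-way loss tensor, O(N1²N2²) memory, so it only runs at toy sizes, and the initialization T ← p qᵀ is an assumption):

```python
import numpy as np

def entropic_gw(C, Cb, p, q, eps, n_outer=30, n_inner=200):
    """Entropic GW, quadratic loss: repeat T <- Sinkhorn projection of
    exp(-(L(C, C̄) ⊗ T) / eps) onto the coupling set C_{p,q}."""
    T = np.outer(p, q)                                    # initial coupling
    L = (C[:, None, :, None] - Cb[None, :, None, :])**2   # L[i,j,k,l] = (C_ik - C̄_jl)^2
    for _ in range(n_outer):
        cost = np.einsum("ijkl,kl->ij", L, T)             # linearized cost L(C, C̄) ⊗ T
        K = np.exp(-cost / eps)
        a, b = np.ones(len(p)), np.ones(len(q))
        for _ in range(n_inner):                          # inner Sinkhorn projection
            a = p / (K @ b)
            b = q / (K.T @ a)
        T = a[:, None] * K * b[None, :]
    return T
```

To scale up, the paper's factored evaluation of L(C, C̄) ⊗ T avoids the 4-way tensor; the sketch keeps the naive contraction for readability.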
  26. Applications of GW: Shapes Analysis

    Use T to define registration between: colors distribution ↔ shape, shape ↔ shape. [Screenshots from the correspondence paper with Vladimir G. Kim (Adobe Research) and Suvrit Sra (MIT). Figure 1: entropic GW finds correspondences between a source surface and a surface with similar structure, a surface with shared semantic structure, a noisy 3D point cloud, an icon, and a hand drawing, each as a "fuzzy" correspondence matrix whose sharpness is controlled by the entropic regularizer. Figure 18: mapping a set of 185 images onto two shapes while preserving color similarity. Also shown: a weighted GW objective (eq. 8) and a supervised-matching variant using a sparsity stencil.]
  27. Applications of GW: Shapes Analysis

    Use T to define registration between: colors distribution ↔ shape, shape ↔ shape. Pipeline: shapes (X_s)_s → geodesic distances → GW distances → MDS visualization. [Figures: MDS embeddings in 2-D and 3-D of four classes (Teddies, Humans, Four-legged, Armadillo) from the SHREC dataset; recovery of a galloping horse sequence.]
  28. Applications of GW: Quantum Chemistry

    Regression problem: estimate f (molecule → energy) by solving a regression, since the DFT approximation is too costly.
  29. Applications of GW: Quantum Chemistry

    Regression problem: estimate f (molecule → energy) by regression, since the DFT approximation is too costly. [Rupp et al 2012]: represent each molecule by its Coulomb matrix.
  30. Applications of GW: Quantum Chemistry

    Regression problem: estimate f (molecule → energy) by regression, since the DFT approximation is too costly. GW-interpolation between molecules represented by their Coulomb matrices [Rupp et al 2012].
  31. 0. Entropy Regularized Optimal Transport; 1. Entropy Regularized Gromov-Wasserstein;

    2. Gromov-Wasserstein Barycenters
  32. Gromov-Wasserstein Geodesics Def. Gromov-Wasserstein Geodesic

  33. Gromov-Wasserstein Geodesics Def. Gromov-Wasserstein Geodesic Prop. [Sturm 2012]

  34. Gromov-Wasserstein Geodesics

    Def. Gromov-Wasserstein geodesic. Prop. [Sturm 2012]. → The geodesic is supported on X × Y, which is not practical for most applications (one needs to fix the size of the geodesic embedding space). → Extension to more than 2 input spaces?
  35. Gromov-Wasserstein Barycenters. Input: spaces s = 1, 2, 3 (similarity matrices C_s, histograms p_s). Def. GW barycenters: min_C Σ_s λ_s GW(C, C_s, p, p_s).

  36. Gromov-Wasserstein Barycenters

    Alternating minimization. On T_s: for s = 1 to S, solve an entropic GW problem between the current C and (C_s, p_s). On C: closed-form update (quadratic loss). Algorithm: initialization: C ← C0; repeat: for s = 1 to S do update T_s; then update C; until convergence; return C.
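The alternating scheme on the slide can be sketched as follows (my own condensed illustration with assumed names; the C-update C ← (Σ_s λ_s T_s C_s T_sᵀ) / (p pᵀ) is the quadratic-loss closed form, and the inner solver is the entropic GW fixed point from the earlier slides, restated here so the sketch is self-contained):

```python
import numpy as np

def _entropic_gw(C, Cb, p, q, eps, n_outer=30, n_inner=200):
    """Step on T_s: entropic GW coupling between (C, p) and (C̄, q), quadratic loss."""
    T = np.outer(p, q)
    L = (C[:, None, :, None] - Cb[None, :, None, :]) ** 2
    for _ in range(n_outer):
        K = np.exp(-np.einsum("ijkl,kl->ij", L, T) / eps)  # Gibbs kernel on L(C,C̄) ⊗ T
        a, b = np.ones(len(p)), np.ones(len(q))
        for _ in range(n_inner):                           # Sinkhorn projection
            a = p / (K @ b)
            b = q / (K.T @ a)
        T = a[:, None] * K * b[None, :]
    return T

def gw_barycenter(Cs, ps, lambdas, p, eps, n_iter=5, seed=0):
    """Alternating minimization: on T_s, entropic GW couplings; on C, the
    quadratic-loss closed-form update C <- (sum_s lambda_s T_s C_s T_s^T) / (p p^T)."""
    rng = np.random.default_rng(seed)
    N = len(p)
    C0 = rng.random((N, N))
    C = (C0 + C0.T) / 2                                    # initialization: C <- C0
    for _ in range(n_iter):
        Ts = [_entropic_gw(C, Cb, p, q, eps) for Cb, q in zip(Cs, ps)]
        C = sum(lam * T @ Cb @ T.T
                for lam, T, Cb in zip(lambdas, Ts, Cs)) / np.outer(p, p)
    return C
```

The symmetric random initialization C0 is an arbitrary choice for the sketch; any symmetric seed matrix of the right size would do.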
  37. GW Barycenters of Shapes

    Shapes (X_s)_s, Euclidean distances, interpolation parameter t ∈ [0, 1]. [Screenshot of the paper, §4.1 "Point Clouds": 500 point clouds of handwritten digits from (LeCun et al., 1998), arbitrarily rotated in the plane; each digit is represented by a symmetric Euclidean distance matrix and a 500 × 500 barycenter is optimized with Algorithm 1 (uniform weights, ε = 1 × 10⁻³). Figure 2: barycenter example for shape data from (Thakoor et al., 2007). A proof fragment shows the barycenter update preserves infinitely divisible kernels.]
  38. GW Barycenters of Shapes: geodesic vs. Euclidean distances, t ∈ [0, 1].

  39. Conclusion. Optimal transport: registered spaces. Gromov-Wasserstein: un-registered spaces.

  40. Conclusion. Optimal transport: registered spaces. Gromov-Wasserstein: un-registered spaces.

    Entropy: makes the problem tractable.
  41. Conclusion. Optimal transport: registered spaces. Gromov-Wasserstein: un-registered spaces.

    Entropy: makes the problem tractable. Entropy: surprisingly effective! (GW highly non-convex)
  42. Conclusion. Optimal transport: registered spaces. Gromov-Wasserstein: un-registered spaces.

    Entropy: makes the problem tractable. Entropy: surprisingly effective! (GW highly non-convex). Applications: unregistered data, quantum chemistry [Rupp et al 2012], shapes. [Figures: MDS embeddings and the SHREC shape-retrieval database.]
  43. Conclusion. Optimal transport: registered spaces. Gromov-Wasserstein: un-registered spaces.

    Entropy: makes the problem tractable. Entropy: surprisingly effective! (GW highly non-convex). Applications: unregistered data, quantum chemistry [Rupp et al 2012], shapes. Future works: theoretical analysis of entropic GW; large scale applications.