
Numerical Optimal Transport

Gabriel Peyré
December 07, 2017

Slides for a course.

Transcript

  1. Numerical
    Optimal Transport
    Gabriel Peyré
    École Normale Supérieure
    Outline
    • What is the Lasso
    • Lasso with an orthogonal design
    • From projected gradient to proximal gradient
    • Optimality conditions and subgradients (LARS)
    • Coordinate descent algorithm
    … with some demos
    www.numerical-tours.com
    http://optimaltransport.github.io

  2. Overview
    • Measures and Histograms
    • From Monge to Kantorovitch Formulations
    • Entropic Regularization and Sinkhorn
    • Barycenters
    • Unbalanced OT and Gradient Flows
    • Minimum Kantorovitch Estimators
    • Gromov-Wasserstein


  3. Comparing Measures and Spaces
    • Probability distributions and histograms
    → images, vision, graphics and machine learning, ...
    [Figure (J. Rabin, "Wasserstein Regularization"): source image X, style image Y, sliced Wasserstein projection of X onto the color statistics of Y, and the source image after color transfer]

  4. Comparing Measures and Spaces
    • Probability distributions and histograms
    • Optimal transport
    → images, vision, graphics and machine learning, ...
    [Same color-transfer figure as slide 3, plus a comparison of the optimal transport mean vs. the L2 mean]

  5. Probability Measures
    Positive Radon measure µ on a set X.
    Discrete measure: $\mu = \sum_i \mu_i \delta_{x_i}$.    Density: $d\mu(x) = m(x)\,dx$.
    Measure of sets $A \subset X$: $\mu(A) = \int_A d\mu(x) \geq 0$.
    [Figure 2.1 (from the lecture notes): discrete distributions $\alpha = \sum_{i=1}^n a_i \delta_{x_i}$ (red: empirical uniform weights $a_i = 1/n$, blue: arbitrary weights) and densities $d\alpha(x) = \rho_\alpha(x)\,dx$ (violet), in 1-D and 2-D; 1-D discrete distributions drawn as vertical segments of length $a_i$, 2-D ones as point clouds with radius $a_i$]

  6. Probability Measures
    Positive Radon measure µ on a set X.
    Discrete measure: $\mu = \sum_i \mu_i \delta_{x_i}$, with $\int_X g\,d\mu = \sum_i \mu_i\,g(x_i)$.
    Density: $d\mu(x) = m(x)\,dx$, with $\int_X g\,d\mu = \int_X g(x)\,m(x)\,dx$.
    Integration against continuous functions: $\int_X g(x)\,d\mu(x) \geq 0$.
    Measure of sets $A \subset X$: $\mu(A) = \int_A d\mu(x) \geq 0$.
    [Same Figure 2.1 as slide 5]

  7. Probability Measures
    As on slide 6, plus:
    Probability (normalized) measure: $\mu(X) = \int_X d\mu(x) = 1$.
    [Same Figure 2.1 as slide 5]

  8. Measures and Random Variables
    Random vectors ↔ Radon measures: $P(X \in A) = \int_A d\mu(x)$.
    Convergence in law: $P(X_n \in A) \to P(X \in A)$ as $n \to +\infty$, for every set $A$.
    Weak* convergence: $\int f\,d\mu_n \to \int f\,d\mu$ as $n \to +\infty$, for every continuous function $f$.

  9. Measures and Random Variables
    As on slide 8, plus: weak convergence ...

  10. Discretization: Histogram vs. Empirical
    Discrete measure: $\mu = \sum_{i=1}^N \mu_i \delta_{x_i}$, with $x_i \in X$ and $\sum_i \mu_i = 1$.
    Lagrangian (point clouds): constant weights $\mu_i = 1/N$, free positions $x_i$; quotient space $X^N / \Sigma_N$.
    Eulerian (histograms): fixed positions $x_i$ (e.g. a grid), free weights in the convex polytope (simplex) $\{(\mu_i)_i \geq 0\ ;\ \sum_i \mu_i = 1\}$.

  11. Push Forward
    Radon measures $(\mu, \nu)$ on $(X, Y)$.
    Transfer of measure by $f : X \to Y$: the push-forward $\nu = f_\sharp\mu$ is defined by
    $\nu(A) := \mu(f^{-1}(A)) \iff \int_Y g(y)\,d\nu(y) := \int_X g(f(x))\,d\mu(x)$.
    [Figure: $y = f(x)$ maps $\mu$ on $X$ to $\nu$ on $Y$; a set $A$ and its preimage $f^{-1}(A)$]

  12. Push Forward
    As on slide 11, plus: for smooth densities $d\mu = \rho(x)\,dx$ and $d\nu = \xi(x)\,dx$,
    $f_\sharp\mu = \nu \iff \xi(f(x))\,|\det(\partial f(x))| = \rho(x)$.

  13. Push-forward vs. Pull-back
    For $f : X \to Y$:
    Measures, push-forward: $f_\sharp : \mathcal{M}(X) \to \mathcal{M}(Y)$, with $f_\sharp\big(\sum_i \delta_{x_i}\big) := \sum_i \delta_{f(x_i)}$.
    Functions, pull-back: $f^\sharp : \mathcal{C}(Y) \to \mathcal{C}(X)$, with $f^\sharp\varphi := \varphi \circ f$.
    Remark: $f_\sharp$ and $f^\sharp$ are adjoints: $\int_Y \varphi\,d(f_\sharp\mu) = \int_X (f^\sharp\varphi)\,d\mu$.
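For discrete measures the push-forward simply moves the support points. A minimal numpy sketch (my own illustration, not code from the slides; the map f, the test function and the data are arbitrary choices) checking the adjunction of push-forward and pull-back numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)        # support points x_i of a discrete measure mu
w = np.full(100, 1.0 / 100)         # weights of mu

f = lambda t: t ** 2 + 1.0          # a map f : X -> Y (illustrative)
phi = np.cos                        # a test function on Y (illustrative)

y = f(x)                            # support of f#mu (weights are unchanged)
push = np.sum(w * phi(y))           # integral of phi against f#mu
pull = np.sum(w * phi(f(x)))        # integral of the pull-back phi∘f against mu
assert np.isclose(push, pull)       # <phi, f#mu> = <phi∘f, mu>
```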

  14. Convergence of Random Variables
    Almost sure: $P(\lim_{n\to+\infty} X_n = X) = 1$.
    In mean: $\lim_{n\to+\infty} E(|X_n - X|^p) = 0$.
    In probability: $\forall\,\varepsilon > 0,\ P(|X_n - X| > \varepsilon) \to 0$.
    In law: $P(X_n \in A) \to P(X \in A)$   (the $X_n$ can be defined on different spaces).
    Almost sure ⟹ in probability ⟹ in law,  and  in mean ⟹ in probability.

  15. Overview
    • Measures and Histograms
    • From Monge to Kantorovitch Formulations
    • Entropic Regularization and Sinkhorn
    • Barycenters
    • Unbalanced OT and Gradient Flows
    • Minimum Kantorovitch Estimators
    • Gromov-Wasserstein


  16. Gaspard Monge (1746-1818)
    (1784)


  17. Monge Transport
    $\min_{\nu = f_\sharp\mu} \int_X c(x, f(x))\,d\mu(x)$
    [Figure: a map $f$ transporting $\mu$ on $X$ to $\nu$ on $Y$]

  18. Monge Transport
    $\min_{\nu = f_\sharp\mu} \int_X c(x, f(x))\,d\mu(x)$
    Theorem [Brenier]: for $c(x, y) = \|x - y\|^2$ and $(\mu, \nu)$ with densities,
    there exists a unique optimal $f$. One has $f = \nabla\varphi$, where $\varphi$ is the
    unique convex function such that $(\nabla\varphi)_\sharp\mu = \nu$.

  19. Monge Transport
    As on slide 18, plus the Monge-Ampère equation: $\xi(\nabla\varphi)\,\det(\partial^2\varphi) = \rho$,
    where $d\mu = \rho(x)\,dx$ and $d\nu = \xi(x)\,dx$.

  20. Monge Transport
    As on slide 19, plus non-uniqueness / non-existence:
    [Figure: examples where the optimal Monge map is not unique (two maps $f$, $f'$ with the same cost) or does not exist]

  21. Leonid Kantorovich (1912-1986)
    Леонид Витальевич Канторович


  22. Optimal Transport was formulated in 1930 by A.N. Tolstoi,
    12 years before Kantorovich. He even solved a "large
    scale" 10×68 instance!
    Before Kantorovitch


  23. Kantorovitch's Formulation
    Points $(x_i)_i$, $(y_j)_j$; input distributions $\mu = \sum_i \mu_i \delta_{x_i}$, $\nu = \sum_j \nu_j \delta_{y_j}$.
    Def. Couplings: matrices $T \geq 0$ whose rows sum to $(\mu_i)_i$ and whose columns sum to $(\nu_j)_j$.

  24. Kantorovitch's Formulation
    As on slide 23, plus:
    Def. Wasserstein distance / EMD [Kantorovich 1942]: the minimum of $\sum_{i,j} T_{i,j}\,c(x_i, y_j)$ over the couplings $T$.

  25. OT Between General Measures
    Marginals: $P_{1\sharp}\pi(S) := \pi(S, X)$,  $P_{2\sharp}\pi(S) := \pi(X, S)$.
    Couplings: $\Pi(\mu, \nu) := \{\pi \in \mathcal{M}_+(X \times X)\ ;\ P_{1\sharp}\pi = \mu,\ P_{2\sharp}\pi = \nu\}$.
    Optimal transport [Kantorovitch 1942]:
    $W_p^p(\mu, \nu) := \min\big\{ \langle d^p, \pi\rangle = \int_{X\times X} d(x, y)^p\,d\pi(x, y)\ ;\ \pi \in \Pi(\mu, \nu) \big\}$
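In the discrete case this is a finite linear program over the coupling matrix. A minimal sketch (my own illustration; the problem sizes, cost and the scipy solver choice are assumptions, not from the slides):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 5, 7
x = rng.standard_normal((n, 2))            # support of mu
y = rng.standard_normal((m, 2))            # support of nu
p = np.full(n, 1.0 / n)                    # weights of mu
q = np.full(m, 1.0 / m)                    # weights of nu
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** 2   # cost c(x_i, y_j)

# Marginal constraints on the flattened coupling T (row-major ordering).
A_rows = np.kron(np.eye(n), np.ones((1, m)))   # sum_j T_ij = p_i
A_cols = np.kron(np.ones((1, n)), np.eye(m))   # sum_i T_ij = q_j
A_eq = np.vstack([A_rows, A_cols])
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
T = res.x.reshape(n, m)                    # optimal coupling
W2_squared = res.fun                       # value of W_2^2(mu, nu)
```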

  26. Couplings: the 3 Settings
    [Figure: couplings in the discrete, continuous and semi-discrete settings]

  27. Couplings
    [Figure: examples of couplings]

  28. 1-D Optimal Transport
    Remark. If $\Omega = \mathbb{R}$, $c(x, y) = c(|x - y|)$ with $c$ convex, and $F_\mu^{-1}$, $F_\nu^{-1}$ are the quantile functions, then
    $W(\mu, \nu) = \int_0^1 c\big(|F_\mu^{-1}(x) - F_\nu^{-1}(x)|\big)\,dx$
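For empirical measures with the same number of uniformly weighted points, the quantile functions reduce to sorted samples. A small numpy sketch (sample sizes and the Gaussian test distributions are illustrative assumptions):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p^p between two 1-D empirical measures with n uniformly weighted points each."""
    xs, ys = np.sort(x), np.sort(y)       # discrete quantile functions F_mu^{-1}, F_nu^{-1}
    return np.mean(np.abs(xs - ys) ** p)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)             # samples of N(0, 1)
y = 2.0 + 0.5 * rng.standard_normal(1000) # samples of N(2, 0.25)
print(wasserstein_1d(x, y, p=2))          # ≈ 2^2 + (1 - 0.5)^2 = 4.25 for these Gaussians
```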

  29. OT Between Gaussians
    Remark. If $\Omega = \mathbb{R}^d$, $c(x, y) = \|x - y\|^2$, and $\mu = \mathcal{N}(m_\mu, \Sigma_\mu)$, $\nu = \mathcal{N}(m_\nu, \Sigma_\nu)$, then
    $W_2^2(\mu, \nu) = \|m_\mu - m_\nu\|^2 + B(\Sigma_\mu, \Sigma_\nu)^2$,
    where $B$ is the Bures metric $B(\Sigma_\mu, \Sigma_\nu)^2 = \mathrm{trace}\big(\Sigma_\mu + \Sigma_\nu - 2(\Sigma_\mu^{1/2}\Sigma_\nu\Sigma_\mu^{1/2})^{1/2}\big)$.
    The map $T : x \mapsto m_\nu + A(x - m_\mu)$ is optimal, where $A = \Sigma_\mu^{-1/2}\big(\Sigma_\mu^{1/2}\Sigma_\nu\Sigma_\mu^{1/2}\big)^{1/2}\Sigma_\mu^{-1/2}$.
    [Figure: 1-D Gaussian densities $\rho_\alpha$, $\rho_\beta$ of the form $\tfrac{1}{\sqrt{2\pi}\sigma}e^{-(x-m)^2/(2\sigma^2)}$ and the optimal map between them]
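A sketch of this closed form using scipy's matrix square root (the Gaussian parameters below are illustrative; real code may need symmetrization for numerical stability):

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m1, S1, m2, S2):
    """W_2^2 between N(m1, S1) and N(m2, S2): ||m1 - m2||^2 + Bures(S1, S2)^2."""
    S1h = sqrtm(S1)
    cross = sqrtm(S1h @ S2 @ S1h)
    bures2 = np.real(np.trace(S1 + S2 - 2.0 * cross))
    return np.sum((m1 - m2) ** 2) + bures2

def monge_map(m1, S1, m2, S2):
    """Optimal affine map T(x) = m2 + A (x - m1), with A as on the slide."""
    S1h = sqrtm(S1)
    S1h_inv = np.linalg.inv(S1h)
    A = np.real(S1h_inv @ sqrtm(S1h @ S2 @ S1h) @ S1h_inv)
    return lambda x: m2 + (x - m1) @ A.T

m1, S1 = np.zeros(2), np.array([[1.0, 0.2], [0.2, 0.5]])
m2, S2 = np.ones(2), np.array([[0.8, -0.1], [-0.1, 1.2]])
print(gaussian_w2_squared(m1, S1, m2, S2))
```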

  30. OT on Gaussians and Bures' Distance


  31. W1 OT and Min-cost Flows


  32. Metrics on the Space of Measures
    Bins-to-bins metrics, for $d\mu(x) = \rho(x)\,dx$ and $d\tilde\mu(x) = \tilde\rho(x)\,dx$:
    Kullback-Leibler divergence: $D_{KL}(\mu, \tilde\mu) = \int \rho(x)\,\log\frac{\rho(x)}{\tilde\rho(x)}\,dx$
    Hellinger distance: $D_H(\mu, \tilde\mu)^2 = \int \big(\sqrt{\rho(x)} - \sqrt{\tilde\rho(x)}\big)^2\,dx$

  33. Metrics on the Space of Measures
    As on slide 32, plus the translated measure $d\mu_\tau(x) = \rho(x - \tau)\,dx$.
    Effect of translation: $D(\mu, \mu_\tau) \approx \mathrm{cst}$, whereas $W_2(\mu, \mu_\tau) = \tau$.

  34. Csiszár Divergences vs. Dual Norms
    Csiszár divergences: $D_\varphi(\alpha|\beta) := \int_X \varphi\big(\tfrac{d\alpha}{d\beta}\big)\,d\beta$
    → KL, TV, $\chi^2$, Hellinger, ...  (strong topology)
    Dual norms: $\|\alpha - \beta\|_B := \max_{f \in B} \int_X f(x)\,(d\alpha(x) - d\beta(x))$
    → $W_1$, flat norm, RKHS, energy distance, ...  (weak topology)

  35. Csiszár Divergences
    Csiszár divergences: a unifying way to define losses between arbitrary positive measures (discrete & densities).
    https://en.wikipedia.org/wiki/F-divergence

  36. Dual Norms
    Dual norms (aka Integral Probability Metrics):
    $\|\alpha - \beta\|_B := \max\big\{ \int_X f(x)\,(d\alpha(x) - d\beta(x))\ ;\ f \in B \big\}$
    Wasserstein-1: $B = \{f\ ;\ \|\nabla f\|_\infty \leq 1\}$.
    Flat norm: $B = \{f\ ;\ \|f\|_\infty \leq 1,\ \|\nabla f\|_\infty \leq 1\}$.
    RKHS: $B = \{f\ ;\ \|f\|_k \leq 1\}$, for which
    $\|\alpha - \beta\|_B^2 = \int k(x, x')\,d\alpha(x)\,d\alpha(x') + \int k(y, y')\,d\beta(y)\,d\beta(y') - 2\int k(x, y)\,d\alpha(x)\,d\beta(y)$.
    Energy distance: $k(x, y) = -\|x - y\|$.    Gaussian: $k(x, y) = e^{-\|x-y\|^2/(2\sigma^2)}$.
    [Plot: these norms between a measure and its translate by $t$ (Energy, Gauss, $W_1$, Flat), as functions of $t$]

  37. RKHS Norms aka Maximum Mean Discrepancy

  38. The Earth Mover's Distance
    $\mathrm{dist}(I_1, I_2) = W_1(\mu, \nu)$   [Rubner'98]
    [Figure: two images $I_1$, $I_2$ and their distributions $\mu$, $\nu$]

  39. The Word Mover's Distance
    $\mathrm{dist}(D_1, D_2) = W_2(\mu, \nu)$   [Kusner'15]
    [Figure: two documents $D_1$, $D_2$ and their distributions $\mu$, $\nu$]

  40. Overview
    • Measures and Histograms
    • From Monge to Kantorovitch Formulations
    • Linear Programming and Semi-discrete
    • Entropic Regularization and Sinkhorn
    • Barycenters
    • Unbalanced OT and Gradient Flows
    • Minimum Kantorovitch Estimators
    • Gromov-Wasserstein


  41. Algorithms
    Linear programming: $\mu = \sum_{i=1}^{N_1} p_i \delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} p_j \delta_{y_j}$.

  42. Algorithms
    Linear programming: $\mu = \sum_{i=1}^{N_1} p_i \delta_{x_i}$, $\nu = \sum_{j=1}^{N_2} p_j \delta_{y_j}$   ~ $O(N^3)$.
    Hungarian/Auction: $\mu = \tfrac{1}{N}\sum_{i=1}^N \delta_{x_i}$, $\nu = \tfrac{1}{N}\sum_{j=1}^N \delta_{y_j}$.

  43. Algorithms
    As on slide 42.
    [Figure]

  44. Algorithms
    As on slide 42, plus: Monge-Ampère / Benamou-Brenier, $d = \|\cdot\|_2^2$.
    [Pasted excerpt: in this context the network simplex behaves in $O(n^2)$ in practice, a major gain at the scale of thousands of particles over the transport simplex used by earlier EMD image retrieval and color transfer work (Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003); fixed-point precision further speeds it up, making the final cost less accurate but leaving the particle pairing (hence the interpolation) unchanged. Figure 6: log-log running times of the network simplex (fixed point / double precision) vs. the transport simplex, against $y = \alpha x^2$ and $y = \beta x^3$. Figure 7: synthetic 2-D displacement interpolations for $\alpha = 1/4, 1/2, 3/4$.]

  45. Algorithms
    As on slide 44, plus: semi-discrete, Laguerre cells, $d = \|\cdot\|_2^2$   [Levy'15].

  46. Algorithms
    As on slide 45.

  47. Algorithms
    As on slide 45, plus: need for fast approximate algorithms for generic $c$.
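In the Hungarian/auction setting above (both measures uniform over the same number of points) the optimal coupling is a permutation. A quick sketch using scipy's assignment solver (the point clouds and sizes are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N = 200
x = rng.standard_normal((N, 2))
y = rng.standard_normal((N, 2)) + np.array([3.0, 0.0])
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** 2

rows, cols = linear_sum_assignment(C)       # optimal permutation sigma
W2_squared = C[rows, cols].mean()           # (1/N) sum_i ||x_i - y_sigma(i)||^2
```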

  48. C-Transform
    Alternate maximization on either $f$ or $g$ leads to the notion of c-transform:
    $\forall\,y \in Y,\ f^c(y) := \inf_{x\in X}\ c(x, y) - f(x)$,
    $\forall\,x \in X,\ g^{\bar c}(x) := \inf_{y\in Y}\ c(x, y) - g(y)$,   where $\bar c(y, x) := c(x, y)$.
    Indeed $f^c \in \operatorname{argmax}_g E(f, g)$ and $g^{\bar c} \in \operatorname{argmax}_f E(f, g)$: these partial maximizations improve the dual objective $E$ and extend the potentials to the whole spaces $X$ and $Y$. Functions of the form $f^c$, $g^{\bar c}$ are called c-concave; for $c(x, y) = \langle x, y\rangle$ on $X = Y = \mathbb{R}^d$ this coincides with the usual notion of concavity, and for $c(x, y) = \|x - y\|^p$ the c-transform is an inf-convolution ("Hopf-Lax formula").
    The dual can then be written as an unconstrained program over a single potential:
    $L_c(\alpha, \beta) = \max_{f\in\mathcal{C}(X)} \int_X f\,d\alpha + \int_Y f^c\,d\beta = \max_{g\in\mathcal{C}(Y)} \int_X g^{\bar c}\,d\alpha + \int_Y g\,d\beta$.
    Semi-discrete formulation: if $\beta = \sum_j b_j \delta_{y_j}$ is discrete, restrict the minimization to the support $(y_j)_j$:
    $\forall\,g \in \mathbb{R}^m,\ \forall\,x \in X,\ g^{\bar c}(x) := \min_{j} c(x, y_j) - g_j$.
    [Figure 5.1: semi-discrete $\bar c$-transforms $g^{\bar c}$ in 1-D, for ground costs with $p = 1/2, 1, 3/2, 2$]

  49. Semi-discrete Descent Algorithm
    [Figure 5.2: iterations ℓ = 1, 3, 50, 100 of a simple gradient descent minimizing the semi-discrete dual (5.8); the red points are the support $(y_j)_j$ of the discrete measure $\beta$, $\alpha$ is the uniform measure on a square, and the blue cells show the Laguerre partition $(L_{g^{(\ell)}}(y_j))_j$ for the dual potential $g^{(\ell)}$ at iteration ℓ, together with the final matching]

  50. Semi-discrete Stochastic Descent
    Stochastic gradient descent for semi-discrete optimal transport: convergence of $\log(E(g^\star) - E(g^{(\ell)}))$ along iterations ℓ, and the corresponding Laguerre cells.
    https://arxiv.org/abs/1605.08527
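A minimal sketch of stochastic gradient ascent on the semi-discrete dual E(g) = E_{x~α}[min_j c(x, y_j) − g_j] + Σ_j g_j b_j, in the spirit of the two previous slides. Taking α uniform on [0,1]² and a 1/√ℓ step size are my assumptions for the demo, not choices from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
y = rng.random((m, 2))                 # support of the discrete measure beta
b = np.full(m, 1.0 / m)                # its weights
g = np.zeros(m)                        # dual potential

for ell in range(1, 20001):
    x = rng.random(2)                                  # sample x ~ alpha (uniform on the square)
    j = np.argmin(np.sum((x - y) ** 2, axis=1) - g)    # index of the Laguerre cell containing x
    grad = b.copy()
    grad[j] -= 1.0                                     # stochastic gradient of E at g
    g += (1.0 / np.sqrt(ell)) * grad                   # ascent step with decaying step size
```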

  51. Entropic Regularization
    The simplex of histograms with $N$ bins: $\Sigma_N := \{p \in \mathbb{R}_+^N\ ;\ \sum_i p_i = 1\}$.
    Entropy of $T \in \mathbb{R}_+^{N\times N}$: $H(T) := -\sum_{i,j=1}^N T_{i,j}\,(\log(T_{i,j}) - 1)$.
    Couplings between histograms $p \in \Sigma_{N_1}$ and $q \in \Sigma_{N_2}$:
    $C_{p,q} := \{T \in (\mathbb{R}_+)^{N_1\times N_2}\ ;\ T\mathbb{1}_{N_2} = p,\ T^\top\mathbb{1}_{N_1} = q\}$, where $\mathbb{1}_N := (1,\ldots,1)^\top \in \mathbb{R}^N$.
    Tensor-matrix multiplication (used later for Gromov-Wasserstein): $\mathcal{L} \otimes T := \big(\sum_{k,\ell} \mathcal{L}_{i,j,k,\ell}\,T_{k,\ell}\big)_{i,j}$.
    Def. Regularized OT [Cuturi NIPS'13]: minimize $\langle c, T\rangle - \varepsilon H(T)$ over $T \in C_{p,q}$.

  52. Entropic Regularization
    As on slide 51, plus the impact of the regularization on the solution:
    [Figure: the cost $c$ and the regularized couplings $T_\varepsilon$ for increasing $\varepsilon$]

  53. Overview
    • Measures and Histograms
    • From Monge to Kantorovitch Formulations
    • Entropic Regularization and Sinkhorn
    • Barycenters
    • Unbalanced OT and Gradient Flows
    • Minimum Kantorovitch Estimators
    • Gromov-Wasserstein


  54. Sinkhorn’s Algorithm

  55. Sinkhorn’s Algorithm

  56. Sinkhorn’s Algorithm
    Only matrix/vector multiplications. → Parallelizable. → Streams well on GPU.

  57. Sinkhorn’s Algorithm
    Only matrix/vector multiplications. → Parallelizable. → Streams well on GPU.
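A minimal numpy sketch of the Sinkhorn iterations (a ← µ/(Kb), b ← ν/(Kᵀa), coupling diag(a) K diag(b), as written out on the later "Back to Sinkhorn's Algorithm" slides); the toy data, ε and iteration count are illustrative assumptions:

```python
import numpy as np

def sinkhorn(p, q, C, eps, n_iter=1000):
    K = np.exp(-C / eps)                   # Gibbs kernel
    b = np.ones_like(q)
    for _ in range(n_iter):
        a = p / (K @ b)
        b = q / (K.T @ a)
    return a[:, None] * K * b[None, :]     # coupling T = diag(a) K diag(b)

rng = np.random.default_rng(0)
n, m = 50, 60
x, y = rng.standard_normal((n, 1)), rng.standard_normal((m, 1)) + 2.0
C = (x - y.T) ** 2
p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
T = sinkhorn(p, q, C, eps=0.05)
print((T * C).sum())                       # entropic approximation of W_2^2
```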

  58. Optimal Transport on Surfaces
    Triangulated mesh M. Geodesic distance dM .


  59. Optimal Transport on Surfaces
    Triangulated mesh M. Geodesic distance $d_M$.
    Ground cost: $c(x, y) = d_M(x, y)^\alpha$.
    [Figure: a source point $x_i$ and the level sets of $d_M(x_i, \cdot)$ on the mesh]

  60. Optimal Transport on Surfaces
    As on slide 59, plus: computing $c$ (Fast-Marching) costs $N^2\log(N)$ → too costly.

  61. Entropic Transport on Surfaces
    Heat equation on M: $\partial_t u_t(x, \cdot) = \Delta_M u_t(x, \cdot)$,  $u_{t=0}(x, \cdot) = \delta_x$.

  62. Entropic Transport on Surfaces
    Heat equation on M: $\partial_t u_t(x, \cdot) = \Delta_M u_t(x, \cdot)$,  $u_{t=0}(x, \cdot) = \delta_x$.
    Theorem [Varadhan]: $-\varepsilon\log(u_\varepsilon) \to d_M^2$ as $\varepsilon \to 0$.

  63. Entropic Transport on Surfaces
    As on slide 62, plus the Sinkhorn kernel: $K := e^{-d_M^2/\varepsilon} \approx u_\varepsilon \approx \big(\mathrm{Id} - \tfrac{\varepsilon}{\ell}\Delta_M\big)^{-\ell}$.
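The practical point is that applying K is a heat-kernel smoothing, not an N×N matrix product. A sketch on a regular 2-D grid, using a Gaussian blur as a stand-in for the heat kernel (this grid setup and the use of scipy.ndimage.gaussian_filter are my assumptions, not the mesh implementation of the slides):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sinkhorn_blur(p, q, sigma, n_iter=500):
    """Sinkhorn on a 2-D grid where the kernel K is applied as a Gaussian blur."""
    apply_K = lambda v: gaussian_filter(v, sigma, mode="constant")
    b = np.ones_like(q)
    for _ in range(n_iter):
        a = p / np.maximum(apply_K(b), 1e-30)
        b = q / np.maximum(apply_K(a), 1e-30)
    return a, b       # scalings of the implicit coupling, whose marginals are p and q

n = 128
p = np.zeros((n, n)); q = np.zeros((n, n))
p[30:40, 30:40] = 1.0;  p /= p.sum()          # small square blob
q[80:100, 70:110] = 1.0; q /= q.sum()         # larger rectangular blob
a, b = sinkhorn_blur(p, q, sigma=4.0)
```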

  64. MRI Data Processing [with A. Gramfort]
    Ground cost $c = d_M$: geodesic distance on the cortical surface M.
    [Figure: L2 barycenter vs. $W_2^2$ barycenter of the data]

  65. Regularization for General Measures
    Schrödinger's problem:
    $\pi_\varepsilon := \operatorname{argmin}\,\{\langle d^p, \pi\rangle + \varepsilon\,\mathrm{KL}(\pi|\pi_0)\ ;\ \pi \in \Pi(\mu, \nu)\}$
    $\pi_\varepsilon = \operatorname{argmin}_{\pi\in\Pi(\mu,\nu)} \mathrm{KL}(\pi|K)$,  where $K(x, y) := e^{-d^p(x,y)/\varepsilon}\,\pi_0(x, y)$.
    Landmark computational paper: [Cuturi 2013].

  66. Regularization for General Measures
    As on slide 65, plus:
    Proposition [Carlier, Duval, Peyré, Schmitzer 2015]:
    $\pi_\varepsilon \to \operatorname{argmin}_{\pi\in\Pi(\mu,\nu)} \langle d^p, \pi\rangle$ as $\varepsilon \to 0$,  and  $\pi_\varepsilon \to \mu(x)\,\nu(y)$ as $\varepsilon \to +\infty$.
    [Figure: the couplings $\pi_\varepsilon$ for increasing $\varepsilon$]

  67. Back to Sinkhorn's Algorithm
    Optimal transport problem: $f_1 = \iota_{\{\mu\}}$, $f_2 = \iota_{\{\nu\}}$, so that
    $\operatorname{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(\cdot) = \mu$  and  $\operatorname{Prox}^{\mathrm{KL}}_{f_2/\varepsilon}(\cdot) = \nu$.

  68. Back to Sinkhorn's Algorithm
    As on slide 67, plus the Sinkhorn/IPFP algorithm [Sinkhorn 1967][Deming, Stephan 1940]:
    $a^{(\ell+1)} := \dfrac{\mu}{K b^{(\ell)}}$   and   $b^{(\ell+1)} := \dfrac{\nu}{K^* a^{(\ell+1)}}$
    [Figure: the iterates $\pi^{(\ell)}$ for a fixed $\varepsilon$]

  69. Back to Sinkhorn's Algorithm
    As on slide 68, plus, with $\pi^{(\ell)} := \operatorname{diag}(a^{(\ell)})\,K\,\operatorname{diag}(b^{(\ell)})$:
    Proposition [Franklin, Lorenz 1989]: $\|\log(\pi^{(\ell)}) - \log(\pi^\star)\|_\infty = O((1-\lambda)^\ell)$, with $\lambda$ depending on $c$ and $\varepsilon$.
    Local rate: [Knight 2008].
    [Plot: $\log(\|\log(\pi^{(\ell)}) - \log(\pi^\star)\|_\infty)$ versus the iteration ℓ]

  70. Semi-discrete OT and Entropy
    Semi-discrete optimal transport is a multiclass SVM; its Sinkhorn (entropic) version is a logistic regression. It decomposes space into entropic Laguerre diagrams.
    Laguerre cells vs. "Sinkhorn" (smoothed) Laguerre cells.
    [Figure 5.3: entropic semi-discrete $\bar c$-transforms $g^{\bar c,\varepsilon}$ for $\varepsilon = 0, 0.01, 0.1, 0.3$. Top: 1-D, ground cost $c(x, y) = |x - y|$, red points at $(y_j, -g_j)$. Bottom: 2-D, ground cost $c(x, y) = \|x - y\|$; the black curves are level sets of $g^{\bar c,\varepsilon}$, the colors show the smoothed indicator functions $\chi_j^\varepsilon$ of the Laguerre cells, and the red points are at $y_j$ with size proportional to $g_j$]

  71. Overview
    • Measures and Histograms
    • From Monge to Kantorovitch Formulations
    • Entropic Regularization and Sinkhorn
    • Barycenters
    • Unbalanced OT and Gradient Flows
    • Minimum Kantorovitch Estimators
    • Gromov-Wasserstein


  72. Wasserstein Barycenters
    Barycenters of measures $(\mu_k)_k$, with weights $\lambda_k \geq 0$, $\sum_k \lambda_k = 1$:
    $\mu^\star \in \operatorname{argmin}_\mu \sum_k \lambda_k\,W_2^2(\mu_k, \mu)$

  73. Wasserstein Barycenters
    As on slide 72, plus: this generalizes the Euclidean barycenter: if $\mu_k = \delta_{x_k}$ then $\mu^\star = \delta_{\sum_k \lambda_k x_k}$.
    [Figure: a barycenter $\mu$ of $\mu_1, \mu_2, \mu_3$ with the distances $W_2(\mu_k, \mu)$]

  74. Wasserstein Barycenters
    As on slide 73.

  75. Wasserstein Barycenters
    As on slide 73, plus: for two measures $\mu_1$, $\mu_2$ this is McCann's displacement interpolation.

  76. Wasserstein Barycenters
    As on slide 75.
    [Figure: displacement interpolations between $\mu_1$, $\mu_2$, $\mu_3$]

  77. Wasserstein Barycenters
    As on slide 75, plus:
    Theorem [Agueh, Carlier, 2010] (for $c(x, y) = \|x - y\|^2$): if $\mu_1$ does not vanish on small sets, the barycenter $\mu^\star$ exists and is unique.

  78. Displacement Interpolation


  79. Displacement Interpolation
    [Figure 7.2: displacement interpolation (7.8) of discrete measures $\alpha_0, \alpha_{1/5}, \alpha_{2/5}, \alpha_{3/5}, \alpha_{4/5}, \alpha_1$. Top: point clouds (empirical measures $(\alpha_0, \alpha_1)$ with the same number of points). Bottom: the same with weights.]

  80. Wasserstein Barycenters
    [Figure: barycenters for weights $\lambda \in \Sigma_3$: Wasserstein mean vs. L2 mean]

  81. Wasserstein Barycenters
    [Figure: barycenters for weights $\lambda \in \Sigma_3$: Wasserstein mean vs. L2 mean]

  82. Regularized Barycenters
    $\min_{(\pi_k)_k,\,\mu}\ \Big\{ \sum_k \lambda_k\,\big(\langle c, \pi_k\rangle + \varepsilon\,\mathrm{KL}(\pi_k|\pi_{0,k})\big)\ ;\ \forall k,\ \pi_k \in \Pi(\mu_k, \mu) \Big\}$

  83. Regularized Barycenters
    As on slide 82, plus: → need to fix a discretization grid for $\mu$, i.e. choose $(\pi_{0,k})_k$.

  84. Regularized Barycenters
    As on slide 83, plus: → Sinkhorn-like algorithm [Benamou, Carlier, Cuturi, Nenna, Peyré, 2015].

  85. Regularized Barycenters
    As on slide 84.
    [Figure: barycenters from [Solomon et al, SIGGRAPH 2015]]
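A compact numpy sketch of Sinkhorn-like barycenter iterations (iterative Bregman projections, in the spirit of [Benamou et al. 2015]); the 1-D grid, the bump inputs and ε are illustrative assumptions, and smaller ε would require log-domain stabilization:

```python
import numpy as np

def sinkhorn_barycenter(P, lam, C, eps, n_iter=500):
    """P: (K, n) input histograms on a common grid; lam: barycentric weights summing to 1."""
    K = np.exp(-C / eps)
    V = np.ones_like(P)
    for _ in range(n_iter):
        U = P / np.maximum(V @ K.T, 1e-300)                    # u_k = p_k / (K v_k)
        mu = np.exp(lam @ np.log(np.maximum(U @ K, 1e-300)))   # geometric mean of the K^T u_k
        V = mu[None, :] / np.maximum(U @ K, 1e-300)            # v_k = mu / (K^T u_k)
    return mu

n = 200
t = np.linspace(0, 1, n)
C = (t[:, None] - t[None, :]) ** 2
p1 = np.exp(-((t - 0.2) ** 2) / 0.005); p1 /= p1.sum()
p2 = np.exp(-((t - 0.8) ** 2) / 0.005); p2 /= p2.sum()
bar = sinkhorn_barycenter(np.stack([p1, p2]), np.array([0.5, 0.5]), C, eps=1e-2)
```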

  86. Barycenters of 2D Shapes


  87. Barycenters of 3D Shapes


  88. Barycenter on a Surface
    [Figure: two input measures $\mu_1$, $\mu_2$ on a surface, colorbar from 0 to 1]

  89. Barycenter on a Surface
    [Figure: six input measures $\mu_1, \ldots, \mu_6$ on a surface, colorbar from 0 to 1]

  90. Barycenter on a Surface
    [Figure: barycenter of $\mu_1, \ldots, \mu_6$ for uniform weights $\lambda = (1, \ldots, 1)/6$]

  91. Color Transfer
    Input images: $(f, g)$ (chrominance components).
    Input measures: $\mu(A) = U(f^{-1}(A))$,  $\nu(A) = U(g^{-1}(A))$.
    [Figure: the images $f$, $g$ and their chrominance distributions $\mu$, $\nu$]

  92. Color Transfer
    As on slide 91.

  93. Color Transfer
    As on slide 91, plus the transferred images $T \circ f$ and $\tilde T \circ g$, for transport maps $T$, $\tilde T$ between $\mu$ and $\nu$.

  94. Topic Models
    [Rolet’16]


  95. Overview
    • Measures and Histograms
    • From Monge to Kantorovitch Formulations
    • Entropic Regularization and Sinkhorn
    • Barycenters
    • Unbalanced OT and Gradient Flows
    • Minimum Kantorovitch Estimators
    • Gromov-Wasserstein


  96. Unbalanced Transport
    For $(\xi, \mu) \in \mathcal{M}_+(X)^2$:  $\mathrm{KL}(\xi|\mu) := \int_X \log\big(\tfrac{d\xi}{d\mu}\big)\,d\xi + \int_X (d\mu - d\xi)$.
    $\mathrm{WF}_c(\mu, \nu) := \min_\pi\ \langle c, \pi\rangle + \mathrm{KL}(P_{1\sharp}\pi|\mu) + \mathrm{KL}(P_{2\sharp}\pi|\nu)$   [Liero, Mielke, Savaré 2015]

  97. Unbalanced Transport
    As on slide 96, plus:
    Proposition [Liero, Mielke, Savaré 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]:
    if $c(x, y) = -\log\big(\cos(\min(d(x, y), \tfrac{\pi}{2}))\big)$, then $\mathrm{WF}_c^{1/2}$ is a distance on $\mathcal{M}_+(X)$.

  98. Unbalanced Transport
    As on slide 97, plus:
    → "Dynamic" Benamou-Brenier formulation [Liero, Mielke, Savaré 2015][Kondratyev, Monsaingeon, Vorotnikov, 2015][Chizat, Schmitzer, Peyré, Vialard 2015].
    [Figure: balanced OT vs. unbalanced OT]

  99. Implicit Euler Stepping
    Metric space $(X, d)$; minimize $F(x)$ on $X$.
    Implicit Euler step: $x_{k+1} := \operatorname{argmin}_{x\in X}\ d(x_k, x)^2 + \tau F(x)$
    [Figure: iterates $x_k, x_{k+1}, x_{k+2}, x_{k+3}$ and the balls $\{x\ ;\ d(x_k, x) \sim \tau\}$, for $F(x) = \|x\|^2$ on $(X = \mathbb{R}^2, \|\cdot\|_p)$ with $p = 1, 2, +\infty$]

  100. Implicit vs. Explicit Stepping
    Metric space $(X, d)$; minimize $F(x)$ on $X$.
    Implicit: $x_{k+1} = \operatorname{argmin}_{x\in X}\ d(x_k, x)^2 + \tau F(x)$
    Explicit: $x_{k+1} = \operatorname{argmin}_{x\in X}\ d(x_k, x)^2 + \tau\langle\nabla F(x_k), x\rangle$
    [Figure: implicit vs. explicit trajectories for $F(x) = \|x\|^2$ on $(X = \mathbb{R}^2, \|\cdot\|_p)$ with $p = 1, 2, +\infty$]

  101. Wasserstein Gradient Flows
    Implicit Euler step [Jordan, Kinderlehrer, Otto 1998]:
    $\mu_{t+1} = \operatorname{Prox}^W_{\tau f}(\mu_t) := \operatorname{argmin}_{\mu\in\mathcal{M}_+(X)}\ W_2^2(\mu_t, \mu) + \tau f(\mu)$

  102. Wasserstein Gradient Flows
    As on slide 101, plus the formal limit $\tau \to 0$:  $\partial_t\mu = \operatorname{div}\big(\mu\,\nabla(f'(\mu))\big)$.

  103. Wasserstein Gradient Flows
    As on slide 102, plus: $f(\mu) = \int \log\big(\tfrac{d\mu}{dx}\big)\,d\mu\ \Rightarrow\ \partial_t\mu = \Delta\mu$   (heat diffusion).

  104. Wasserstein Gradient Flows
    As on slide 103, plus: $f(\mu) = \int w\,d\mu\ \Rightarrow\ \partial_t\mu = \operatorname{div}(\mu\nabla w)$   (advection).
    [Figure: evolutions $\mu_t$ under heat diffusion and under advection along $\nabla w$]

  105. Wasserstein Gradient Flows
    As on slide 104, plus: $f(\mu) = \tfrac{1}{m-1}\int \big(\tfrac{d\mu}{dx}\big)^{m-1} d\mu\ \Rightarrow\ \partial_t\mu = \Delta(\mu^m)$   (non-linear diffusion).

  106. Eulerian vs. Lagrangian Discretization


  107. Lagrangian Discretization of Entropy
    Empirical measure $\hat\mu_n = \tfrac{1}{n}\sum_{i=1}^n \delta_{x_i}$, with $\hat\mu_n \to \mu$ as $n \to +\infty$.
    $\hat H(\hat\mu_n) := -\sum_i \log\big(\min_{j\neq i}\|x_i - x_j\|\big)\ \longrightarrow\ H(\mu) := \int \log\big(\tfrac{d\mu}{dx}(x)\big)\,d\mu(x)$   as $n \to +\infty$.

  108. Lagrangian Discretization of Gradient Flows
    $\min_\rho\ E(\rho) := \int V(x)\,\rho(x)\,dx + \int \rho(x)\log(\rho(x))\,dx$,  with $V(x) = \|x\|^2$.
    Wasserstein flow of $E$:  $\dfrac{d\rho_t}{dt} = \Delta\rho_t + \nabla\cdot(\rho_t\,\nabla V)$
    [Figure: Lagrangian (particle) evolution of $\rho_t$ over time]

  109. Generalized Entropic Regularization
    Primal:  $\min_\pi\ \langle d^p, \pi\rangle + f_1(P_{1\sharp}\pi) + f_2(P_{2\sharp}\pi) + \varepsilon\,\mathrm{KL}(\pi|\pi_0)$

  110. Generalized Entropic Regularization
    As on slide 109, plus the dual, with $(a, b) := (e^{u/\varepsilon}, e^{v/\varepsilon})$ and $\pi(x, y) = a(x)\,K(x, y)\,b(y)$:
    $\max_{u,v}\ -f_1^*(-u) - f_2^*(-v) - \varepsilon\,\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$

  111. Generalized Entropic Regularization
    As on slide 110, plus block coordinate relaxation:
    $(I_u)$:  $\max_u\ -f_1^*(-u) - \varepsilon\,\langle e^{u/\varepsilon}, K e^{v/\varepsilon}\rangle$
    $(I_v)$:  $\max_v\ -f_2^*(-v) - \varepsilon\,\langle e^{v/\varepsilon}, K^* e^{u/\varepsilon}\rangle$

  112. Generalized Entropic Regularization
    As on slide 111, plus:
    Proposition: with $\operatorname{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(\mu) := \operatorname{argmin}_\nu\ f_1(\nu) + \varepsilon\,\mathrm{KL}(\nu|\mu)$,
    the solutions of $(I_u)$ and $(I_v)$ read
    $a = \dfrac{\operatorname{Prox}^{\mathrm{KL}}_{f_1/\varepsilon}(K b)}{K b}$   and   $b = \dfrac{\operatorname{Prox}^{\mathrm{KL}}_{f_2/\varepsilon}(K^* a)}{K^* a}$

  113. Generalized Entropic Regularization
    As on slide 112, plus:
    → Only matrix-vector multiplications. → Highly parallelizable.
    → On regular grids: only convolutions! Linear-time iterations.
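A numpy sketch of these scaling iterations instantiated for unbalanced OT with f_i = λ·KL(·|marginal), for which the prox quotient has the closed form (p/(Kb))^(λ/(λ+ε)). The instantiation, toy data and parameters are my assumptions, not code from the slides:

```python
import numpy as np

def unbalanced_sinkhorn(p, q, C, eps, lam, n_iter=2000):
    """Generalized Sinkhorn scalings for f_i = lam * KL(.|marginal)."""
    K = np.exp(-C / eps)
    w = lam / (lam + eps)                  # exponent coming from the KL prox
    a, b = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iter):
        a = (p / (K @ b)) ** w             # a = Prox(Kb) / (Kb)
        b = (q / (K.T @ a)) ** w           # b = Prox(K^T a) / (K^T a)
    return a[:, None] * K * b[None, :]

rng = np.random.default_rng(0)
x, y = rng.random((30, 2)), rng.random((40, 2))
C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
p = np.full(30, 1.0 / 30)
q = np.full(40, 2.0 / 40)                  # the marginals need not have equal mass
T = unbalanced_sinkhorn(p, q, C, eps=0.05, lam=1.0)
```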

  114. Gradient Flows: Crowd Motion
    $\mu_{t+1} := \operatorname{argmin}_\mu\ W_\varepsilon^\alpha(\mu_t, \mu) + \tau f(\mu)$
    Congestion-inducing function [Maury, Roudneff-Chupin, Santambrogio 2010]:
    $f(\mu) = \iota_{[0,\kappa]}(\mu) + \langle w, \mu\rangle$

  115. Gradient Flows: Crowd Motion
    As on slide 114, plus:
    Proposition:  $\operatorname{Prox}^{\mathrm{KL}}_{f/\varepsilon}(\mu) = \min(e^{-w/\varepsilon}\mu,\ \kappa)$

  116. Gradient Flows: Crowd Motion
    As on slide 115.
    [Figure: evolutions driven by $\nabla w$ for congestion levels $\kappa = \|\mu_{t=0}\|_\infty$, $2\|\mu_{t=0}\|_\infty$, $4\|\mu_{t=0}\|_\infty$]
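The congestion prox of this slide is a one-liner; a small sketch (grid discretization and the function name are my own):

```python
import numpy as np

def prox_congestion(mu, w, eps, kappa):
    """KL prox of f(mu) = iota_[0,kappa](mu) + <w, mu>: min(exp(-w/eps) * mu, kappa)."""
    return np.minimum(np.exp(-w / eps) * mu, kappa)
```

Plugged into the generalized Sinkhorn sketch above (a = prox_congestion(K b, w, eps, kappa) / (K b), with the other marginal fixed to µ_t), this gives one implicit (JKO) crowd-motion step under those assumptions.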

  117. Crowd Motion on a Surface
    X = triangulated mesh.
    [Figure: the potential $w$ (shown via $\cos(w)$) and the evolutions for $\kappa = \|\mu_{t=0}\|_\infty$ and $\kappa = 6\|\mu_{t=0}\|_\infty$]

  118. Crowd Motion on a Surface
    As on slide 117.

  119. Gradient Flows: Crowd Motion with Obstacles
    X = sub-domain of $\mathbb{R}^2$.
    [Figure: the potential $w$ (shown via $\cos(w)$) and the evolutions for $\kappa = \|\mu_{t=0}\|_\infty$, $2\|\mu_{t=0}\|_\infty$, $4\|\mu_{t=0}\|_\infty$, $6\|\mu_{t=0}\|_\infty$]

  120. Gradient Flows: Crowd Motion with Obstacles
    As on slide 119.

  121. Multiple-Density Gradient Flows
    $(\mu_{1,t+1}, \mu_{2,t+1}) := \operatorname{argmin}_{(\mu_1,\mu_2)}\ W_\varepsilon^\alpha(\mu_{1,t}, \mu_1) + W_\varepsilon^\alpha(\mu_{2,t}, \mu_2) + \tau f(\mu_1, \mu_2)$

  122. Multiple-Density Gradient Flows
    As on slide 121, plus the Wasserstein attraction functional
    $f(\mu_1, \mu_2) = W_\varepsilon^\alpha(\mu_1, \mu_2) + h_1(\mu_1) + h_2(\mu_2)$,  with $h_i(\mu) = \langle w, \mu\rangle$.
    [Figure: two densities attracted to each other while advected along $\nabla w$]

  123. Multiple-Density Gradient Flows
    As on slide 122, with congestion: $h_i(\mu) = \iota_{[0,\kappa]}(\mu)$.

  124. Overview
    • Measures and Histograms
    • From Monge to Kantorovitch Formulations
    • Entropic Regularization and Sinkhorn
    • Barycenters
    • Unbalanced OT and Gradient Flows
    • Minimum Kantorovitch Estimators
    • Gromov-Wasserstein


  125. Discriminative vs Generative Models
    [Diagram: latent space Z (low dimension) and data space X (high dimension); generative map $g_\theta : Z \to X$, discriminative map $d_\xi : X \to Z$]

  126. Discriminative vs Generative Models
    As on slide 125, plus:
    Supervised: classification, z = class probability → learn $d_\xi$ from labeled data $(x_i, z_i)_i$.

  127. Discriminative vs Generative Models
    As on slide 126, plus:
    Un-supervised: compression ($z = d_\xi(x)$ is a representation) and generation ($x = g_\theta(z)$ is a synthesis) → learn $(g_\theta, d_\xi)$ from data $(x_i)_i$.

  128. Discriminative vs Generative Models
    As on slide 127, plus:
    Density fitting: $g_\theta(\{z_i\}_i) \approx \{x_i\}_i$.    Auto-encoders: $g_\theta(d_\xi(x_i)) \approx x_i$.

  129. Discriminative vs Generative Models
    As on slide 128, plus: optimal transport map $d_\xi$.

  130. Density Fitting and Generative Models
    Parametric model: $\theta \mapsto \mu_\theta$.    Observations: $\nu = \tfrac{1}{n}\sum_{i=1}^n \delta_{x_i}$.

  131. Density Fitting and Generative Models
    As on slide 130, plus density fitting by maximum likelihood (MLE), when $d\mu_\theta(y) = f_\theta(y)\,dy$:
    $\min_\theta\ \widehat{\mathrm{KL}}(\nu|\mu_\theta) := -\sum_i \log(f_\theta(x_i))$

  132. Density Fitting and Generative Models
    As on slide 131, plus the generative model fit $\mu_\theta = g_{\theta\sharp}\zeta$ for $g_\theta : Z \to X$:
    $\widehat{\mathrm{KL}}(\mu_\theta|\nu) = +\infty$ → MLE undefined → need a weaker metric.

  133. Loss Functions for Measures
    Density fitting: $\min_\theta\ D(\mu_\theta, \nu)$.
    Optimal transport distances, for $\mu = \tfrac{1}{N}\sum_i \delta_{x_i}$ and $\nu = \tfrac{1}{P}\sum_j \delta_{y_j}$:
    $W(\mu, \nu)^p := \min_{T \in C_{\mu,\nu}}\ \sum_{i,j} T_{i,j}\,\|x_i - y_j\|^p$
    [Figure: an optimal coupling $T$ between the point clouds of $\mu$ and $\nu$]

  134. Loss Functions for Measures
    As on slide 133, plus Maximum Mean Discrepancy (MMD):
    $\|\mu - \nu\|_k^2 := \tfrac{1}{N^2}\sum_{i,i'} k(x_i, x_{i'}) + \tfrac{1}{P^2}\sum_{j,j'} k(y_j, y_{j'}) - \tfrac{2}{NP}\sum_{i,j} k(x_i, y_j)$
    Gaussian: $k(x, y) = e^{-\|x-y\|^2/(2\sigma^2)}$.    Energy distance: $k(x, y) = -\|x - y\|_2$.

  135. Loss Functions for Measures
    As on slide 134, plus Sinkhorn divergences [Cuturi 13], built from the entropic coupling $T_\varepsilon$:
    $W_\varepsilon(\mu, \nu)^p := \sum_{i,j} T^\varepsilon_{i,j}\,\|x_i - y_j\|^p$
    $\bar W_\varepsilon(\mu, \nu)^p := W_\varepsilon(\mu, \nu)^p - \tfrac{1}{2}W_\varepsilon(\mu, \mu)^p - \tfrac{1}{2}W_\varepsilon(\nu, \nu)^p$

  136. Loss Functions for Measures
    As on slide 135, plus:
    Theorem [Ramdas, G. Trillos, Cuturi 17]:  $\bar W_\varepsilon(\mu, \nu)^p \to W(\mu, \nu)^p$ as $\varepsilon \to 0$, and $\bar W_\varepsilon(\mu, \nu)^p \to \|\mu - \nu\|_k^2$ as $\varepsilon \to +\infty$, for $k(x, y) = -\|x - y\|^p$.

  137. Loss Functions for Measures
    As on slide 136, plus the best of both worlds:
    – Scale free (no $\sigma$, no heavy-tail kernel).
    – Non-Euclidean, arbitrary ground distance.
    – Less biased gradient.
    – No curse of dimension (low sample complexity).
    → cross-validate $\varepsilon$.
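A sketch comparing the two losses of these slides on point clouds: the energy-distance MMD and the debiased Sinkhorn divergence built from entropic OT. The helper names, toy data and ε are my own choices, and the plain (non log-domain) Sinkhorn below is only safe for moderate ε:

```python
import numpy as np

def mmd_energy(x, y):
    d = lambda a, b: np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # k(x, y) = -||x - y||  gives  MMD^2 = 2 E d(X, Y) - E d(X, X') - E d(Y, Y')
    return 2 * d(x, y).mean() - d(x, x).mean() - d(y, y).mean()

def sinkhorn_cost(x, y, eps, p=2, n_iter=500):
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** p
    a = np.full(len(x), 1.0 / len(x)); b = np.full(len(y), 1.0 / len(y))
    K = np.exp(-C / eps); v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v); v = b / (K.T @ u)
    T = u[:, None] * K * v[None, :]
    return (T * C).sum()

def sinkhorn_divergence(x, y, eps):
    return sinkhorn_cost(x, y, eps) - 0.5 * sinkhorn_cost(x, x, eps) - 0.5 * sinkhorn_cost(y, y, eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2)); y = rng.standard_normal((200, 2)) + 1.0
print(mmd_energy(x, y), sinkhorn_divergence(x, y, eps=0.1))
```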

  138. Deep Discriminative vs Generative Models
    ✓1
    ✓2


    z
    x
    g✓
    ⇢ ⇢
    z
    x
    Discriminative
    Generative
    d⇠
    ⇠1 ⇠2
    g✓(z) = ⇢(✓K(. . . ⇢(✓2(⇢(✓1(z) . . .)
    Deep networks:
    d⇠(x) = ⇢(⇠K(. . . ⇢(⇠2(⇢(⇠1(x) . . .)

    View Slide

139. Deep Discriminative vs Generative Models
[Figure: discriminative network d_ξ on the data space X; generative network g_θ pushing the latent space Z (coordinates z_1, z_2) forward to samples x = g_θ(z) in X]
Deep networks:
g_θ(z) = ρ(θ_K(· · · ρ(θ_2(ρ(θ_1(z)))) · · ·))
d_ξ(x) = ρ(ξ_K(· · · ρ(ξ_2(ρ(ξ_1(x)))) · · ·))

    View Slide
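The generator g_θ above is just a composition of parametric layers with a pointwise nonlinearity ρ. A toy numpy sketch where the θ_k are affine maps and ρ is a ReLU; the layer sizes, weight scaling and random initialization are placeholders, none of these choices come from the slides.

import numpy as np

def rho(x):
    # pointwise nonlinearity; ReLU is one common choice for rho
    return np.maximum(x, 0.0)

def make_generator(layer_dims, rng=np.random.default_rng(0)):
    # theta = (theta_1, ..., theta_K), here plain affine maps z -> W z + b
    theta = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
             for n, m in zip(layer_dims[:-1], layer_dims[1:])]
    def g(z):
        # g_theta(z) = rho(theta_K( ... rho(theta_2(rho(theta_1(z)))) ... ))
        x = z
        for W, b in theta:
            x = rho(W @ x + b)
        return x
    return g

g_theta = make_generator([2, 32, 32, 2])   # latent z in R^2, samples x = g_theta(z) in R^2
x = g_theta(np.random.randn(2))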

  140. Examples of Image Generation
    [Credit ArXiv:1511.06434]
g_θ : Z → X

    View Slide

  141. Overview
    • Measures and Histograms
    • From Monge to Kantorovitch Formulations
    • Entropic Regularization and Sinkhorn
    • Barycenters
    • Unbalanced OT and Gradient Flows
    • Minimum Kantorovitch Estimators
    • Gromov-Wasserstein

    View Slide

142. Gromov-Wasserstein
Unregistered spaces X, Y
Inputs: {(similarity/kernel matrix, histogram)}

    View Slide

143. Gromov-Wasserstein
Unregistered spaces X, Y
Inputs: {(similarity/kernel matrix, histogram)}
Def. Gromov-Wasserstein distance [Memoli 2011, Sturm 2012]

    View Slide

144. Gromov-Wasserstein
Unregistered spaces X, Y
Inputs: {(similarity/kernel matrix, histogram)}
Def. Gromov-Wasserstein distance [Memoli 2011, Sturm 2012]
→ NP-hard in general.
→ need for a fast approximate solver.

    View Slide

145. Gromov-Wasserstein as a Metric
[Figure: points x_i in X matched to points y_j in Y]
Def.

    View Slide

146. Gromov-Wasserstein as a Metric
[Figure: points x_i in X matched to points y_j in Y; maps f between X and Y]
Def.   ⟺
Def. Isometries on M:

    View Slide

147. Gromov-Wasserstein as a Metric
[Figure: points x_i in X matched to points y_j in Y; maps f between X and Y]
Def.   ⟺
Def. Isometries on M:
Prop. [Memoli 2011]
→ "bending-invariant" object recognition.

    View Slide

148. For Arbitrary Spaces
Metric-measure spaces X, Y: (d_X, µ), (d_Y, ν)

    View Slide

149. For Arbitrary Spaces
Metric-measure spaces X, Y: (d_X, µ), (d_Y, ν)
[Figure: pairwise distances d_X(x, x') within X and d_Y(y, y') within Y]
Def. Gromov-Wasserstein distance [Memoli 2011, Sturm 2012]:
GW₂(d_X, µ, d_Y, ν)²  def.=  min_{π ∈ Π(µ,ν)}  ∫_{X²×Y²} |d_X(x, x') − d_Y(y, y')|² dπ(x, y) dπ(x', y')

    View Slide

150. For Arbitrary Spaces
Metric-measure spaces X, Y: (d_X, µ), (d_Y, ν)
[Figure: pairwise distances d_X(x, x') within X and d_Y(y, y') within Y; an isometry f : X → Y]
Def. Gromov-Wasserstein distance [Memoli 2011, Sturm 2012]:
GW₂(d_X, µ, d_Y, ν)²  def.=  min_{π ∈ Π(µ,ν)}  ∫_{X²×Y²} |d_X(x, x') − d_Y(y, y')|² dπ(x, y) dπ(x', y')
Prop. GW is a distance on mm-spaces up to isometries.
→ "bending-invariant" object recognition.

    View Slide

151. For Arbitrary Spaces
Metric-measure spaces X, Y: (d_X, µ), (d_Y, ν)
[Figure: pairwise distances d_X(x, x') within X and d_Y(y, y') within Y; an isometry f : X → Y]
Def. Gromov-Wasserstein distance [Memoli 2011, Sturm 2012]:
GW₂(d_X, µ, d_Y, ν)²  def.=  min_{π ∈ Π(µ,ν)}  ∫_{X²×Y²} |d_X(x, x') − d_Y(y, y')|² dπ(x, y) dπ(x', y')
Prop. GW is a distance on mm-spaces up to isometries.
→ "bending-invariant" object recognition.
→ QAP: NP-hard in general.
→ need for a fast approximate solver.

    View Slide
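In the discrete case (distance matrices dX, dY and a coupling π), the GW² objective above can be evaluated without forming the 4-way tensor, using the expansion ⟨dX² µ, µ⟩ + ⟨dY² ν, ν⟩ − 2⟨dX π dY, π⟩ with entrywise squares and µ, ν the marginals of π. A small numpy sketch, with names of my choosing:

import numpy as np

def gw_energy(dX, dY, pi):
    # sum_{i,i',j,j'} |dX[i,i'] - dY[j,j']|^2 * pi[i,j] * pi[i',j']
    mu, nu = pi.sum(axis=1), pi.sum(axis=0)      # marginals of the coupling
    const = (dX**2 @ mu) @ mu + (dY**2 @ nu) @ nu
    cross = np.sum((dX @ pi @ dY) * pi)          # uses symmetry of dX and dY
    return const - 2.0 * cross

# e.g. the independent coupling pi = mu nu^T is always feasible:
#   gw_energy(dX, dY, np.outer(mu, nu))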

152. Entropic Gromov-Wasserstein
Def. Entropic Gromov-Wasserstein
Projected mirror descent:

    View Slide

153. Entropic Gromov-Wasserstein
Def. Entropic Gromov-Wasserstein
Projected mirror descent:
Prop. For τ = 1/ε, the iteration reads:
func T = GW(C, C̄, p, q)
  initialization:
  repeat:
    …
  until convergence.
  return T

    View Slide
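A sketch of the entropic GW solver in the spirit of the pseudocode above, for the square loss |d_X − d_Y|²: each iteration linearizes the GW energy at the current plan and projects it with Sinkhorn, which (up to constants) is the τ = 1/ε mirror-descent step. Function names, iteration counts and the independent-coupling initialization are my choices, not the slide's.

import numpy as np

def sinkhorn_plan(C, p, q, eps, n_iter=500):
    # Sinkhorn projection onto the couplings with marginals p, q
    K = np.exp(-C / eps)
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]

def entropic_gw(dX, dY, p, q, eps, n_outer=50):
    # T = GW(C, Cbar, p, q): alternate "linearize the square-loss GW energy at T"
    # and "Sinkhorn-project the linearized cost"
    T = np.outer(p, q)                               # initialization
    for _ in range(n_outer):
        C = (dX**2 @ p)[:, None] + (dY**2 @ q)[None, :] - 2.0 * dX @ T @ dY
        T = sinkhorn_plan(C, p, q, eps)
    return T

In practice one would monitor the change of T between outer iterations rather than run a fixed number of them.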

154. Applications of GW: Shapes Analysis
Use T to define registration between: colors distribution ↔ shape, shape ↔ shape.
[Figure: entropic GW correspondences between a source surface and various targets (a similar surface, a semantically related surface, a noisy 3D point cloud, an icon, a hand drawing); screenshots from Solomon, Peyré, Kim, Sra 2016]

    View Slide

155. Applications of GW: Shapes Analysis
Use T to define registration between: colors distribution ↔ shape, shape ↔ shape.
Shapes (X_s)_s → geodesic distances → GW distances → MDS visualization (in 2-D and 3-D).
[Figure: MDS embedding of four classes from the SHREC dataset (teddies, humans, four-legged, armadillo); screenshots from Solomon, Peyré, Kim, Sra 2016]

    View Slide

156. Applications of GW: Quantum Chemistry
Regression problem:
→ Computing f by solving a DFT approximation is too costly.

    View Slide

157. Applications of GW: Quantum Chemistry
Regression problem:
→ Computing f by solving a DFT approximation is too costly.
[Rupp et al 2012]
[Matrix representation of each molecule, displayed as a bracketed matrix on the slide]

    View Slide

158. Applications of GW: Quantum Chemistry
Regression problem:
→ Computing f by solving a DFT approximation is too costly.
GW-interpolation: [Rupp et al 2012]
[Matrix representation of each molecule, displayed as a bracketed matrix on the slide]

    View Slide

  159. Gromov-Wasserstein Geodesics
    Def. Gromov-Wasserstein Geodesic

    View Slide

  160. Gromov-Wasserstein Geodesics
    Def. Gromov-Wasserstein Geodesic
    Prop.
    [Sturm 2012]

    View Slide

161. Gromov-Wasserstein Geodesics
Def. Gromov-Wasserstein Geodesic
Prop. [Sturm 2012]
→ X × Y is not practical for most applications (need to fix the size of the geodesic embedding space).
→ Extension to more than 2 input spaces?

    View Slide

162. Gromov-Wasserstein Barycenters
Input: spaces 1, 2, 3 (similarity/kernel matrices with histograms)
Def. GW Barycenters

    View Slide

163. Gromov-Wasserstein Barycenters
Input: spaces 1, 2, 3 (similarity/kernel matrices with histograms)
Def. GW Barycenters
Alternating minimization:
  initialization: C ← C_0
  repeat:
    for s = 1 to S do
      on T_s: …
    on C: …
  until convergence.
  return C

    View Slide
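A sketch of the alternating minimization above for the square loss, reusing the entropic_gw routine sketched after the entropic GW slide: given the current barycenter matrix C, update each coupling T_s, then update C. The closed-form C-update below (a weighted average of T_s C_s T_sᵀ rescaled by the barycenter histogram) is my reconstruction for the square loss; treat the whole block as a sketch rather than the slide's exact algorithm.

import numpy as np

def gw_barycenter(Cs, ps, lambdas, p, eps, n_outer=30, n_inner=20):
    # Cs: list of similarity matrices, ps: their histograms,
    # lambdas: barycenter weights (summing to 1), p: barycenter histogram
    N = len(p)
    rng = np.random.default_rng(0)
    C = rng.random((N, N)); C = (C + C.T) / 2        # initialization C <- C0 (placeholder)
    for _ in range(n_outer):
        # on T_s: entropic GW between the current barycenter and each input space
        Ts = [entropic_gw(C, Cs[s], p, ps[s], eps, n_outer=n_inner)
              for s in range(len(Cs))]
        # on C: closed-form update for the square loss
        C = sum(l * T @ Ck @ T.T for l, T, Ck in zip(lambdas, Ts, Cs))
        C = C / np.outer(p, p)
    return C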

  164. Conclusion: Toward High-dimensional OT
    Monge Kantorovich Dantzig Brenier Otto McCann Villani
[Figure: color transfer via sliced Wasserstein projection of the source image onto the style image's color statistics (J. Rabin)]

    View Slide