
# Numerical Optimal Transport

Slides for a course.

## Gabriel Peyré

December 07, 2017

## Transcript

1. Numerical Optimal Transport
Gabriel Peyré
École Normale Supérieure
… with some demos
www.numerical-tours.com
http://optimaltransport.github.io

2. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

3. Comparing Measures and Spaces
[Figure: color transfer. Source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of Y, source image after color transfer. J. Rabin, Wasserstein Regularization]
→ images, vision, graphics and machine learning, …
• Probability distributions and histograms

4. Comparing Measures and Spaces
(Same as slide 3, adding:)
• Optimal transport
[Figure: optimal transport mean vs. L2 mean.]

5. Probability Measures
Positive Radon measure µ on a set X.
Density: dµ(x) = m(x)dx
Measure of sets A ⊂ X: µ(A) = ∫_A dµ(x) ≥ 0
Discrete measure: µ = Σ_i µ_i δ_{x_i}
[Figure 2.1 of the course notes: discrete distributions α = Σ_{i=1}^n a_i δ_{x_i} (red: empirical uniform distribution a_i = 1/n, blue: arbitrary distributions) and densities dα(x) = ρ_α(x)dx w.r.t. the Lebesgue measure, often denoted ρ_α = dα/dx (in violet), in both 1-D and 2-D. Discrete distributions are drawn as vertical segments of length a_i in 1-D and as point clouds of radius a_i in 2-D.]

6. Probability Measures
(Same as slide 5, adding:)
Integration against continuous functions: ∫_X g(x)dµ(x) ≥ 0 (for g ≥ 0)
Density case: ∫_X g dµ = ∫_X g(x) m(x) dx
Discrete case: ∫_X g dµ = Σ_i µ_i g(x_i)

7. Probability Measures
(Same as slide 6, adding:)
Probability (normalized) measure: µ(X) = ∫_X dµ(x) = 1

8. Measures and Random Variables
P(X ∈ A) = ∫_A dµ(x)
Convergence in law: P(X_n ∈ A) → P(X ∈ A) as n → +∞, ∀ set A
Weak* convergence: ∫ f dµ_n → ∫ f dµ as n → +∞, ∀ continuous function f

9. Measures and Random Variables
(Same as slide 8, adding:)
Weak convergence: …

10. Discretization: Histogram vs. Empirical
Discrete measure: µ = Σ_{i=1}^N µ_i δ_{x_i}, x_i ∈ X, Σ_i µ_i = 1
Lagrangian (point clouds): constant weights µ_i = 1/N, free positions; quotient space X^N/Σ_N.
Eulerian (histograms): fixed positions x_i (e.g. a grid), weights in the convex polytope (simplex) {(µ_i)_i ≥ 0 ; Σ_i µ_i = 1}.
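
As a concrete illustration (a minimal numpy sketch of my own, not from the slides): the same underlying measure represented as a Lagrangian point cloud with constant weights, and as an Eulerian histogram with fixed grid positions and free weights on the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.random((1000, 2))          # draws from an underlying measure on [0,1]^2

# Lagrangian (point cloud): free positions x_i, constant weights 1/N.
N = 100
lagrangian_x = samples[:N]
lagrangian_w = np.full(N, 1.0 / N)

# Eulerian (histogram): fixed grid positions, free weights on the simplex.
bins = 16
hist, _, _ = np.histogram2d(samples[:, 0], samples[:, 1],
                            bins=bins, range=[[0, 1], [0, 1]])
eulerian_w = (hist / hist.sum()).ravel() # (mu_i)_i >= 0, sum_i mu_i = 1
```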

11. Push Forward
Radon measures (µ, ν) on (X, Y).
Transfer of measure by f : X → Y: push-forward ν = f♯µ, defined by
ν(A) := µ(f⁻¹(A))  ⟺  ∫_Y g(y)dν(y) := ∫_X g(f(x))dµ(x)
[Figure: y = f(x); f maps X (carrying µ) to Y (carrying ν); sets A and f⁻¹(A).]

12. Push Forward
(Same as slide 11, adding:)
Smooth densities dµ = ρ(x)dx, dν = ξ(x)dx:
f♯µ = ν  ⟺  ξ(f(x)) |det(∂f(x))| = ρ(x)

13. Push-forward vs. Pull-back
f : X → Y
Measures, push-forward f♯ : M(X) → M(Y): f♯(Σ_i µ_i δ_{x_i}) := Σ_i µ_i δ_{f(x_i)}
Functions, pull-back f^♯ : C(Y) → C(X): f^♯φ := φ ∘ f
Remark: f♯ and f^♯ are adjoint: ∫_Y φ d(f♯µ) = ∫_X (f^♯φ) dµ
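
A numerical check of this adjunction (a sketch; the toy choices of f and φ are my own, the identity is the one on the slide):

```python
import numpy as np

f   = lambda x: x ** 2                  # map f : X -> Y
phi = lambda y: np.cos(y)               # test function phi in C(Y)

x  = np.linspace(0.0, 1.0, 200)         # support of mu = sum_i mu_i delta_{x_i}
mu = np.full(x.size, 1.0 / x.size)

# Push-forward: f#mu = sum_i mu_i delta_{f(x_i)} (points move, weights stay).
push_points = f(x)
lhs = np.sum(mu * phi(push_points))     # integral of phi against f#mu
rhs = np.sum(mu * phi(f(x)))            # integral of the pull-back phi∘f against mu
assert np.isclose(lhs, rhs)             # the two operators are adjoint
```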

14. Convergence of Random Variables
Almost sure: P(lim_{n→+∞} X_n = X) = 1
In mean: lim_{n→+∞} E(|X_n − X|^p) = 0
In probability: ∀ ε > 0, P(|X_n − X| > ε) → 0 as n → +∞
In law: P(X_n ∈ A) → P(X ∈ A) (the X_n can be defined on different spaces)
Implications: almost sure ⟹ in probability ⟹ in law; in mean ⟹ in probability.

15. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

16. Gaspard Monge (1746-1818)
(1784)

17. Monge Transport
min_{ν = f♯µ} ∫_X c(x, f(x)) dµ(x)
[Figure: map f from X (µ) to Y (ν).]

18. Monge Transport
(Same as slide 17, adding:)
Theorem [Brenier]: for c(x, y) = ‖x − y‖², (µ, ν) with density, there exists a unique optimal f. One has f = ∇φ where φ is the unique convex function such that (∇φ)♯µ = ν.

19. Monge Transport
(Same as slide 18, adding:)
Monge-Ampère equation: ξ(∇φ) det(∂²φ) = ρ

20. Monge Transport
(Same as slide 19, adding:)
Non-uniqueness / non-existence:
[Figure: examples of non-unique optimal maps f, f′ and of measures that no map can transport.]

21. Leonid Kantorovich (1912-1986)
Леонид Витальевич Канторович

22. Before Kantorovitch
Optimal transport was formulated in 1930 by A.N. Tolstoi, 12 years before Kantorovich. He even solved a "large scale" 10×68 instance!

23. Kantorovitch's Formulation
Points (x_i)_i, (y_j)_j with input distributions p, q.
Def. Couplings: C_{p,q} := {T ≥ 0 ; Σ_j T_{i,j} = p_i, Σ_i T_{i,j} = q_j}

24. Kantorovitch's Formulation
(Same as slide 23, adding:)
Def. Wasserstein Distance / EMD [Kantorovich 1942]: min_{T ∈ C_{p,q}} Σ_{i,j} T_{i,j} c(x_i, y_j)

25. OT Between General Measures
Marginals: P1♯π(S) := π(S, X), P2♯π(S) := π(X, S)
Couplings: Π(µ, ν) := {π ∈ M₊(X × X) ; P1♯π = µ, P2♯π = ν}
Optimal transport [Kantorovitch 1942]:
W_p^p(µ, ν) := min { ⟨dᵖ, π⟩ = ∫_{X×X} d(x, y)^p dπ(x, y) ; π ∈ Π(µ, ν) }

26. Couplings: the 3 Settings
[Figure: couplings between (α, β) in the discrete, continuous, and semi-discrete settings.]

27. Couplings

28. 1-D Optimal Transport
Remark. If Ω = ℝ, c(x, y) = c(|x − y|) with c convex, and F_µ⁻¹, F_ν⁻¹ are the quantile functions, then
W(µ, ν) = ∫₀¹ c(|F_µ⁻¹(x) − F_ν⁻¹(x)|) dx
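
On the line this formula turns optimal transport into a sort: for uniform empirical measures with the same number of points, the quantile functions evaluated at the levels k/n are the sorted samples. A minimal sketch (the toy data is assumed):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p^p between mu = mean of delta_{x_i} and nu = mean of delta_{y_j}
    (same number of points): sorting evaluates F_mu^{-1}, F_nu^{-1} at k/n."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)

rng = np.random.default_rng(0)
mu_pts = rng.normal(0.0, 1.0, size=1000)
nu_pts = rng.normal(2.0, 0.5, size=1000)
print(wasserstein_1d(mu_pts, nu_pts))   # ≈ (0-2)^2 + (1-0.5)^2 = 4.25 for Gaussians
```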

29. OT Between Gaussians
Remark. If Ω = ℝᵈ, c(x, y) = ‖x − y‖², and µ = N(m_µ, Σ_µ), ν = N(m_ν, Σ_ν), then
W₂²(µ, ν) = ‖m_µ − m_ν‖² + B(Σ_µ, Σ_ν)²
where B is the Bures metric
B(Σ_µ, Σ_ν)² = trace(Σ_µ + Σ_ν − 2(Σ_µ^{1/2} Σ_ν Σ_µ^{1/2})^{1/2}).
The map T : x ↦ m_ν + A(x − m_µ) is optimal, where
A = Σ_µ^{−1/2} (Σ_µ^{1/2} Σ_ν Σ_µ^{1/2})^{1/2} Σ_µ^{−1/2}.
[Figure: 1-D Gaussian densities ρ_α, ρ_β, with density (1/√(2πσ²)) e^{−(x−m)²/(2σ²)}.]
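
The closed form is easy to evaluate; a sketch using scipy's matrix square root (the inputs are a toy example of mine):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussians(m1, S1, m2, S2):
    """W2^2(N(m1,S1), N(m2,S2)) = ||m1 - m2||^2 + Bures(S1, S2)^2."""
    S1h = sqrtm(S1)
    cross = np.real(sqrtm(S1h @ S2 @ S1h))   # (S1^{1/2} S2 S1^{1/2})^{1/2}
    bures2 = np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sum((m1 - m2) ** 2) + bures2)

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(w2_gaussians(m1, S1, m2, S2))
```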

30. OT on Gaussians and Bures' Distance

31. W1 OT and Min-cost Flows

32. Metrics on the Space of Measures
Bins-to-bins metrics, for dµ(x) = ρ(x)dx and dµ̃(x) = ρ̃(x)dx:
Kullback-Leibler divergence: D_KL(µ, µ̃) = ∫ ρ(x) log(ρ(x)/ρ̃(x)) dx
Hellinger distance: D_H(µ, µ̃)² = ∫ (√ρ(x) − √ρ̃(x))² dx

33. Metrics on the Space of Measures
(Same as slide 32, adding:)
Effect of a translation µ_τ of µ: D(µ, µ_τ) ≈ cst, while W₂(µ, µ_τ) = τ.

34. Csiszár Divergence vs. Dual Norms
Csiszár divergences: D_φ(α|β) := ∫_X φ(dα/dβ) dβ
Dual norms: ‖α − β‖_B := max_{f ∈ B} ∫_X f(x)(dα(x) − dβ(x))
D_φ → strong topology: KL, TV, χ², Hellinger, …
‖·‖_B → weak topology: W₁, flat norm, RKHS, energy distance, …

35. Csiszár Divergence
Csiszár divergences: a unifying way to define losses between arbitrary positive measures (discrete & densities). https://en.wikipedia.org/wiki/F-divergence

36. Dual Norms
Dual norms (aka Integral Probability Metrics):
‖α − β‖_B := max { ∫_X f(x)(dα(x) − dβ(x)) ; f ∈ B }
Wasserstein-1: B = {f ; ‖∇f‖_∞ ≤ 1}.
Flat norm: B = {f ; ‖f‖_∞ ≤ 1, ‖∇f‖_∞ ≤ 1}.
RKHS: B = {f ; ‖f‖_k ≤ 1}, for which
‖α − β‖_B² = ∫ k(x, x′)dα(x)dα(x′) + ∫ k(y, y′)dβ(y)dβ(y′) − 2 ∫ k(x, y)dα(x)dβ(y)
Energy distance: k(x, y) = −‖x − y‖. Gaussian: k(x, y) = e^{−‖x−y‖²/(2σ²)}.
[Figure: the four norms (Energy, Gauss, W₁, Flat) compared on translated measures (shifts t and t/2).]
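
The RKHS case has a closed form; below is a minimal sketch of ‖α − β‖²_k for uniform point clouds, with the two kernels from the slide (the data is a toy assumption):

```python
import numpy as np

def mmd2(x, y, k):
    """||alpha - beta||_k^2 for alpha = mean of delta_{x_i}, beta = mean of delta_{y_j}."""
    kxx = k(x[:, None, :], x[None, :, :]).mean()
    kyy = k(y[:, None, :], y[None, :, :]).mean()
    kxy = k(x[:, None, :], y[None, :, :]).mean()
    return kxx + kyy - 2.0 * kxy

energy = lambda a, b: -np.linalg.norm(a - b, axis=-1)                # energy distance
gauss  = lambda a, b: np.exp(-np.sum((a - b) ** 2, -1) / 2.0)        # Gaussian, sigma = 1

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(1.0, 1.0, size=(300, 2))
print(mmd2(x, y, energy), mmd2(x, y, gauss))
```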

37. RKHS Norms aka Maximum Mean Discrepancy
38. The Earth Mover's Distance
dist(I₁, I₂) = W₁(µ, ν)   [Rubner '98]
[Figure: color histograms µ, ν of two images I₁, I₂.]

39. 29
The Word Mover’s Distance
[Kusner’15] dist(D1, D2) = W2(µ, ⌫)
µ

40. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Linear Programming and Semi-discrete
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

41. Algorithms
Linear programming: µ = Σ_{i=1}^{N₁} p_i δ_{x_i}, ν = Σ_{j=1}^{N₂} q_j δ_{y_j}

42. Algorithms
(Same as slide 41, adding:)
Cost ∼ O(N³).
Hungarian/Auction: µ = (1/N) Σ_{i=1}^N δ_{x_i}, ν = (1/N) Σ_{j=1}^N δ_{y_j}

43. Algorithms
(Same as slide 42, with an illustration of the assignment.)

44. Algorithms
(Same as slide 43, adding:)
Monge-Ampère / Benamou-Brenier, d = ‖·‖₂².
[Excerpt from a performance study of solvers: the network simplex behaves in O(n²) in this context, a major gain at the scale of thousands of particles; fixed-point arithmetic further speeds up the computation (the final transport cost is then less accurate, but the particle pairing driving the interpolation is unchanged). Current EMD image-retrieval and color-transfer techniques rely on slower solvers (Rubner et al. 2000; Kanters et al. 2003; Morovic and Sun 2003). Figure 6: log-log plot of running times of Network Simplex (fixed point), Network Simplex (double precision) and Transport Simplex vs. number of bins per histogram, against reference curves y = αx² and y = βx³. Figure 7: synthetic 2-D examples on a Euclidean domain, with interpolations for α = 1/4, 1/2, 3/4 between the input distributions.]

45. Algorithms
(Same as slide 44, adding:)
Semi-discrete: Laguerre cells, d = ‖·‖₂². [Lévy '15]

46. Algorithms
(Same as slide 45.)

47. Algorithms
(Same as slide 46, adding:)
→ Need for fast approximate algorithms for generic c.
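
For small problems the Kantorovitch LP can be handed to a generic solver; a sketch with scipy (dedicated network-simplex codes are far faster at scale, as the excerpt above notes):

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(p, q, C):
    """min_{T >= 0, T1 = p, T^T 1 = q} <C, T>; returns the optimal coupling T."""
    n1, n2 = C.shape
    A_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):
        A_eq[i, i * n2:(i + 1) * n2] = 1.0   # row sums:    sum_j T_ij = p_i
    for j in range(n2):
        A_eq[n1 + j, j::n2] = 1.0            # column sums: sum_i T_ij = q_j
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.x.reshape(n1, n2)

x = np.linspace(0, 1, 5)[:, None]
y = np.linspace(0, 1, 4)[:, None]
C = (x - y.T) ** 2                           # squared-distance cost
p, q = np.full(5, 1 / 5), np.full(4, 1 / 4)
T = ot_lp(p, q, C)
print(T.sum(axis=1), (C * T).sum())          # marginals ≈ p; transport cost
```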

48. C-Transform
Alternating maximization of the dual E(f, g) over f or g leads to the notion of c-transform:
∀ y ∈ Y, f^c(y) := inf_{x ∈ X} c(x, y) − f(x)
∀ x ∈ X, g^c̄(x) := inf_{y ∈ Y} c(x, y) − g(y),  where c̄(y, x) := c(x, y).
Indeed, one can check that f^c ∈ argmax_g E(f, g) and g^c̄ ∈ argmax_f E(f, g). These partial maximizations extend the dual potentials from the supports of α and β to the whole spaces X and Y. When X = ℝᵈ and c(x, y) = ‖x − y‖^p, the c-transform f^c is the inf-convolution between −f and ‖·‖^p, often called a "Hopf-Lax formula".
The map (f, g) ∈ C(X) × C(Y) ↦ (g^c̄, f^c) replaces potentials by "better" ones (improving the dual objective E). Functions of the form f^c and g^c̄ are called c-concave and c̄-concave; in the special case c(x, y) = ⟨x, y⟩ on X = Y = ℝᵈ this coincides with the usual notion of concave functions. Iterating the map, the dual can be reformulated as an unconstrained program over a single potential:
L_c(α, β) = max_{f ∈ C(X)} ∫_X f(x)dα(x) + ∫_Y f^c(y)dβ(y) = max_{g ∈ C(Y)} ∫_X g^c̄(x)dα(x) + ∫_Y g(y)dβ(y).
Semi-discrete formulation: when β = Σ_j b_j δ_{y_j} is discrete, the c̄-transform restricts to the support (y_j)_j of β:
∀ g ∈ ℝᵐ, ∀ x ∈ X, g^c̄(x) := min_j c(x, y_j) − g_j,
which maps a vector g to a continuous function g^c̄ ∈ C(X).
[Figure 5.1 of the course notes: semi-discrete c̄-transforms g^c̄ in 1-D, for ground costs |x − y|^p with p = 1/2, 1, 3/2, 2.]

49. Semi-discrete Descent Algorithm
[Figure 5.2 of the course notes: iterations ℓ = 1, 3, 50, 100 of the semi-discrete OT algorithm (a simple gradient descent on (5.8)) and the final matching. Red points: support (y_j)_j of the discrete measure β; the continuous measure α is uniform on a square; blue cells: Laguerre partition (L_{g^{(ℓ)}}(y_j))_j for the dual potential g^{(ℓ)} at iteration ℓ.]

50. Semi-discrete Stochastic Descent
Stochastic gradient descent for semi-discrete optimal transport: illustration of convergence and the corresponding Laguerre cells.
[Figure: log(E(g*) − E(g^{(ℓ)})) over iterations ℓ up to 10000.]
https://arxiv.org/abs/1605.08527
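
A sketch of such a scheme, reconstructed as a standard averaged SGD on the semi-discrete dual E(g) = E_{x∼α}[min_j c(x, y_j) − g_j] + Σ_j g_j b_j (the step sizes and the toy setup are my assumptions, not the paper's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
y = rng.random((m, 2))                         # support of the discrete measure beta
b = np.full(m, 1 / m)                          # weights of beta
g = np.zeros(m)                                # dual potential
g_avg = np.zeros(m)

for ell in range(1, 10001):
    x = rng.random(2)                          # sample x ~ alpha (uniform square)
    j = np.argmin(np.sum((y - x) ** 2, 1) - g) # Laguerre cell containing x
    grad = b.copy()                            # stochastic gradient: b - e_j
    grad[j] -= 1.0
    g += (1.0 / np.sqrt(ell)) * grad           # ascent step on the concave dual
    g_avg += (g - g_avg) / ell                 # Polyak-Ruppert averaging
```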

51. Entropic Regularization
The simplex of histograms with N bins is Σ_N := {p ∈ ℝ₊^N ; Σ_i p_i = 1}.
Entropy: the entropy of T ∈ ℝ₊^{N×N} is defined as H(T) := −Σ_{i,j=1}^N T_{i,j}(log(T_{i,j}) − 1).
The set of couplings between histograms p ∈ Σ_{N₁} and q ∈ Σ_{N₂} is
C_{p,q} := {T ∈ (ℝ₊)^{N₁×N₂} ; T 1_{N₂} = p, Tᵀ 1_{N₁} = q},  where 1_N := (1, …, 1)ᵀ ∈ ℝ^N.
Def. Regularized OT [Cuturi NIPS'13]: min_{T ∈ C_{p,q}} ⟨C, T⟩ − ε H(T)

52. Entropic Regularization
(Same as slide 51, adding:)
Regularization impact on the solution:
[Figure: cost c and couplings T_ε for increasing ε.]

53. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

54. Sinkhorn's Algorithm

55. Sinkhorn's Algorithm
(Same as slide 54.)

56. Sinkhorn's Algorithm
Only matrix/vector multiplications. → Parallelizable. → Streams well on GPU.

57. Sinkhorn's Algorithm
(Same as slide 56.)

58. Optimal Transport on Surfaces
Triangulated mesh M. Geodesic distance d_M.

59. Optimal Transport on Surfaces
(Same as slide 58, adding:)
Ground cost: c(x, y) = d_M(x, y)^α.
[Figure: level sets of d(x_i, ·) around a point x_i.]

60. Optimal Transport on Surfaces
(Same as slide 59, adding:)
Computing c (Fast Marching): N² log(N) → too costly.

61. Entropic Transport on Surfaces
Heat equation on M: ∂_t u_t(x, ·) = Δ_M u_t(x, ·), u_{t=0}(x, ·) = δ_x, run until time t = ε.

62. Entropic Transport on Surfaces
(Same as slide 61, adding:)
→ −ε log(u_ε) ≈ d_M² for small ε.

63. Entropic Transport on Surfaces
(Same as slide 62, adding:)
Sinkhorn kernel: K := e^{−d_M²/ε} ≈ u_ε ≈ (Id − (ε/ℓ) Δ_M)^{−ℓ} (ℓ implicit heat steps).

64. MRI Data Processing [with A. Gramfort]
Ground cost c = d_M: geodesic distance on the cortical surface M.
[Figure: L2 barycenter vs. W₂² barycenter.]

65. Regularization for General Measures
Schrödinger's problem:
π_ε := argmin_π { ⟨dᵖ, π⟩ + ε KL(π|π₀) ; π ∈ Π(µ, ν) }
     = argmin_{π ∈ Π(µ,ν)} KL(π|K),  where K(x, y) := e^{−d^p(x,y)/ε} π₀(x, y)
Landmark computational paper: [Cuturi 2013].

66. Regularization for General Measures
(Same as slide 65, adding:)
Proposition [Carlier, Duval, Peyré, Schmitzer 2015]:
π_ε → argmin_{π ∈ Π(µ,ν)} ⟨dᵖ, π⟩ as ε → 0,  and  π_ε → µ ⊗ ν as ε → +∞.
[Figure: π_ε for increasing ε.]

67. Back to Sinkhorn's Algorithm
Optimal transport problem: f₁ = ι_µ, f₂ = ι_ν, so that
Prox^{KL}_{f₁/ε}(µ̃) = µ and Prox^{KL}_{f₂/ε}(ν̃) = ν.

68. Back to Sinkhorn's Algorithm
(Same as slide 67, adding:)
Sinkhorn/IPFP algorithm [Sinkhorn 1967] [Deming, Stephan 1940]:
a^{(ℓ+1)} := µ / (K b^{(ℓ)})   and   b^{(ℓ+1)} := ν / (K* a^{(ℓ+1)})
[Figure: iterates π^{(ℓ)} for varying ε.]

69. Back to Sinkhorn's Algorithm
(Same as slide 68, adding:)
Proposition [Franklin, Lorenz 1989]: with π^{(ℓ)} := diag(a^{(ℓ)}) K diag(b^{(ℓ)}),
‖log(π^{(ℓ)}) − log(π*)‖_∞ = O((1 − λ)^ℓ)  (the rate λ depends on ε and degrades as ε → 0).
Local rate: [Knight 2008].
[Figure: log(‖log(π^{(ℓ)}) − log(π*)‖_∞) over iterations ℓ.]

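Each Sinkhorn pass is two componentwise divisions and two matrix-vector products; a minimal numpy sketch (the grid, cost, and histograms are a toy setup of mine):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iter=2000):
    """Entropic coupling pi = diag(a) K diag(b), with K = exp(-C/eps)."""
    K = np.exp(-C / eps)
    b = np.ones_like(nu)
    for _ in range(n_iter):
        a = mu / (K @ b)          # a^{(l+1)} = mu / (K b^{(l)})
        b = nu / (K.T @ a)        # b^{(l+1)} = nu / (K* a^{(l+1)})
    return a[:, None] * K * b[None, :]

x = np.linspace(0, 1, 50)
C = (x[:, None] - x[None, :]) ** 2
mu = np.exp(-((x - 0.25) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((x - 0.75) ** 2) / 0.02); nu /= nu.sum()
pi = sinkhorn(mu, nu, C, eps=1e-3)
print((C * pi).sum())             # regularized transport cost
```
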
70. Semi-discrete OT and Entropy
Semi-discrete optimal transport is a multiclass SVM; its Sinkhorn version is a logistic regression. It decomposes the space into entropic Laguerre diagrams.
[Figure 5.3 of the course notes: entropic semi-discrete c̄-transforms g^{c̄,ε} for ε = 0, 0.01, 0.1, 0.3. Top: 1-D, ground cost c(x, y) = |x − y|, with red points at locations (y_j, −g_j). Bottom: 2-D, ground cost c(x, y) = ‖x − y‖; black curves are the level sets of g^{c̄,ε}, colors indicate the smoothed indicator functions χ_j^ε of the Laguerre cells, red points at locations y_j ∈ ℝ² with size proportional to g_j. Laguerre cells vs. "Sinkhorn" Laguerre cells.]

71. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

72. Wasserstein Barycenters
Barycenters of measures (µ_k)_k, with weights λ_k ≥ 0, Σ_k λ_k = 1:
µ* ∈ argmin_µ Σ_k λ_k W₂²(µ_k, µ)

73. Wasserstein Barycenters
(Same as slide 72, adding:)
Generalizes the Euclidean barycenter: if µ_k = δ_{x_k} then µ* = δ_{Σ_k λ_k x_k}.
[Figure: barycenter µ of µ₁, µ₂, µ₃, with distances W₂(µ_k, µ).]

74. Wasserstein Barycenters
(Same as slide 73.)

75. Wasserstein Barycenters
(Same as slide 74, adding:)
Two measures (µ₁, µ₂): McCann's displacement interpolation.

76. Wasserstein Barycenters
(Same as slide 75, with a three-measure example (µ₁, µ₂, µ₃).)

77. Wasserstein Barycenters
(Same as slide 76, adding:)
Theorem [Agueh, Carlier 2010] (for c(x, y) = ‖x − y‖²): if µ₁ does not vanish on small sets, µ* exists and is unique.

78. Displacement Interpolation

79. Displacement Interpolation
[Figure 7.2 of the course notes: displacement interpolation α_t for t = 0, 1/5, 2/5, 3/5, 4/5, 1 between discrete measures. Top: point clouds (empirical measures (α₀, α₁) with the same number of points); bottom: same but with weights.]

80. Wasserstein Barycenters
[Figure: barycenters for weights λ ∈ Σ₃, Wasserstein mean vs. L2 mean.]

81. Wasserstein Barycenters
(Same as slide 80.)

82. Regularized Barycenters
min_{(π_k)_k, µ} { Σ_k λ_k (⟨c, π_k⟩ + ε KL(π_k|π_{0,k})) ; ∀k, π_k ∈ Π(µ_k, µ) }

83. Regularized Barycenters
(Same as slide 82, adding:)
→ Need to fix a discretization grid for µ, i.e. choose (π_{0,k})_k.

84. Regularized Barycenters
(Same as slide 83, adding:)
→ Sinkhorn-like algorithm [Benamou, Carlier, Cuturi, Nenna, Peyré 2015].

85. Regularized Barycenters
(Same as slide 84, adding:)
[Solomon et al., SIGGRAPH 2015]
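
A sketch of such a Sinkhorn-like scheme via iterative Bregman projections, in the spirit of [Benamou et al. 2015] (a simplified variant of mine on a common grid, not the paper's exact pseudocode):

```python
import numpy as np

def entropic_barycenter(mus, C, eps, lam, n_iter=500):
    """mus: (K, N) histograms on a shared grid; lam: weights with sum(lam) = 1."""
    Kmat = np.exp(-C / eps)
    b = np.ones_like(mus)                    # one scaling b_k per input measure
    for _ in range(n_iter):
        a = mus / (b @ Kmat.T)               # enforce pi_k 1 = mu_k
        Kta = a @ Kmat                       # K^T a_k, for each k
        mu = np.exp(lam @ np.log(b * Kta))   # barycenter: weighted geometric mean
        b = mu[None, :] / Kta                # enforce pi_k^T 1 = mu
    return mu

# Toy usage: barycenter of two 1-D Gaussian-like histograms.
x = np.linspace(0, 1, 100)
C = (x[:, None] - x[None, :]) ** 2
m1 = np.exp(-((x - 0.2) ** 2) / 0.005); m1 /= m1.sum()
m2 = np.exp(-((x - 0.8) ** 2) / 0.005); m2 /= m2.sum()
bar = entropic_barycenter(np.stack([m1, m2]), C, eps=5e-3, lam=np.array([0.5, 0.5]))
```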

86. Barycenters of 2D Shapes

87. Barycenters of 3D Shapes

88. Barycenter on a Surface
[Figure: interpolation between µ₁ and µ₂ on a surface, weight t ∈ [0, 1].]

89. Barycenter on a Surface
(Same as slide 88, with six inputs µ₁, …, µ₆.)

90. Barycenter on a Surface
(Same as slide 89, for weights λ = (1, …, 1)/6.)

91. Color Transfer
Input images: (f, g) (chrominance components).
Input measures: µ(A) = U(f⁻¹(A)), ν(A) = U(g⁻¹(A))  (U: uniform measure on the pixel domain)

92. Color Transfer
(Same as slide 91.)

93. Color Transfer
(Same as slide 92, adding:)
[Figure: transport map T and the recolored images T ∘ f and T̃ ∘ g.]

94. Topic Models
[Rolet’16]

95. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

96. Unbalanced Transport
For (ξ, µ) ∈ M₊(X)², KL(ξ|µ) := ∫_X log(dξ/dµ) dξ + ∫_X (dµ − dξ)
WF_c(µ, ν) := min_π ⟨c, π⟩ + KL(P1♯π|µ) + KL(P2♯π|ν)   [Liero, Mielke, Savaré 2015]

97. Unbalanced Transport
(Same as slide 96, adding:)
Proposition [Liero, Mielke, Savaré 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]:
if c(x, y) = −log(cos²(min(d(x, y), π/2))) then WF_c^{1/2} is a distance on M₊(X).

98. Unbalanced Transport
(Same as slide 97, adding:)
→ "Dynamic" Benamou-Brenier formulation. [Liero, Mielke, Savaré 2015] [Kondratyev, Monsaingeon, Vorotnikov 2015] [Chizat, Schmitzer, Peyré, Vialard 2015]
[Figure: balanced OT vs. unbalanced OT.]

99. Implicit Euler Stepping
Metric space (X, d); minimize F(x) over X.
Implicit Euler step: x_{k+1} := argmin_{x ∈ X} d(x_k, x)² + τ F(x)
[Figure: iterates x_k, x_{k+1}, x_{k+2}, x_{k+3} with balls {x ; d(x_k, x) ∼ τ}; example F(x) = ‖x‖² on (X = ℝ², ‖·‖_p) for p = 1, 2, +∞.]

100. Implicit vs. Explicit Stepping
Implicit: x_{k+1} = argmin_{x ∈ X} d(x_k, x)² + τ F(x)
Explicit: x_{k+1} = argmin_{x ∈ X} d(x_k, x)² + τ ⟨∇F(x_k), x⟩
[Figure: trajectories for p = 1, 2, +∞ in both cases.]

101. Implicit Euler step in Wasserstein space [Jordan, Kinderlehrer, Otto 1998]:
µ_{t+1} = Prox^W_{τf}(µ_t) := argmin_{µ ∈ M₊(X)} W₂²(µ_t, µ) + τ f(µ)

102. (Same as slide 101, adding:)
Formal limit τ → 0: ∂_t µ = div(µ ∇(f′(µ)))

103. (Same as slide 102, adding:)
f(µ) = ∫ log(dµ/dx) dµ  ⟹  ∂_t µ = Δµ  (heat diffusion)

104. (Same as slide 103, adding:)
f(µ) = ∫ w dµ  ⟹  ∂_t µ = div(µ ∇w)
[Figure: evolutions µ_t along ∇w.]

105. (Same as slide 104, adding:)
f(µ) = 1/(m−1) ∫ (dµ/dx)^{m−1} dµ  ⟹  ∂_t µ = Δ(µ^m)  (non-linear diffusion)

106. Eulerian vs. Lagrangian Discretization

107. Lagrangian Discretization of Entropy
µ̂_n = (1/n) Σ_{i=1}^n δ_{x_i},  with µ̂_n → µ as n → +∞.
Ĥ(µ̂_n) := Σ_i log(min_{j≠i} ‖x_i − x_j‖)  →  H(µ) := ∫ log(dµ/dx (x)) dµ(x)  as n → +∞ (up to normalizing constants)
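
A direct numpy sketch of this nearest-neighbor estimator (the toy data is assumed, and the usual normalizing constants of such estimators are omitted, as on the slide):

```python
import numpy as np

def entropy_estimate(x):
    """H_hat(mu_n) = sum_i log(min_{j != i} ||x_i - x_j||) for a point cloud x."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)             # exclude j = i
    return np.sum(0.5 * np.log(d2.min(axis=1)))   # 0.5*log(d^2) = log(d)

rng = np.random.default_rng(0)
print(entropy_estimate(rng.normal(size=(1000, 2))))
```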

108. Wasserstein flow of E:
min_ρ E(ρ) := ∫ V(x)ρ(x)dx + ∫ ρ(x) log(ρ(x))dx,  with V(x) = ‖x‖²
dρ_t/dt = Δρ_t + div(ρ_t ∇V)
[Figure: evolution of ρ_t over time.]

109. Generalized Entropic Regularization
Primal: min_π ⟨dᵖ, π⟩ + f₁(P1♯π) + f₂(P2♯π) + ε KL(π|π₀)

110. Generalized Entropic Regularization
(Same as slide 109, adding:)
Dual: max_{u,v} −f₁*(−u) − f₂*(−v) − ε ⟨e^{u/ε}, K e^{v/ε}⟩
with (a, b) := (e^{u/ε}, e^{v/ε}) and π(x, y) = a(x) K(x, y) b(y).

111. Generalized Entropic Regularization
(Same as slide 110, adding:)
Block coordinate relaxation:
(I_u): max_u −f₁*(−u) − ε ⟨e^{u/ε}, K e^{v/ε}⟩
(I_v): max_v −f₂*(−v) − ε ⟨e^{v/ε}, K* e^{u/ε}⟩

112. Generalized Entropic Regularization
(Same as slide 111, adding:)
Proposition: with Prox^{KL}_{f₁/ε}(µ) := argmin_ν f₁(ν) + ε KL(ν|µ), the solutions of (I_u) and (I_v) read
a = Prox^{KL}_{f₁/ε}(Kb) / (Kb)   and   b = Prox^{KL}_{f₂/ε}(K*a) / (K*a)

113. Generalized Entropic Regularization
(Same as slide 112, adding:)
→ Only matrix-vector multiplications. → Highly parallelizable.
→ On regular grids: only convolutions! Linear-time iterations.
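
As one concrete instance (a sketch of mine, not the slides' code): for the unbalanced case f_i = ρ KL(·|µ_i), the KL prox has a closed form and the scaling updates reduce to power iterations a = (µ/(Kb))^{ρ/(ρ+ε)}, a standard derivation stated here without proof.

```python
import numpy as np

def unbalanced_sinkhorn(mu, nu, C, eps, rho, n_iter=2000):
    """Generalized Sinkhorn for f_i = rho*KL(.|mu_i); returns the coupling."""
    K = np.exp(-C / eps)
    lam = rho / (rho + eps)
    a, b = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iter):
        a = (mu / (K @ b)) ** lam      # prox step for f1 = rho*KL(.|mu)
        b = (nu / (K.T @ a)) ** lam    # prox step for f2 = rho*KL(.|nu)
    return a[:, None] * K * b[None, :]

x = np.linspace(0, 1, 60)
C = (x[:, None] - x[None, :]) ** 2
mu = np.exp(-((x - 0.3) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((x - 0.7) ** 2) / 0.01); nu = 0.5 * nu / nu.sum()  # unequal masses
pi = unbalanced_sinkhorn(mu, nu, C, eps=1e-2, rho=1.0)
print(pi.sum(axis=1).sum(), pi.sum(axis=0).sum())  # marginals need not match mu, nu
```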

114. µ_{t+1} := argmin_µ W_ε^α(µ_t, µ) + τ f(µ)
Congestion-inducing function: f(µ) = ι_{[0,κ]}(µ) + ⟨w, µ⟩   [Maury, Roudneff-Chupin, Santambrogio 2010]

115. (Same as slide 114, adding:)
Proposition: Prox^{KL}_{f/ε}(µ) = min(e^{−w/ε} µ, κ)

116. (Same as slide 115, adding:)
[Figure: evolutions along ∇w for κ = ‖µ_{t=0}‖_∞, 2‖µ_{t=0}‖_∞, 4‖µ_{t=0}‖_∞.]

117. Crowd Motion on a Surface
X = triangulated mesh.
[Figure: potential w (cos(w) displayed); evolutions for κ = ‖µ_{t=0}‖_∞ and κ = 6‖µ_{t=0}‖_∞.]

118. Crowd Motion on a Surface
(Same as slide 117.)

119. Gradient Flows: Crowd Motion with Obstacles
X = sub-domain of ℝ².
[Figure: potential w (cos(w) displayed); evolutions for κ = ‖µ_{t=0}‖_∞, 2‖µ_{t=0}‖_∞, 4‖µ_{t=0}‖_∞, 6‖µ_{t=0}‖_∞.]

120. Gradient Flows: Crowd Motion with Obstacles
(Same as slide 119.)

121. (µ_{1,t+1}, µ_{2,t+1}) := argmin_{(µ₁,µ₂)} W_ε^α(µ_{1,t}, µ₁) + W_ε^α(µ_{2,t}, µ₂) + τ f(µ₁, µ₂)

122. Wasserstein attraction:
(Same as slide 121, adding:)
f(µ₁, µ₂) = W_ε^α(µ₁, µ₂) + h₁(µ₁) + h₂(µ₂), with h_i(µ) = ⟨w, µ⟩
[Figure: evolution along ∇w.]

123. Wasserstein attraction:
(Same as slide 122, adding:)
or h_i(µ) = ι_{[0,κ]}(µ).

124. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

125. Discriminative vs. Generative Models
Generative: g_θ : Z → X, from low dimension (z) to high dimension (x).
Discriminative: d_ξ : X → Z.

126. Discriminative vs. Generative Models
(Same as slide 125, adding:)
Supervised: classification, z = class probability.
→ Learn d_ξ from labeled data (x_i, z_i)_i.

127. Discriminative vs. Generative Models
(Same as slide 126, adding:)
Un-supervised: compression, z = d_ξ(x) is a representation; generation, x = g_θ(z) is a synthesis.
→ Learn (g_θ, d_ξ) from data (x_i)_i.

128. Discriminative vs. Generative Models
(Same as slide 127, adding:)
Density fitting: g_θ({z_i}_i) ≈ {x_i}_i.   Auto-encoders: g_θ(d_ξ(x_i)) ≈ x_i.

129. Discriminative vs. Generative Models
(Same as slide 128, adding:)
Optimal transport map d_ξ.

130. Density Fitting and Generative Models
Parametric model: θ ↦ µ_θ. Observations: ν = (1/n) Σ_{i=1}^n δ_{x_i}.

131. Density Fitting and Generative Models
(Same as slide 130, adding:)
Density fitting when dµ_θ(y) = f_θ(y)dy: maximum likelihood (MLE)
min_θ KL̂(ν|µ_θ) := −Σ_i log(f_θ(x_i))

132. Density Fitting and Generative Models
(Same as slide 131, adding:)
Generative model fit: µ_θ = g_{θ,♯}ζ.
→ KL̂(µ_θ|ν) = +∞: MLE undefined. → Need a weaker metric.

133. Loss Functions for Measures
Density fitting: min_θ D(µ_θ, ν)
Optimal transport distances, for µ = (1/N) Σ_i δ_{x_i}, ν = (1/P) Σ_j δ_{y_j}:
W(µ, ν)^p := min_{T ∈ C_{µ,ν}} Σ_{i,j} T_{i,j} ‖x_i − y_j‖^p

134. Loss Functions for Measures
(Same as slide 133, adding:)
Maximum Mean Discrepancy (MMD):
‖µ − ν‖_k² := (1/N²) Σ_{i,i′} k(x_i, x_{i′}) + (1/P²) Σ_{j,j′} k(y_j, y_{j′}) − (2/(NP)) Σ_{i,j} k(x_i, y_j)
Gaussian: k(x, y) = e^{−‖x−y‖²/(2σ²)}.  Energy distance: k(x, y) = −‖x − y‖₂.

135. Loss Functions for Measures
(Same as slide 134, adding:)
Sinkhorn divergences [Cuturi 13]:
W_ε(µ, ν)^p := Σ_{i,j} T^ε_{i,j} ‖x_i − y_j‖^p
W̄_ε(µ, ν)^p := W_ε(µ, ν)^p − ½ W_ε(µ, µ)^p − ½ W_ε(ν, ν)^p

136. Loss Functions for Measures
(Same as slide 135, adding:)
Theorem [Ramdas, G. Trillos, Cuturi 17]:
W̄_ε(µ, ν)^p → W(µ, ν)^p as ε → 0, and W̄_ε(µ, ν)^p → ‖µ − ν‖_k² as ε → +∞, for k(x, y) = −‖x − y‖^p.

137. Loss Functions for Measures
(Same as slide 136, adding:)
Best of both worlds → cross-validate ε:
– Scale free (no σ, no heavy-tail kernel).
– Non-Euclidean, arbitrary ground distance.
– No curse of dimension (low sample complexity).
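
A sketch of the Sinkhorn divergence on point clouds, reusing plain Sinkhorn iterations (the uniform weights and toy inputs are my assumptions):

```python
import numpy as np

def w_eps(x, y, eps, p=2, n_iter=500):
    """W_eps(mu, nu)^p = <C, T_eps> for uniform empirical measures on x, y."""
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p
    mu, nu = np.full(len(x), 1 / len(x)), np.full(len(y), 1 / len(y))
    K = np.exp(-C / eps)
    b = np.ones_like(nu)
    for _ in range(n_iter):
        a = mu / (K @ b)
        b = nu / (K.T @ a)
    return np.sum((a[:, None] * K * b[None, :]) * C)

def sinkhorn_divergence(x, y, eps, p=2):
    """bar-W_eps = W_eps(mu,nu)^p - W_eps(mu,mu)^p/2 - W_eps(nu,nu)^p/2."""
    return (w_eps(x, y, eps, p)
            - 0.5 * w_eps(x, x, eps, p) - 0.5 * w_eps(y, y, eps, p))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(100, 2)), rng.normal(0.5, 1.0, size=(120, 2))
print(sinkhorn_divergence(x, y, eps=0.1))
```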

138. Deep Discriminative vs. Generative Models
Deep networks:
g_θ(z) = ρ(θ_K(… ρ(θ₂(ρ(θ₁(z)))) …))
d_ξ(x) = ρ(ξ_K(… ρ(ξ₂(ρ(ξ₁(x)))) …))
[Diagram: generative g_θ : z ↦ x, discriminative d_ξ : x ↦ z.]

139. Deep Discriminative vs. Generative Models
(Same as slide 138, with the diagram Z ⇄ X through g_θ and d_ξ.)

140. Examples of Image Generation
g_θ : Z → X
[Credit: arXiv:1511.06434]

141. Overview
• Measures and Histograms
• From Monge to Kantorovitch Formulations
• Entropic Regularization and Sinkhorn
• Barycenters
• Unbalanced OT and Gradient Flows
• Minimum Kantorovitch Estimators
• Gromov-Wasserstein

142. Gromov-Wasserstein
Unregistered spaces X, Y.
Inputs: {(similarity/kernel matrix, histogram)}

143. Gromov-Wasserstein
(Same as slide 142, adding:)
Def. Gromov-Wasserstein distance [Mémoli 2011] [Sturm 2012]: …

144. Gromov-Wasserstein
(Same as slide 143, adding:)
→ NP-hard in general. → Need for a fast approximate solver.

145. Gromov-Wasserstein as a Metric
[Figure: points x_i ∈ X matched to points y_j ∈ Y.]
Def. …

146. Gromov-Wasserstein as a Metric
(Same as slide 145, adding:)
Def. Isometries f on M: … ⟺ …
[Figure: isometric deformations f between X and Y.]

147. Gromov-Wasserstein as a Metric
(Same as slide 146, adding:)
Prop. [Mémoli 2011]: …
→ "bending-invariant" object recognition.

148. For Arbitrary Spaces
Metric-measure spaces (X, Y): (d_X, µ), (d_Y, ν).

149. For Arbitrary Spaces
(Same as slide 148, adding:)
Def. Gromov-Wasserstein distance [Mémoli 2011] [Sturm 2012]:
GW₂²(d_X, µ, d_Y, ν) := min_{π ∈ Π(µ,ν)} ∫_{X²×Y²} |d_X(x, x′) − d_Y(y, y′)|² dπ(x, y) dπ(x′, y′)

150. For Arbitrary Spaces
(Same as slide 149, adding:)
Prop. GW is a distance on mm-spaces up to isometries.
→ "bending-invariant" object recognition.

151. For Arbitrary Spaces
(Same as slide 150, adding:)
→ QAP: NP-hard in general. → Need for a fast approximate solver.

152. Entropic Gromov-Wasserstein
Def. Entropic Gromov-Wasserstein: …
Projected mirror descent.

153. Entropic Gromov-Wasserstein
(Same as slide 152, adding:)
Prop. For τ = 1/ε, the iteration reads:
func T = GW(C, C̄, p, q)
  initialization: …
  repeat:
    …
  until convergence.
  return T
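
A sketch filling in the elided steps with the standard square-loss iteration (each outer step builds the local cost L(C, C̄)⊗T and re-solves an entropic OT with Sinkhorn; this reconstruction is mine, valid under the assumption that T stays in Π(p, q)):

```python
import numpy as np

def sinkhorn_coupling(p, q, M, eps, n_iter=200):
    K = np.exp(-M / eps)
    b = np.ones_like(q)
    for _ in range(n_iter):
        a = p / (K @ b)
        b = q / (K.T @ a)
    return a[:, None] * K * b[None, :]

def entropic_gw(C, Cb, p, q, eps, n_outer=50):
    """Entropic GW for the square loss L(a, b) = (a - b)^2."""
    T = np.outer(p, q)                        # initialization
    const = (C**2 @ p)[:, None] + (Cb**2 @ q)[None, :]
    for _ in range(n_outer):
        M = const - 2.0 * C @ T @ Cb.T        # local cost (L(C, Cb) ⊗ T)_{ij}
        T = sinkhorn_coupling(p, q, M, eps)   # mirror-descent / Sinkhorn step
    return T

rng = np.random.default_rng(0)
x, y = rng.random((30, 2)), rng.random((40, 3))   # two unregistered spaces
C  = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
Cb = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
T = entropic_gw(C, Cb, np.full(30, 1/30), np.full(40, 1/40), eps=0.05)
```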

154. Applications of GW: Shape Analysis
Use T to define a registration between: shape ↔ shape, shape ↔ colors distribution.
[Figure 1 of the accompanying paper: entropic GW finds correspondences between a source surface and targets (a surface with similar structure, a surface with shared semantic structure, a noisy 3D point cloud, an icon, and a hand drawing); each fuzzy map computed with the same code. Excerpts: an entropically-regularized version of the GW objective from [Mémoli 2011] measures distortion of geodesic distances; the optimizer is a "fuzzy" correspondence matrix whose sharpness is controlled by the weight of the entropic regularizer; supervised matching can incorporate ground-truth matches via a sparsity stencil or symmetric weight matrices (W₀, W) that prioritize preserving selected distances. Figure 15: MDS embedding of four SHREC classes (teddies, humans, four-legged). Figure 16: recovery of a galloping horse sequence, without the ground-truth maps required by prior work. Figure 18: mapping a set of 185 images onto two shapes while preserving color similarity.]

155. Applications of GW: Shape Analysis
(Same as slide 154, adding:)
Visualization of a shape collection (X_s)_s: geodesic distances → GW distances → MDS embedding in 2-D/3-D.

156. Applications of GW: Quantum Chemistry
Regression problem.
→ Obtaining f by solving a DFT approximation is too costly.

157. Applications of GW: Quantum Chemistry
(Same as slide 156, adding:)
[Rupp et al. 2012] [Figure: matrix representation of a molecule.]

158. Applications of GW: Quantum Chemistry
(Same as slide 157, adding:)
GW-interpolation.

159. Gromov-Wasserstein Geodesics
Def. Gromov-Wasserstein geodesic: …

160. Gromov-Wasserstein Geodesics
(Same as slide 159, adding:)
Prop. [Sturm 2012]: …

161. Gromov-Wasserstein Geodesics
(Same as slide 160, adding:)
→ X × Y is not practical for most applications (need to fix the size of the geodesic embedding space).
→ Extension to more than 2 input spaces?

162. Gromov-Wasserstein Barycenters
Input: three mm-spaces (s = 1, 2, 3).
Def. GW barycenters: …

163. Gromov-Wasserstein Barycenters
(Same as slide 162, adding:)
Alternating minimization:
  initialization: C ← C₀
  repeat:
    for s = 1 to S do: minimize on T_s
    update C
  until convergence.
  return C

164. Conclusion: Toward High-dimensional OT
Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani.
[Figure: transport framework, sliced Wasserstein projection, applications; application to color transfer: source image (X), style image (Y), sliced Wasserstein projection of X onto the color statistics of Y, source image after color transfer. J. Rabin, Wasserstein Regularization]