

Wasserstein Gradient Flows of MMD Functionals with Distance Kernel and Cauchy Problems on Quantile Functions

Viktor Stein

January 22, 2025
Transcript

  1. Wasserstein Gradient Flows of MMD Functionals with Distance Kernel and Cauchy Problems on Quantile Functions. Joint work with Richard Duong (TU Berlin), Robert Beinert (TU Berlin), Johannes Hertrich (UCL), and Gabriele Steidl (TU Berlin). University of South Carolina, Columbia, 30.08.2024. Joint ACM and RTG data science seminar (Changhui Tan, Siming He, Wuchen Li).
  2. Motivation: discrepancy minimization. Problem: the target measure $\nu \in \mathcal{P}_2(\mathbb{R}^d)$ is unknown; only samples are available. Goal: recover $\nu$. Solution: minimize the metric kernel discrepancy $\mathcal{F}_\nu := \mathrm{MMD}_K(\cdot, \nu)^2 : \mathcal{P}_2(\mathbb{R}^d) \to [0, \infty)$ (which can be estimated from samples) by finding a curve of measures $(\gamma_t)_{t>0} \subset \mathcal{P}_2(\mathbb{R}^d)$ along which $\mathcal{F}_\nu$ decreases "the fastest". One way to construct $(\gamma_t)_{t>0}$: Wasserstein gradient flows. Different kernels $K$ lead to very different behavior of $(\gamma_t)_{t>0}$. Aim of this preprint: study the behavior of $(\gamma_t)_{t>0}$ for the irregular kernel $K(x, y) := -|x - y|$ in dimension $d = 1$. [Illustration © Francis Bach.]
    Viktor Stein W2 Gradient Flows of MMD functionals with Distance Kernel August 30th, 2024 2 / 22
  3. Outline.
    1. Optimal transport and Wasserstein gradient flow [illustration © Petr Mokrov].
    2. Maximal monotone inclusions on Hilbert spaces: $(I + \lambda \partial F)^{-1}$.
    3. Negative distance kernel, maximum mean discrepancy [halftoning example: https://www.mia.uni-saarland.de/Research/IP_Halftoning.shtml].
    4. Wasserstein gradient flow of the MMD with negative distance kernel.
    5. Invariant subsets and smoothing properties: $\gamma_0 = \delta_x$, $\gamma_t \sim U[a_t, b_t]$ for $t > 0$.
    6. Numerical results.
    [Figures: density plots of initial and target measures under the explicit and implicit schemes; example trajectory of $\mu_0$.]
  4. Probability measures on the real line: CDF and quantile function. The cumulative distribution functions (CDFs) of $\nu \in \mathcal{P}(\mathbb{R})$ are given by
    $R_\nu^+ : \mathbb{R} \to [0, 1], \quad R_\nu^+(x) := \nu\big((-\infty, x]\big),$
    $R_\nu^- : \mathbb{R} \to [0, 1], \quad R_\nu^-(x) := \nu\big((-\infty, x)\big),$
    and its quantile function (the generalized inverse of $R_\nu^+$) by
    $Q_\nu : (0, 1) \to \mathbb{R}, \quad Q_\nu(s) := \min\{x \in \mathbb{R} : R_\nu^+(x) \ge s\}.$
    [Figure, left to right: $R_\nu^+$, $R_\nu^-$, $Q_\nu$, for $\nu = \sum_{k=1}^4 w_k \delta_{x_k}$, $W_j := \sum_{k=1}^j w_k$.]
  5. The Wasserstein-2 space on the real line. Consider the subset of probability measures $\mathcal{P}(\mathbb{R})$,
    $\mathcal{P}_2(\mathbb{R}) := \big\{\mu \in \mathcal{P}(\mathbb{R}) : \int_{\mathbb{R}} x^2 \, d\mu(x) < \infty\big\}.$
    On $\mathcal{P}_2(\mathbb{R})$, the Wasserstein-2 metric is
    $W_2(\mu, \nu)^2 = \min_{\pi \in \Gamma(\mu, \nu)} \int_{\mathbb{R} \times \mathbb{R}} (x - y)^2 \, d\pi(x, y), \quad \mu, \nu \in \mathcal{P}_2(\mathbb{R}),$
    $\Gamma(\mu, \nu) := \{\pi \in \mathcal{P}(\mathbb{R} \times \mathbb{R}) : (P_1)_\# \pi = \mu, \ (P_2)_\# \pi = \nu\},$
    with the projections $P_i(x_1, x_2) := x_i$. The push-forward $\#$ acts as $(f_\# \sigma)(A) := \sigma(f^{-1}(A))$ for all measurable sets $A \subseteq \mathbb{R}$. The optimal plan $\hat\pi$ is unique (in 1D) and realizes the monotone rearrangement. [Push-forward illustration © Gabriel Peyré, dsweber2.github.io/Optimal-Transport-Information-Geometry/; monotone rearrangement example: juliaoptimaltransport.github.io/OptimalTransport.jl/dev/examples/OneDimension/]
  6. Manifold-like structure of $\mathcal{P}_2(\mathbb{R})$: geodesics and geodesic convexity. The geodesic (the shortest constant-speed path) between $\mu$ and $\nu$ is
    $\gamma_t := \big((1 - t) P_1 + t P_2\big)_\# \hat\pi, \quad t \in [0, 1]. \quad (1)$
    Vertical ($L^2$: $g_t := (1 - t) f_\mu + t f_\nu$) vs. horizontal ($W_2$) mass displacement [illustration © Anna Korba].
    Definition ($W_2$-geodesic convexity). $\mathcal{F} : \mathcal{P}_2(\mathbb{R}) \to \mathbb{R}$ is convex along geodesics if
    $\mathcal{F}(\gamma_t) \le (1 - t)\, \mathcal{F}(\mu) + t\, \mathcal{F}(\nu) \quad \forall \mu, \nu \in \mathcal{P}_2(\mathbb{R}),$
    with $\gamma_t$ from (1). [Figures: geodesic on a manifold, dx.doi.org/10.1049/iet-rsn.2020.0185; $W_2$ geodesic © Lénaïc Chizat.]
  7. Isometric embedding via quantile functions. Theorem (isometric embedding of $(\mathcal{P}_2(\mathbb{R}), W_2)$). Let $\mathcal{C}(0, 1) := \{Q_\mu : \mu \in \mathcal{P}_2(\mathbb{R})\} \subset L^2(0, 1)$ be the set of quantile functions. The map $\mathcal{P}_2(\mathbb{R}) \to \mathcal{C}(0, 1)$, $\mu \mapsto Q_\mu$, is an isometric isomorphism with inverse $Q \mapsto Q_\# \Lambda_{(0,1)}$, where $\Lambda_{(0,1)}$ is the Lebesgue measure on $(0, 1)$. [Illustration: dx.doi.org/10.1155/IJMMS.2005.2241]
    KEY IDEA: instead of working with $\mathcal{F} : \mathcal{P}_2(\mathbb{R}) \to (-\infty, \infty]$, find $F : L^2(0, 1) \to (-\infty, \infty]$ satisfying $F(Q_\mu) = \mathcal{F}(\mu)$.
  8. From Euclidean to metric gradient flows. Given: a Hilbert space $H$ and an energy functional $f : H \to (-\infty, \infty]$ (convex, bounded below). Idea: to minimize $f$, find an absolutely continuous curve $t \mapsto x(t)$ such that $t \mapsto f(x(t))$ decreases "the fastest":
    $\frac{d}{dt} x(t) = -\nabla f(x(t)).$
    If $f$ is not differentiable, there might be more than one descent direction. [Illustration © Francis Bach.]
  9. Subdifferentials of convex functions are monotone. Definition. The subdifferential of a convex function $F : H \to \mathbb{R}$ is
    $\partial F(u) := \big\{v \in H : F(w) \ge F(u) + \langle v, w - u \rangle \ \forall w \in H\big\}.$
    Theorem (properties of the resolvent). Let $F : H \to \mathbb{R}$ be convex and lower semicontinuous, $F \not\equiv \infty$. Then $\partial F$ is maximal monotone; hence for all $\varepsilon > 0$ the resolvent $J_\varepsilon^{\partial F} := (I + \varepsilon \partial F)^{-1} : H \to H$ is single-valued. [Illustration © Pontus Giselsson: $\partial F$, $I + \lambda \partial F$, $(I + \lambda \partial F)^{-1}$, where $F := |\cdot|$.]
  10. Maximal monotone inclusions on Hilbert spaces. This theorem tells us that gradient flows in Hilbert spaces exist and are unique. Theorem (existence and regularity of strong solutions to Cauchy problems; Brezis '67). Let $f : H \to \mathbb{R}$ be such that $\partial f : H \to 2^H$ is maximal monotone and $g_0 \in \mathrm{dom}(\partial f)$. Then there exists a unique $g : [0, \infty) \to H$ such that
    • $g(0) = g_0$, $\frac{dg}{dt}(t) \in -\partial f(g(t))$ for almost all $t > 0$, and $g(t) \in \mathrm{dom}(\partial f)$ for all $t > 0$;
    • $g$ is Lipschitz continuous on $[0, \infty)$;
    • $g$ is given by the exponential formula $g(t) = \lim_{n \to \infty} \big(J_{t/n}^{\partial f}\big)^n (g_0)$, uniformly in $t$ on compact intervals.
    For $\mathcal{P}_2(\mathbb{R})$ instead of $H$: what are the analogs of the tangent vector $\frac{d}{dt} x(t)$ and the subdifferential $\partial f$?
  11. Wasserstein gradient flows in one dimension. Definition (Fréchet subdifferential in Wasserstein space). The (reduced) Fréchet subdifferential of $\mathcal{F} : \mathcal{P}_2(\mathbb{R}) \to \mathbb{R}$ at $\mu$ is
    $\partial \mathcal{F}(\mu) := \Big\{\xi \in L^2(\mathbb{R}; \mu) : \mathcal{F}(\nu) - \mathcal{F}(\mu) \ge \int_{\mathbb{R} \times \mathbb{R}} \xi(x_1)\,(x_2 - x_1) \, d\hat\pi(x_1, x_2) + o\big(W_2(\mu, \nu)\big) \ \forall \nu \in \mathcal{P}_2(\mathbb{R})\Big\}.$
    A curve $\gamma : (0, \infty) \to \mathcal{P}_2(\mathbb{R})$ is absolutely continuous if there exists an $L^2$-Borel velocity field $(v_t : \mathbb{R} \to \mathbb{R})_{t>0}$ such that
    $\partial_t \gamma_t + \nabla \cdot (v_t \gamma_t) = 0, \quad (t, x) \in (0, \infty) \times \mathbb{R}, \text{ weakly.} \quad \text{(Continuity equation)}$
    Definition (Wasserstein gradient flow). An absolutely continuous curve $\gamma : (0, \infty) \to \mathcal{P}_2(\mathbb{R})$ with velocity field $\big(v_t \in T_{\gamma_t} \mathcal{P}_2(\mathbb{R})\big)_{t>0}$ is a Wasserstein gradient flow with respect to $\mathcal{F} : \mathcal{P}_2(\mathbb{R}) \to \mathbb{R}$ if $v_t \in -\partial \mathcal{F}(\gamma_t)$ for a.e. $t > 0$. [Tangent-vector illustration: https://personal.math.ubc.ca/~CLP/CLP3/clp_3_mc/sec_curves.html]
  12. Fundamental theorem of Wasserstein gradient flows. Theorem (Ambrosio, Gigli, Savaré (2005)). Let $\mathcal{F} : \mathcal{P}_2(\mathbb{R}) \to \mathbb{R}$ be bounded from below, lower semicontinuous, and geodesically convex.
    Existence and uniqueness. There exists a unique Wasserstein gradient flow $\gamma : (0, \infty) \to \mathcal{P}_2(\mathbb{R})$ with respect to $\mathcal{F}$ starting at $\gamma(0+) = \mu_0 \in \mathcal{P}_2(\mathbb{R})$.
    Approximation scheme. The piecewise constant curves constructed from the iterates
    $\mu_{n+1} := \operatorname*{argmin}_{\mu \in \mathcal{P}_2(\mathbb{R})} \mathcal{F}(\mu) + \frac{1}{2\tau} W_2^2(\mu_n, \mu), \quad \tau > 0, \quad \text{(minimizing movement scheme)}$
    i.e., $\gamma^\tau$ defined by $\gamma^\tau|_{(n\tau, (n+1)\tau]} := \mu_n$, $n \in \mathbb{N}$, converge locally uniformly to $\gamma$ as $\tau \downarrow 0$.
    Convergence speed. If $\bar\mu$ is a minimizer of $\mathcal{F}$, then
    $\mathcal{F}(\gamma_t) - \mathcal{F}(\bar\mu) \le \frac{1}{2t} W_2^2(\mu_0, \bar\mu). \quad \text{(sublinear convergence rate)}$
  13. Reformulation of the WGF as a Cauchy problem. We establish a correspondence between $L^2(0, 1)$-gradient flows of $F$ and Wasserstein gradient flows of $\mathcal{F}$.
    Theorem (quantile function reformulation [DSBHS24]). Let $F : L^2(0, 1) \to (-\infty, \infty]$ be convex and lsc, $F \not\equiv \infty$, with $F(Q_\mu) = \mathcal{F}(\mu)$ for all $\mu \in \mathcal{P}_2(\mathbb{R})$. Assume $J_\varepsilon^{\partial F}$ maps $\mathcal{C}(0, 1)$ into itself for all $\varepsilon > 0$. Then, for every initial datum $g_0 \in \mathcal{C}(0, 1) \cap \mathrm{dom}(\partial F)$,
    $\partial_t g(t) + \partial F(g(t)) \ni 0, \ t \in (0, \infty), \qquad g(0) = g_0 \quad \text{(Cauchy problem)}$
    has a unique strong solution $g$. The curve $\gamma_t := (g(t))_\# \Lambda_{(0,1)}$ has quantile functions $Q_{\gamma_t} = g(t)$ and is a Wasserstein gradient flow of $\mathcal{F}$ with $\gamma(0+) = (g_0)_\# \Lambda_{(0,1)}$.
  14. Maximum mean discrepancy. Consider the negative distance kernel $K(x, y) := -|x - y|$. $K$ is only conditionally positive definite. Motivation: electrostatic principles, interacting species, dithering [doi.org/10.1137/100790197].
    Definition (MMD). The maximum mean discrepancy with respect to $K$ is $\mathcal{P}_2(\mathbb{R}) \times \mathcal{P}_2(\mathbb{R}) \to [0, \infty)$,
    $(\mu, \nu) \mapsto \mathrm{MMD}_K(\mu, \nu)^2 = \int_{\mathbb{R} \times \mathbb{R}} K(x, y) \, d(\mu - \nu)(x) \, d(\mu - \nu)(y),$
    and, up to an additive constant in $\mu$, $\frac{1}{2} \mathrm{MMD}_K(\mu, \nu)^2$ equals
    $\mathcal{F}_\nu(\mu) := \underbrace{-\frac{1}{2} \int_{\mathbb{R} \times \mathbb{R}} |x - y| \, d\mu(x) \, d\mu(y)}_{\text{interaction}} + \underbrace{\int_{\mathbb{R} \times \mathbb{R}} |x - y| \, d\mu(x) \, d\nu(y)}_{\text{potential}}. \quad (2)$
    $\mathrm{MMD}_K^2$ is a metric, also known as the energy distance.
  15. Properties of $F_\nu$. We now apply this theorem to the MMD with the negative distance kernel, i.e., to $\mathcal{F}_\nu := \frac{1}{2} \mathrm{MMD}_K(\cdot, \nu)^2$.
    Lemma (properties of $F_\nu$). The functional
    $F_\nu : L^2(0, 1) \to \mathbb{R}, \quad u \mapsto \int_0^1 \Big( (1 - 2s)\big(u(s) - Q_\nu(s)\big) + \int_0^1 |u(s) - Q_\nu(t)| \, dt \Big) ds$
    is convex and continuous, we have $F_\nu(Q_\mu) = \mathcal{F}_\nu(\mu)$ for all $\mu \in \mathcal{P}_2(\mathbb{R})$,
    $\partial F_\nu(u) = \big\{f \in L^2(0, 1) : f(s) \in 2\big[R_\nu^-(u(s)), R_\nu^+(u(s))\big] - 2s \text{ for a.e. } s \in (0, 1)\big\}, \quad u \in L^2(0, 1),$
    and $J_\varepsilon^{\partial F_\nu}$ maps $\mathcal{C}(0, 1)$ into itself for all $\varepsilon > 0$. Proof: by elementary means; nothing fancy happens here.
  16. Smoothing properties and invariant subsets. The lower Lipschitz constant of $g \in \mathcal{C}(0, 1)$ is
    $L_{\mathrm{low}}(g) := \max\Big\{ L \ge 0 : \frac{g(s_1) - g(s_2)}{s_1 - s_2} \ge L \ \forall s_1, s_2 \in (0, 1) \Big\} \ge 0.$
    If $\mu = \delta_x$, then $Q_\mu \equiv x$ and $L_{\mathrm{low}}(Q_\mu) = 0$.
    Theorem (time evolution of the CDF's lower Lipschitz constants). Let $\nu \in \mathcal{P}_2(\mathbb{R})$ with $L_{\mathrm{low}}(Q_\nu) > 0$, and $g_0 = Q_{\mu_0} \in \mathcal{C}(0, 1)$. We have
    $\mathrm{Lip}(R_{\gamma_t}) \le \Big( L_{\mathrm{low}}(g_0) \, e^{-\frac{2t}{L_{\mathrm{low}}(Q_\nu)}} + L_{\mathrm{low}}(Q_\nu) \big(1 - e^{-\frac{2t}{L_{\mathrm{low}}(Q_\nu)}}\big) \Big)^{-1} < \infty.$
    Theorem (continuity is preserved & monotonicity of the support). Let $\nu \in \mathcal{P}_2(\mathbb{R})$, $g_0$ be continuous, and $g$ the solution of the Cauchy problem starting in $g_0$.
    • $g(t)$ is continuous for all $t \ge 0$.
    • The ranges fulfill $g(t_1)\big((0, 1)\big) \subseteq g(t_2)\big((0, 1)\big)$ for all $0 \le t_1 \le t_2$.
  17. Explicit formula for discrete $\nu$. Corollary (point measure target). Let $\nu := \sum_{j=1}^n w_j \delta_{x_j}$. Then the $\mathcal{F}_\nu$ flow is given by
    $[g(t)](s) := \begin{cases} Q_{\mu_0}(s) + 2 (s - R_{s,0})\, t, & t \in [t_{s,0}, t_{s,1}), \\ x_{s,j} + 2 (s - R_{s,j}) (t - t_{s,j}), & t \in [t_{s,j}, t_{s,j+1}), \\ Q_\nu(s), & t \ge t_{s,|\ell_s - k_s|}, \end{cases}$
    where $t_{s,0} := 0$, $t_{s,1} := \frac{x_{s,1} - Q_{\mu_0}(s)}{2(s - R_{s,0})}$, $t_{s,j+1} := t_{s,j} + \frac{x_{s,j+1} - x_{s,j}}{2(s - R_{s,j})}$, and, in the cases $Q_{\mu_0}(s) \le Q_\nu(s)$ resp. $Q_{\mu_0}(s) \ge Q_\nu(s)$:
    $\ell_s$: $W_{\ell_s - 1} < s < W_{\ell_s}$ (both cases);
    $k_s$: $x_{k_s} \le Q_{\mu_0}(s) < x_{k_s + 1}$ resp. $x_{k_s - 1} < Q_{\mu_0}(s) \le x_{k_s}$;
    $x_{s,j}$: $x_{k_s + j}$ resp. $x_{k_s - j}$, for $j \le |\ell_s - k_s|$;
    $R_{s,j}$: $W_{k_s + j}$ resp. $W_{k_s - j - 1}$, for $j \le |\ell_s - k_s| - 1$.
    [Figure: example trajectory for a discrete initial measure $\mu_0$.]
  18. Numerical experiments: implicit Euler (backward) scheme. Let $\tau > 0$. The minimizing movement (or JKO) scheme,
    $\mu_{n+1} := \operatorname*{argmin}_{\mu \in \mathcal{P}_2(\mathbb{R})} \mathcal{F}_\nu(\mu) + \frac{1}{2\tau} W_2^2(\mu_n, \mu),$
    can be rewritten using the isometry $\mathcal{P}_2(\mathbb{R}) \to \mathcal{C}(0, 1)$ as
    $g_{n+1} = \operatorname*{argmin}_{g \in \mathcal{C}(0, 1)} F_\nu(g) + \frac{1}{2\tau} \int_0^1 |g - g_n|^2 \, ds = (I + \tau \partial F_\nu)^{-1}(g_n)$
    (using $F_\nu \in \Gamma_0(L^2(0, 1))$), which is equivalent to
    $g_n(s) + 2\tau s \in g_{n+1}(s) + 2\tau \big[ R_\nu^-(g_{n+1}(s)), R_\nu^+(g_{n+1}(s)) \big] \quad (3)$
    for all $s \in (0, 1)$. [Figure: implicit Euler step visualized via the graphs of $g_n + 2\tau\,\mathrm{id}$ and $\mathrm{id} + 2\tau [R_\nu^-, R_\nu^+]$.]
    $(g_n)_\# \Lambda_{(0,1)}$ approximates the WGF for $t \in (n\tau, (n+1)\tau]$. For fixed $\tau > 0$, $(g_n)_{n \in \mathbb{N}} \to Q_\nu$ weakly in $L^2(0, 1)$ and $\mu_n := (g_n)_\# \Lambda_{(0,1)} \to \nu$ narrowly.
  19. Numerical experiments: explicit Euler (forward) scheme. If $R_\nu^+ = R_\nu^- =: R_\nu$, we can also use the explicit Euler discretization
    $g_{n+1} = g_n - \tau \nabla F_\nu(g_n) = g_n - 2\tau (R_\nu \circ g_n - \mathrm{id}).$
    Advantage: we don't have to solve an inclusion in each step. Disadvantage: weaker convergence guarantees; might not preserve $\mathcal{C}(0, 1)$ (iterates not monotone).
    [Figure: densities of the iterates for $\mu_0 = \mathcal{N}(-5, 1)$, $\nu = \mathcal{N}(5, 1)$: initial, target, explicit, implicit.]
  20. Further work.
    • Extension to higher dimensions; the Lagrangian reformulation then involves the flow map.
    • Deterministic particle approximations and mean-field limits.
    • Convergence of the explicit scheme; distance between implicit and explicit iterates.
    • Generalize Lipschitz properties to Hölder properties.
    • Apply similar techniques to non-convex functionals, then regularize.
  21. Conclusion.
    • Reformulation as a maximal monotone inclusion Cauchy problem in $L^2(0, 1)$ via quantile functions.
    • Comprehensive description of the solutions' behavior; qualitative description of instantaneous measure-to-$L^\infty$ regularization.
    • Implicit Euler is simple.
  22. Thank you for your attention! I am happy to take any questions. Paper link: https://arxiv.org/abs/2408.07498. My website: viktorajstein.github.io. [AGS08, Bre73, CL71, JKO98]
  23. References I.
    [AGS08] L. Ambrosio, N. Gigli, and G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures, 2nd ed., Lectures in Mathematics ETH Zürich, Birkhäuser, Basel, 2008.
    [Bre73] H. Brezis, Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert, North-Holland Mathematics Studies, 1973 (French).
    [CL71] M. G. Crandall and T. M. Liggett, Generation of semi-groups of nonlinear transformations on general Banach spaces, American Journal of Mathematics 93 (1971), no. 2, 265-298.
    [JKO98] R. Jordan, D. Kinderlehrer, and F. Otto, The variational formulation of the Fokker-Planck equation, SIAM Journal on Mathematical Analysis 29 (1998), no. 1, 1-17.
  24. Shameless plug: other works. Interpolating between OT and KL-regularized OT using Rényi divergences. The Rényi divergence (neither an $f$-divergence nor a Bregman divergence), for $\alpha \in (0, 1)$:
    $R_\alpha(\mu \mid \nu) := \frac{1}{\alpha - 1} \ln \int_X \Big(\frac{d\mu}{d\tau}\Big)^{\alpha} \Big(\frac{d\nu}{d\tau}\Big)^{1 - \alpha} d\tau,$
    $\mathrm{OT}_{\varepsilon,\alpha}(\mu, \nu) := \min_{\pi \in \Pi(\mu, \nu)} \langle c, \pi \rangle + \varepsilon R_\alpha(\pi \mid \mu \otimes \nu)$
    is a metric, where $\varepsilon > 0$, $\mu, \nu \in \mathcal{P}(X)$, $X$ compact. It interpolates:
    $\mathrm{OT}(\mu, \nu) \xleftarrow{\ \alpha \searrow 0 \text{ or } \varepsilon \to 0\ } \mathrm{OT}_{\varepsilon,\alpha}(\mu, \nu) \xrightarrow{\ \alpha \nearrow 1\ } \mathrm{OT}_\varepsilon^{\mathrm{KL}}(\mu, \nu).$
    In the works: the debiased Rényi-Sinkhorn divergence $\mathrm{OT}_{\varepsilon,\alpha}(\mu, \nu) - \frac{1}{2} \mathrm{OT}_{\varepsilon,\alpha}(\mu, \mu) - \frac{1}{2} \mathrm{OT}_{\varepsilon,\alpha}(\nu, \nu)$.