Slide 1

Slide 1 text

Interpolating between Optimal Transport and KL-Regularized Optimal Transport with Rényi Divergences. Joint work with Jonas Bresch, TU Berlin. Graduate Colloquium (Alec Helm, Jonah Klein), University of South Carolina, Columbia, September 12, 2024.

Slide 2

Slide 2 text

Motivation — Deficiencies of KL-regularized OT
Optimal transport (OT) ⇝ distance on probability measures via a transport plan. Problem: O(N³) cost for N samples.
Solution: entropic OT (Cuturi, NeurIPS ’13): add ε times a KL regularizer to the OT problem for ε > 0. Sinkhorn algorithm ⇝ O(N^{1+1/d} ln N).
Problem in practice: ε must be very small to get an accurate plan, but this ⇝ numerical instabilities.
Our solution: add ε times a different (= α-Rényi) regularizer and let α ↘ 0 instead of ε ↘ 0.
Viktor Stein, "Interpolating between OT and KL-reg. OT with Rényi divergences", September 12th, 2024.

Slide 3

Slide 3 text

Outline
1. Tsallis divergence and α-Rényi divergence
2. Optimal transport and its regularization
3. Rényi-regularized OT
4. Interpolation properties
5. Solving Rényi-regularized OT with mirror descent
6. Numerical results
[Slide graphics: portraits of Tsallis and Rényi (https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Constantino_Tsallis_February_2010.jpg/800px-Constantino_Tsallis_February_2010.jpg, https://repository.aip.org/islandora/object/nbla:310720), a transport-polytope sketch (dsweber2.github.io/Optimal-Transport-Information-Geometry/), an interpolation diagram relating OT, OTε,α and OTε, and heatmaps comparing KL-, Rényi- and Tsallis-regularized plans with the EMD plan.]

Slide 4

Slide 4 text

q-Tsallis divergence and α-Rényi divergence
Definition (α-Rényi divergence). The α-Rényi divergence of order α ∈ (0, 1) is
Rα : P(X) × P(X) → [0, ∞], (µ | ν) ↦ 1/(α − 1) · ln( ∫_X (ρµ(x)/ρν(x))^α dν(x) ),
where for σ ∈ P(X), ρσ is the density of σ with respect to ½(µ + ν), and ln(0) := −∞.
Muzellec et al. (AAAI 2017) examine Tsallis-regularized OT.
Definition (q-Tsallis divergence). The q-Tsallis divergence of order q > 0, q ≠ 1, is
Tq = 1/(q − 1) · [exp((q − 1) Rq) − 1] : P(X) × P(X) → [0, ∞], (µ | ν) ↦ 1/(q − 1) · ( ∫_X (ρµ(x)/ρν(x))^q dν(x) − 1 ).
Tsallis = first-order approximation of Rényi, since ln(y) ≈ y − 1 (first-order Taylor expansion at y = 1).
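For discrete measures on a common support, both divergences reduce to finite sums. A minimal numpy sketch (function names and example vectors are my own, not from the talk) that checks the exp-transform relation Tq = [exp((q − 1)Rq) − 1]/(q − 1) and the KL limit α ↗ 1:

```python
import numpy as np

def renyi(mu, nu, alpha):
    # R_alpha(mu | nu) = 1/(alpha - 1) * ln( sum_i mu_i^alpha * nu_i^(1 - alpha) )
    return np.log(np.sum(mu**alpha * nu**(1.0 - alpha))) / (alpha - 1.0)

def tsallis(mu, nu, q):
    # T_q(mu | nu) = 1/(q - 1) * ( sum_i mu_i^q * nu_i^(1 - q) - 1 )
    return (np.sum(mu**q * nu**(1.0 - q)) - 1.0) / (q - 1.0)

def kl(mu, nu):
    return np.sum(mu * np.log(mu / nu))

mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.4, 0.4, 0.2])

# Tsallis is an exp-transform of Renyi of the same order (exact identity)
q = 0.5
assert np.isclose(tsallis(mu, nu, q),
                  (np.exp((q - 1) * renyi(mu, nu, q)) - 1) / (q - 1))

# R_alpha -> KL pointwise as alpha -> 1
assert np.isclose(renyi(mu, nu, 1 - 1e-7), kl(mu, nu), atol=1e-5)
```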

Slide 5

Slide 5 text

Rényi versus Tsallis
[Plots: the Rényi divergence and the Tsallis divergence from (p, 1 − p) to (0.25, 0.75) as functions of p ∈ [0, 1], for several orders.]
Theorem (Properties of the Rényi divergence).
• Divergence property: Rα(µ | ν) ≥ 0, and Rα(µ | ν) = 0 if and only if µ = ν.
• Rα is nondecreasing and continuous in α ∈ [0, 1], with limα↗1 Rα = KL pointwise.
• Rα is jointly convex and jointly weakly lower semicontinuous for α ∈ (0, 1].
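The claims of the theorem can be sanity-checked numerically in the discrete case; this sketch (example vectors are my own) verifies the divergence property and the monotonicity in α, with KL as the upper endpoint:

```python
import numpy as np

def renyi(mu, nu, alpha):
    # discrete alpha-Renyi divergence
    return np.log(np.sum(mu**alpha * nu**(1.0 - alpha))) / (alpha - 1.0)

def kl(mu, nu):
    return np.sum(mu * np.log(mu / nu))

mu = np.array([0.1, 0.6, 0.3])
nu = np.array([0.3, 0.3, 0.4])

# divergence property: zero at mu = mu, strictly positive for mu != nu
assert np.isclose(renyi(mu, mu, 0.5), 0.0)
assert renyi(mu, nu, 0.5) > 0

# nondecreasing in alpha on (0, 1), bounded above by the KL limit
alphas = np.linspace(0.05, 0.95, 19)
vals = [renyi(mu, nu, a) for a in alphas]
assert all(v1 <= v2 + 1e-12 for v1, v2 in zip(vals, vals[1:]))
assert vals[-1] <= kl(mu, nu) + 1e-12
```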

Slide 6

Slide 6 text

(Outline slide, repeated; see Slide 3.)

Slide 7

Slide 7 text

Wasserstein-p metric space
Let (X, d) be a metric space with d lower semicontinuous, let p ∈ [1, ∞), and let P(X) be the set of probability measures on X. Define
Pp(X) := { µ ∈ P(X) : ∫_X d(x, x₀)^p dµ(x) < ∞ }, for some x₀ ∈ X.
On Pp(X), the Wasserstein-p metric is OT(µ, ν)^{1/p}, where
OT(µ, ν) := min_{π ∈ Π(µ,ν)} ∫_{X×X} d(x, y)^p dπ(x, y), µ, ν ∈ Pp(X),
and the transport polytope is
Π(µ, ν) := { π ∈ P(X × X) : π(A × X) = µ(A), π(X × A) = ν(A) for all measurable A }.
(Figure: https://dsweber2.github.io/Optimal-Transport-Information-Geometry/)
The product measure satisfies µ ⊗ ν ∈ Π(µ, ν). Notation: ⟨f, µ⟩ := ∫_X f(x) dµ(x), so we can write OT(µ, ν) = min{ ⟨d^p, π⟩ : π ∈ Π(µ, ν) }.
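In the discrete case the minimization over the transport polytope is a linear program. A small scipy sketch (the helper `ot_lp` and the two-point example are my own illustration, not the talk's code):

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(r, c, M):
    """Solve min <M, P> over the transport polytope Pi(r, c) as a linear program."""
    n, m = M.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                     # row marginals: sum_j P_ij = r_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                     # column marginals: sum_i P_ij = c_j
        A_eq[n + j, j::m] = 1.0
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=np.concatenate([r, c]),
                  bounds=(0, None))
    return res.fun, res.x.reshape(n, m)

# two points on a line; all mass moves from x = 0 to x = 1, so the W_1 cost is 1
x = np.array([0.0, 1.0])
M = np.abs(x[:, None] - x[None, :])        # cost d(x_i, x_j)^p with p = 1
cost, P = ot_lp(np.array([1.0, 0.0]), np.array([0.0, 1.0]), M)
assert np.isclose(cost, 1.0)
```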

Slide 8

Slide 8 text

Cuturi’s Entropic Optimal Transport
Regularizer: the Kullback-Leibler divergence
KL(· | µ ⊗ ν) : Π(µ, ν) → [0, ∞), π ↦ ∫_{X×X} ln( dπ/d(µ ⊗ ν) (x, y) ) dπ(x, y).
KL-regularized OT:
OTε(µ, ν) := min_{π ∈ Π(µ,ν)} ⟨d^p, π⟩ + ε KL(π | µ ⊗ ν)
= max_{f,g ∈ C(X)} ⟨ f ⊕ g − ε exp( (f ⊕ g − d^p)/ε ), µ ⊗ ν ⟩,
where (f ⊕ g)(x, y) := f(x) + g(y) for f, g ∈ C(X). Here, c = d^p.
Primal-dual relation: π̂ε = exp( (f̂ ⊕ ĝ − d^p)/ε ) · (µ ⊗ ν). (Figure © G. Peyré, M. Cuturi, 2019.)
Limits:
argmin{ KL(π | µ ⊗ ν) : ⟨d^p, π⟩ = OT(µ, ν) } ←(ε↘0)— π̂ε —(ε→∞)→ µ ⊗ ν,
OT(µ, ν) ←(ε↘0)— OTε(µ, ν) —(ε→∞)→ ⟨d^p, µ ⊗ ν⟩.
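The primal-dual relation can be checked in the finite setting: the Sinkhorn scaling vectors encode the dual potentials via f = ε ln(u ⊘ r), g = ε ln(v ⊘ c), taken relative to the product measure r c^T. A hedged sketch (this particular normalization of the potentials is my own bookkeeping choice):

```python
import numpy as np

def sinkhorn_potentials(r, c, M, eps, n_iter=1000):
    """Sinkhorn scalings u, v; the dual potentials relative to the product
    coupling r c^T are f = eps*ln(u/r), g = eps*ln(v/c)."""
    K = np.exp(-M / eps)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)
        u = r / (K @ v)
    P = u[:, None] * K * v[None, :]
    return P, eps * np.log(u / r), eps * np.log(v / c)

x = np.linspace(0.0, 1.0, 6)
M = np.abs(x[:, None] - x[None, :])
r = np.full(6, 1 / 6)
c = np.full(6, 1 / 6)
eps = 0.2
P, f, g = sinkhorn_potentials(r, c, M, eps)

# primal-dual relation: P = exp((f + g - M)/eps) * (r c^T), entrywise
assert np.allclose(P, np.exp((f[:, None] + g[None, :] - M) / eps) * np.outer(r, c))
# after convergence both marginals are matched
assert np.allclose(P.sum(axis=1), r) and np.allclose(P.sum(axis=0), c)
```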

Slide 9

Slide 9 text

Discretization and Sinkhorn algorithm
Discretize X ≈ (x_i)_{i=1}^N; then µ, ν ∈ P(X) become vectors
r := (µ(x_i))_{i=1}^N, c := (ν(x_i))_{i=1}^N ∈ Σ_N, where Σ_N := { x ∈ [0, 1]^N : Σ_{i=1}^N x_i = 1 }.
Cost matrix: M := ( d(x_i, x_j)^p )_{i,j=1}^N.
Transport polytope: Π(r, c) := { P ∈ Σ_{N×N} : P 1_N = r, P^T 1_N = c }. (Figure © T. Vayer.)
The optimal entropic transport plan is the KL-projection of the Gibbs kernel:
P̂ε = argmin_{P ∈ Π(r,c)} KL( P | exp(−M/ε) ).
The Sinkhorn algorithm finds this projection via matrix scaling.
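The matrix-scaling iteration alternately rescales rows and columns of the Gibbs kernel. A minimal sketch (marginals and grid are my own illustrative choices; for very small ε a log-domain implementation is needed in practice):

```python
import numpy as np

def sinkhorn(r, c, M, eps, n_iter=1000):
    """KL-projection of the Gibbs kernel exp(-M/eps) onto Pi(r, c) by matrix scaling."""
    K = np.exp(-M / eps)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)   # rescale to fix the column marginals
        u = r / (K @ v)     # rescale to fix the row marginals
    return u[:, None] * K * v[None, :]

x = np.linspace(0.0, 1.0, 5)
M = (x[:, None] - x[None, :])**2                  # squared-distance cost (p = 2)
r = np.full(5, 0.2)
c = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
P = sinkhorn(r, c, M, eps=0.1)

# the returned plan lies (numerically) in the transport polytope
assert np.allclose(P.sum(axis=1), r) and np.allclose(P.sum(axis=0), c)
```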

Slide 10

Slide 10 text

(Outline slide, repeated; see Slide 3.)

Slide 11

Slide 11 text

Rényi-regularized OT
Definition (Rényi-regularized OT [Bresch, S. ’24]). The Rényi-regularized OT problem is
OTε,α : Pp(X) × Pp(X) → [0, ∞), (µ, ν) ↦ min_{π ∈ Π(µ,ν)} ⟨c, π⟩ + ε Rα(π | µ ⊗ ν).
Theorem (OTε,α is a pre-metric [Bresch, S. ’24]). Pp(X)² ∋ (µ, ν) ↦ 1_{[µ≠ν]} OTε,α(µ, ν) is a metric for α ∈ (0, 1), ε ∈ [0, ∞).
Lemma (Monotonicity of Rényi-regularized OT [Bresch, S. ’24]). Let µ, ν ∈ Pp(X), α, α′ ∈ (0, 1) and ε, ε′ ≥ 0 with α > α′ and ε < ε′. Then we have
OTε′,α(µ, ν) ≥ OTε,α(µ, ν) ≥ OTε,α′(µ, ν).
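In the discrete setting the objective is easy to evaluate, and the monotonicity in α already holds pointwise at any fixed plan, since Rα is nondecreasing in α. A small sketch (plan and marginals are my own example):

```python
import numpy as np

def renyi_plan(P, Q, alpha):
    # R_alpha(P | Q) for discrete plans, treated as flattened distributions
    return np.log(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

def ot_renyi_objective(P, M, r, c, eps, alpha):
    # <M, P> + eps * R_alpha(P | r c^T)
    return np.sum(M * P) + eps * renyi_plan(P, np.outer(r, c), alpha)

r = np.array([0.5, 0.5])
c = np.array([0.5, 0.5])
M = np.array([[0.0, 1.0], [1.0, 0.0]])
P = np.array([[0.4, 0.1], [0.1, 0.4]])     # a feasible plan in Pi(r, c)

# R_alpha nondecreasing in alpha => objective nondecreasing in alpha at fixed P
assert ot_renyi_objective(P, M, r, c, 1.0, 0.3) <= ot_renyi_objective(P, M, r, c, 1.0, 0.7)
# the product plan has zero regularizer, so its objective is just the transport cost
assert np.isclose(ot_renyi_objective(np.outer(r, c), M, r, c, 1.0, 0.5),
                  np.sum(M * np.outer(r, c)))
```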

Slide 12

Slide 12 text

Theorem (Interpolation properties).
Values: OTε,α(µ, ν) → OT(µ, ν) as α ↘ 0 or ε ↘ 0; OTε,α(µ, ν) → ⟨d, µ ⊗ ν⟩ as ε → ∞; OTε,α(µ, ν) → OTε(µ, ν) as α ↗ 1.
Plans: πε,α(µ, ν) → some π ∈ Π(µ, ν) with ⟨d, π⟩ = OT(µ, ν) as α ↘ 0; πε,α(µ, ν) → argmin{ Rα(π | µ ⊗ ν) : π ∈ Π(µ, ν), ⟨d, π⟩ = OT(µ, ν) } as ε ↘ 0; πε,α(µ, ν) → µ ⊗ ν as ε → ∞; πε,α(µ, ν) → πε as α ↗ 1.

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

(Outline slide, repeated; see Slide 3.)

Slide 15

Slide 15 text

Mirror descent
Solve min_{x ∈ K} f(x), where K ⊂ R^n is compact, via the updates
x^{(k+1)} = argmin_{y ∈ K} D_h( y | (∇h)^{-1}( ∇h(x^{(k)}) − η_k ∇f(x^{(k)}) ) ), x^{(0)} ∈ K, η_k > 0, (1)
for a convex function h : R^n → R with special properties (the mirror map), where D_h denotes the Bregman divergence of h. (Figure: https://doi.org/10.48550/arXiv.2108.09489)
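To make update (1) concrete: with K the probability simplex and h the negative Shannon entropy, (1) reduces to a multiplicative gradient step followed by renormalization. A minimal sketch (the linear objective and step size are my own illustrative choices):

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, eta, n_iter):
    """Entropic mirror descent on the probability simplex: with h = negative
    Shannon entropy, update (1) becomes a multiplicative step
    x <- x * exp(-eta * grad f(x)), renormalized (the KL projection onto the simplex)."""
    x = x0.copy()
    for _ in range(n_iter):
        x = x * np.exp(-eta * grad_f(x))
        x /= x.sum()
    return x

# minimize a linear function <cost, x> over the simplex
cost = np.array([0.3, 0.1, 0.5])
x = mirror_descent_simplex(lambda x: cost, np.full(3, 1 / 3), eta=0.5, n_iter=200)

# the minimizer of a linear objective over the simplex is the cheapest vertex
assert np.argmax(x) == 1 and x[1] > 0.99
```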

Slide 16

Slide 16 text

Numerical Experiments — Better transport plans
Choose K = Σ_{N×N} (the probability simplex of N × N matrices) and −h = Shannon entropy, so that D_h = KL. The Rényi-regularized OT objective
Π(r, c) → [0, ∞), P ↦ ⟨M, P⟩ + ε Rα(P | r c^T),
is not Lipschitz continuous, but it is locally Lipschitz on
{ P ∈ Π(r, c) : P|_{supp(r c^T)} > 0 } = Π(r, c) ∩ R^{N×N}_{>0},
which suffices for convergence of mirror descent with a special step size (η_k)_{k∈N} (You, Li, 2022). In each iteration, one KL-projection onto Π(r, c) (computed with the Sinkhorn algorithm) is performed:
P^{(k)} ← Sinkhorn( P^{(k−1)} ⊙ exp( −η_k M − η_k ε · α/(α − 1) · (r c^T ⊘ P^{(k−1)})^{1−α} / ⟨(P^{(k−1)})^α, (r c^T)^{1−α}⟩ ); r, c ), k ∈ N,
where ⊙, ⊘ and the powers are entrywise.
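The update above can be sketched compactly in numpy. This is a hedged toy implementation, not the paper's code: the constant step size, iteration counts, and the 2 × 2 example are my own choices (the talk uses the adaptive step-size rule of You & Li, 2022), and the inner Sinkhorn loop uses a fixed iteration budget instead of a convergence test.

```python
import numpy as np

def sinkhorn_proj(K, r, c, n_iter=500):
    """KL-projection of a positive matrix K onto Pi(r, c) by matrix scaling."""
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]

def renyi_ot_md(M, r, c, eps, alpha, eta=0.5, n_iter=300):
    """Mirror descent for Renyi-regularized OT: multiplicative gradient step,
    then a KL (Sinkhorn) projection back onto the transport polytope."""
    Q = np.outer(r, c)
    P = Q.copy()                           # strictly positive, feasible start
    for _ in range(n_iter):
        # gradient of eps * R_alpha(P | Q) with respect to P
        grad = eps * alpha / (alpha - 1.0) * (Q / P)**(1.0 - alpha) \
               / np.sum(P**alpha * Q**(1.0 - alpha))
        P = sinkhorn_proj(P * np.exp(-eta * (M + grad)), r, c)
    return P

r = np.array([0.5, 0.5])
c = np.array([0.5, 0.5])
M = np.array([[0.0, 1.0], [1.0, 0.0]])
P = renyi_ot_md(M, r, c, eps=0.05, alpha=0.1)

# small eps, small alpha: the plan concentrates near the EMD optimum diag(0.5, 0.5)
assert np.sum(M * P) < 0.1
assert np.allclose(P.sum(axis=1), r) and np.allclose(P.sum(axis=0), c)
```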

Slide 17

Slide 17 text

Rényi regularization yields more accurate plans
Regularized OT plans for Gaussian (top) and Poisson (bottom) marginals, with regularization parameter λ = 10, Rényi order α = 0.01, and Tsallis order q = 2.

Slide 18

Slide 18 text

Numerical Experiments — Predicting voter migration (ε = 1)

regularizer   | abs. error ± std              | KL error    | mean squared error
KL            | 2.4221 × 10¹ ± 2.848 × 10¹    | 8.422 × 10²  | 9.008 × 10⁴
Tsallis       | 9.409 ± 1.529 × 10¹           | 3.173 × 10²  | 2.063 × 10⁴
OT            | 1.845 × 10¹ ± 2.358 × 10¹     | 7.655 × 10²  | 5.738 × 10⁴
(3/10)-Rényi  | 6.611 ± 7.868                 | 2.128 × 10²  | 6.759 × 10³

Slide 19

Slide 19 text

Conclusion
− Contribution: regularize the optimal transport problem using the α-Rényi divergences Rα for α ∈ (0, 1); prove a dual formulation and interpolation properties.
− Prior work: regularization with KL = limα↗1 Rα and with the q-Tsallis divergence.
− Method: solve the primal problem with mirror descent and the dual problem with subgradient descent.
− Result: Rényi-regularized OT plans outperform KL- and Tsallis-regularized OT plans on real and synthetic data.
− Novelty: Rα is neither an f-divergence nor a Bregman divergence, and Rα is not "separable" due to the logarithm.

Slide 20

Slide 20 text

Thank you for your attention! I am happy to take any questions.
Paper link: https://arxiv.org/abs/2404.18834
My website: https://viktorajstein.github.io
References: [PC19, Cut13, MNPN17, vEH14, BT03, NY83, NS21, Rén61, Tsa88]

Slide 21

Slide 21 text

References I
[BT03] Amir Beck and Marc Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Oper. Res. Lett. 31 (2003), no. 3, 167–175.
[Cut13] Marco Cuturi, Sinkhorn distances: lightspeed computation of optimal transport, Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), vol. 2, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 2292–2300.
[MNPN17] Boris Muzellec, Richard Nock, Giorgio Patrini, and Frank Nielsen, Tsallis regularized optimal transport and ecological inference, Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, San Francisco, CA, USA, 2017.
[NS21] Sebastian Neumayer and Gabriele Steidl, From optimal transport to discrepancy, Handbook of Mathematical Models and Algorithms in Computer Vision and Imaging (2021), 1–36.
[NY83] Arkadij Semenovič Nemirovskij and David Borisovich Yudin, Problem Complexity and Method Efficiency in Optimization, Wiley, New York, 1983.

Slide 22

Slide 22 text

References II
[PC19] Gabriel Peyré and Marco Cuturi, Computational optimal transport, Found. Trends Mach. Learn. 11 (2019), no. 5-6, 355–607.
[Rén61] Alfréd Rényi, On measures of entropy and information, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, 1961, pp. 547–562.
[Tsa88] Constantino Tsallis, Possible generalization of Boltzmann-Gibbs statistics, J. Stat. Phys. 52 (1988), 479–487.
[vEH14] Tim van Erven and Peter Harremos, Rényi divergence and Kullback-Leibler divergence, IEEE Trans. Inf. Theory 60 (2014), no. 7, 3797–3820.

Slide 23

Slide 23 text

Work in progress — Rényi-Sinkhorn divergence
Since OTε,α(µ, µ) ≠ 0, to obtain a valid, differentiable distance we debias:
Dε,α(µ, ν) := OTε,α(µ, ν) − ½ OTε,α(µ, µ) − ½ OTε,α(ν, ν).
This can be used for gradient flows. [Plot.]
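The debiasing recipe itself is easy to sketch. As a self-contained stand-in for OTε,α, the snippet below uses the entropic (KL-regularized) transport cost computed by Sinkhorn; the function names and the uniform example are my own, and only illustrate that the debiased quantity vanishes on the diagonal by construction:

```python
import numpy as np

def sinkhorn_cost(r, c, M, eps, n_iter=500):
    """Regularized transport value <M, P_eps> via Sinkhorn (stand-in for OT_{eps,alpha})."""
    K = np.exp(-M / eps)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)
        u = r / (K @ v)
    P = u[:, None] * K * v[None, :]
    return np.sum(M * P)

def debiased_divergence(r, c, M, eps):
    """D(mu, nu) = OT(mu, nu) - OT(mu, mu)/2 - OT(nu, nu)/2 (same cost matrix M)."""
    return (sinkhorn_cost(r, c, M, eps)
            - 0.5 * sinkhorn_cost(r, r, M, eps)
            - 0.5 * sinkhorn_cost(c, c, M, eps))

x = np.linspace(0.0, 1.0, 10)
M = (x[:, None] - x[None, :])**2
r = np.full(10, 0.1)

# D(mu, mu) = 0 by construction, unlike the raw regularized cost OT_eps(mu, mu)
assert np.isclose(debiased_divergence(r, r, M, eps=0.1), 0.0)
```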

Slide 24

Slide 24 text

First: a different way of regularizing
For a regularization parameter γ ∈ [0, ∞] and α ∈ (0, 1), the restricted transport polytope
Π^α_γ(µ, ν) := { π ∈ Π(µ, ν) : Rα(π | µ ⊗ ν) ≤ γ }
is weakly compact, since Rα(· | µ ⊗ ν) is weakly lower semicontinuous and Π(µ, ν) is weakly compact.
Definition (Rényi-Sinkhorn distance). The Rényi-Sinkhorn distance between µ, ν ∈ Pp(X) is
dγ,α : Pp(X) × Pp(X) → R, (µ, ν) ↦ min{ ⟨d^p, π⟩^{1/p} : π ∈ Π^α_γ(µ, ν) }. (2)
Theorem (Bresch, S. ’24).
• For (µ, ν) ∈ Pp(X)², the optimization problem (2) is convex and has a unique minimizer.
• Pp(X)² ∋ (µ, ν) ↦ 1_{[µ≠ν]} dγ,α(µ, ν) is a metric for α ∈ (0, 1), γ ∈ [0, ∞].
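In the discrete case, membership in the restricted polytope is a marginal check plus one divergence evaluation. A small sketch (helper names and examples are my own): the product plan always belongs to Π^α_γ since its Rényi divergence from itself is zero, while a plan concentrated on the diagonal is excluded for small γ.

```python
import numpy as np

def renyi_plan(P, Q, alpha):
    # discrete alpha-Renyi divergence of the plan P from the reference Q
    return np.log(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

def in_restricted_polytope(P, r, c, alpha, gamma, tol=1e-9):
    """Check P in Pi^alpha_gamma(r, c): correct marginals and R_alpha(P | r c^T) <= gamma."""
    Q = np.outer(r, c)
    return (np.allclose(P.sum(axis=1), r, atol=tol)
            and np.allclose(P.sum(axis=0), c, atol=tol)
            and renyi_plan(P, Q, alpha) <= gamma + tol)

r = np.array([0.5, 0.5])
c = np.array([0.5, 0.5])

# the product plan lies in every restricted polytope: R_alpha(rc^T | rc^T) = 0
assert in_restricted_polytope(np.outer(r, c), r, c, alpha=0.5, gamma=0.0)
# a plan concentrated on the diagonal violates the divergence budget for small gamma
P_diag = np.array([[0.5, 0.0], [0.0, 0.5]])
assert not in_restricted_polytope(P_diag, r, c, alpha=0.5, gamma=0.1)
```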

Slide 25

Slide 25 text

[Diagram: the transport polytope Π(µ, ν) and the restricted transport polytope Π^α_γ(µ, ν) inside M(X × X), with the product measure µ ⊗ ν, the unregularized optimal plan π_c, the regularized plan π_c^{α,γ}, and the direction c⊥, for c = d^p. Plot inspired by (Cuturi, 2013).]

Slide 26

Slide 26 text

The Dual Point of View — Penalizing the Constraint
Instead of restricting the problem's domain, penalize the Rényi-divergence constraint in (2).
Definition (dual Rényi-divergence-Sinkhorn distance). The dual Rényi-divergence-Sinkhorn distance for α ∈ (0, 1), ε ∈ [0, ∞) is
dα,ε : Pp(X) × Pp(X) → R, (µ, ν) ↦ ⟨d^p, πα,ε(µ, ν)⟩^{1/p},
where πα,ε(µ, ν) ∈ argmin{ ⟨d^p, π⟩ + ε Rα(π | µ ⊗ ν) : π ∈ Π(µ, ν) }. (3)
Theorem (Lagrangian point of view and pre-metric [Bresch, S. ’24]). Let (µ, ν) ∈ Pp(X)².
• The optimization problem (3) is convex and has a unique minimizer.
• The Rényi-Sinkhorn distance dγ,α(µ, ν) and the dual Rényi-Sinkhorn distance dα,ε(µ, ν) are equivalent: for γ > 0, there exists ε ∈ [0, ∞] such that ⟨d^p, πα,ε(µ, ν)⟩ = dγ,α(µ, ν)^p.

Slide 27

Slide 27 text

Dual formulation, representation of πα,ε
From now on, let X be compact. By the Riesz representation theorem, M(X), the space of all finite signed Borel measures on X, is the dual space of C(X), the space of real-valued continuous functions on X. Recall (f ⊕ g)(x, y) := f(x) + g(y).
Theorem (Dual problem, dual representation [Bresch, S. ’24]). We have the strong duality
OTε,α(µ, ν) = max_{f,g ∈ C(X), f ⊕ g ≤ d} ⟨f ⊕ g, µ ⊗ ν⟩ − ε ln⟨ (d − f ⊕ g)^{α/(α−1)}, µ ⊗ ν ⟩ + Cα,ε. (4)
The optimal dual potentials f̂, ĝ ∈ C(X) from (4) are unique supp(µ ⊗ ν)-a.e. up to additive constants, and the unique optimal plan is
πα,ε ∝ (d − f̂ ⊕ ĝ)^{1/(α−1)} · (µ ⊗ ν).
Proof idea: use the Fenchel-Rockafellar theorem, extending the objective to M(X) × M(X) by +∞.