Accelerated_Variational_Gradient_Flow_Slides.pdf

Viktor Stein

October 29, 2025
Transcript

  1. Accelerated Stein Variational Gradient Flow: Analysis of Generalized Bilinear Kernels for Gaussian Targets

     Joint work with Wuchen Li, University of South Carolina. Geometric Science of Information 2025.
     Viktor Stein (TU Berlin), 29.10.2025.
  2. Motivation: Sampling for Bayesian Inference

     Given: labelled i.i.d. data $D := (w_i, y_i)_{i=1}^m$. Assume the underlying model $y = g_\theta(w) + \xi$, where $\xi \sim \mathcal{N}(0, \sigma^2\,\mathrm{id})$ and, e.g., $g_\theta$ is a neural network with weights $\theta \in \Theta \subset \mathbb{R}^d$.
     Goal: learn the best distribution of the parameter $\theta$ to fit $D$.
     Solution: since $\xi \sim \mathcal{N}(0, \sigma^2\,\mathrm{id})$, the likelihood is
     \[
       p(D \mid \theta) = \prod_{i=1}^m p(y_i \mid \theta, w_i)
       \propto \prod_{i=1}^m \exp\Big(-\tfrac{1}{2\sigma^2}\|y_i - g_\theta(w_i)\|_2^2\Big)
       = \exp\Big(-\tfrac{1}{2\sigma^2}\sum_{i=1}^m \|y_i - g_\theta(w_i)\|_2^2\Big), \qquad \theta \in \Theta.
     \]
     After choosing a prior $p$ on $\theta$, Bayes' rule yields $\pi(\theta) := p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) =: e^{-V(\theta)}$, where $V$ is known but the normalization constant is unknown.
     ⇝ Need $\pi$ to predict a new output $y_{\mathrm{new}} = \int g_\theta(w_{\mathrm{new}})\,\mathrm{d}\pi(\theta)$.
  3. Motivation: Sampling for Bayesian Inference

     Task: sample from the density $\pi \propto e^{-V}$ with potential $V \colon \mathbb{R}^d \to \mathbb{R}$.

     1) Diffusion-based ("Langevin MCMC"): discretize the diffusion process $\mathrm{d}Y_t = -\nabla V(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t$ explicitly in time ⇝ single particle system in $\mathbb{R}^d$:
     \[
       y_{n+1} \leftarrow y_n - \tau \nabla V(y_n) + \sqrt{2\tau}\, u_n, \qquad u_n \sim \mathcal{N}(0, \mathrm{id}), \quad \tau > 0
     \]
     (Durmus and Moulines 2019; Hagemann et al. 2025; Cheng et al. 2018).

     2) ODE-based / Variational Inference: (Wasserstein) gradient flow to minimize a discrepancy $E = \mathrm{KL}(\cdot \mid \pi) \colon \mathcal{P}(\mathbb{R}^d) \to [0, \infty]$:
     \[
       \partial_t \rho_t = -\nabla_{\mathcal{P}(\mathbb{R}^d)} E(\rho_t), \qquad \rho_t \in \mathcal{P}(\mathbb{R}^d), \quad t > 0.
     \]
     ⇝ time discretization $\rho_{n+1} \leftarrow \rho_n - \tau \nabla_{\mathcal{P}(\mathbb{R}^d)} E(\rho_n)$, which can be realized as an interacting particle system (Arbel et al. 2019; Chen et al. 2025; Duane et al. 1987; Mills-Williams et al. 2024; Korba et al. 2021).
     [Figure credit: Francis Bach]
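     For concreteness, a minimal sketch (not the speaker's code; names such as ula, grad_V, tau are illustrative) of the explicit Euler-Maruyama discretization in 1):

     # Minimal sketch: unadjusted Langevin algorithm (ULA) for pi ∝ exp(-V).
     import numpy as np

     def ula(grad_V, y0, tau=1e-2, n_steps=1000, rng=None):
         """Explicit Euler-Maruyama discretization of dY_t = -grad V(Y_t) dt + sqrt(2) dB_t."""
         rng = np.random.default_rng() if rng is None else rng
         y = np.array(y0, dtype=float)
         for _ in range(n_steps):
             y = y - tau * grad_V(y) + np.sqrt(2.0 * tau) * rng.standard_normal(y.shape)
         return y

     # Example: standard Gaussian target, V(y) = ||y||^2 / 2, so grad V(y) = y.
     sample = ula(lambda y: y, y0=np.zeros(2))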
  4. Stein Variational Gradient Descent (SVGD) (Q. Liu and D. Wang 2016)

     Symmetric, positive definite, differentiable "kernel" $K \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. SVGD update in the $n$-th iteration with step size $\tau > 0$:
     \[
       x_i^{n+1} \leftarrow x_i^n - \frac{\tau}{N} \sum_{j=1}^N \Big[ \underbrace{K(x_i^n, x_j^n)\,\nabla V(x_j^n)}_{\text{attraction}} - \underbrace{\nabla_2 K(x_i^n, x_j^n)}_{\text{repulsion}} \Big], \qquad i \in \{1, \dots, N\}.
     \]
     [Photo: Charles Stein]

     Reason: suppose $(x_i^n)_{i=1}^N \sim \rho_n$, then $x_i^{n+1} = T_n(x_i^n)$ with $T_n := \mathrm{id} + \eta \phi_n$, where
     \[
       \phi_n := \operatorname*{argmax}_{\phi \in \mathcal{H}_K^d,\ \|\phi\|^2_{\mathcal{H}_K^d} \le C(\rho_n, \pi)} \Big( -\partial_\eta \big|_{\eta = 0} \mathrm{KL}\big((\mathrm{id} + \eta\phi)_\# \rho_n \mid \pi\big) \Big)
       = x \mapsto \mathbb{E}_{\rho_n}\big[(\nabla_2 K)(x, \cdot) - \nabla V \cdot K(x, \cdot)\big].
     \]
     Comparison to the Euclidean setting:
     \[
       \frac{\nabla f(x)}{\|\nabla f(x)\|_2} = \operatorname*{argmax}_{v \in \mathbb{R}^d,\ \|v\| \le 1} \partial_\eta \big|_{\eta = 0} f(x + \eta v).
     \]
     ⇝ SVGD is gradient descent of $\mathrm{KL}(\cdot \mid \pi)$ with respect to a "Stein geometry" (induced by $K$) when replacing $\rho$ by a sample average (Duncan et al. 2023).
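     A minimal sketch of one SVGD iteration with a Gaussian kernel (illustrative only; svgd_step and bandwidth are placeholder names, see the linked repository for the actual implementation):

     import numpy as np

     def svgd_step(X, grad_V, tau=1e-1, bandwidth=1.0):
         """One SVGD update: x_i <- x_i - (tau/N) sum_j [K(x_i,x_j) grad V(x_j) - grad_2 K(x_i,x_j)]."""
         N = X.shape[0]
         diff = X[:, None, :] - X[None, :, :]                        # (N, N, d), diff[i, j] = x_i - x_j
         K = np.exp(-np.sum(diff**2, axis=-1) / (2 * bandwidth**2))  # Gaussian kernel matrix (N, N)
         grad2_K = K[:, :, None] * diff / bandwidth**2               # grad of K(x_i, .) evaluated at x_j
         attraction = K @ grad_V(X)                                  # sum_j K(x_i, x_j) grad V(x_j)
         repulsion = grad2_K.sum(axis=1)                             # sum_j grad_2 K(x_i, x_j)
         return X - (tau / N) * (attraction - repulsion)

     # Example: standard Gaussian target V(x) = ||x||^2 / 2.
     X = np.random.default_rng(0).normal(size=(50, 2))
     for _ in range(200):
         X = svgd_step(X, grad_V=lambda X: X)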
  5. Accelerating gradient descent (Nesterov 1983; Beck and Teboulle 2009)

     Goal: find a minimizer $x^*$ of a convex, differentiable function $f \colon \mathbb{R}^d \to \mathbb{R}$ with $L$-Lipschitz gradient.
     Gradient descent $x_{n+1} \leftarrow x_n - \tau \nabla f(x_n)$ with step size $\tau \in \big(0, \tfrac{2}{L}\big)$ converges at rate $f(x_n) - f(x^*) \in O(n^{-1})$.
     Nesterov's accelerated gradient descent
     \[
       \begin{cases}
         x_{n+1} = y_n - \tau \nabla f(y_n) & \text{(gradient step)} \\
         y_{n+1} = x_{n+1} + \alpha_{n+1}(x_{n+1} - x_n) & \text{(momentum step)}
       \end{cases}
     \]
     converges at the faster rate $f(x_n) - f(x^*) \in O(n^{-2})$.
     Damping: $\alpha_n = \frac{n-1}{n+2}$, or $\alpha_n = \frac{\sqrt{L} - \sqrt{\beta}}{\sqrt{L} + \sqrt{\beta}}$ if $f$ is $\beta$-strongly convex; then
     $f(x_n) - f(x^*) \in O\big(\big(1 - \sqrt{\beta/L}\,\big)^n\big)$ vs. $O\big(\big(1 - \beta/L\big)^n\big)$ for gradient descent.
     [Figure credits: Francis Bach, A. Wibisono]
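     A minimal sketch (illustrative, not from the slides) of Nesterov's scheme with damping $\alpha_n = \frac{n-1}{n+2}$; plain gradient descent would simply drop the momentum step:

     import numpy as np

     def nesterov(grad_f, x0, tau, n_iter=500):
         """Accelerated gradient descent: gradient step taken at the extrapolated point y_n."""
         x_prev = x = np.array(x0, dtype=float)
         for n in range(1, n_iter + 1):
             y = x + (n - 1) / (n + 2) * (x - x_prev)    # momentum step
             x_prev, x = x, y - tau * grad_f(y)          # gradient step
         return x

     # Example: quadratic f(x) = 0.5 x^T Q x with L = lambda_max(Q); step size tau = 1/L.
     Q = np.diag([1.0, 100.0])
     x_min = nesterov(lambda x: Q @ x, x0=np.ones(2), tau=1.0 / 100.0)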
  6. The density manifold (Lafferty 1988; Kriegl and Michor 1997)

     The set of smooth positive probability densities
     \[
       \mathcal{P}(\Omega) := \Big\{ \rho \in C^\infty(\Omega) : \rho(x) > 0 \ \forall x \in \Omega, \ \int_\Omega \rho(x)\,\mathrm{d}x = 1 \Big\},
     \]
     under assumptions on $\Omega$, forms an infinite-dimensional smooth manifold.
     • Tangent space to $\mathcal{P}(\Omega)$ at $\rho$: $T_\rho \mathcal{P}(\Omega) := \big\{ \sigma \in C^\infty(\Omega) : \int_\Omega \sigma(x)\,\mathrm{d}x = 0 \big\}$.
     • Cotangent space: $T^*_\rho \mathcal{P}(\Omega) := C^\infty(\Omega) / \mathbb{R}$.
     • Metric tensor field on $\mathcal{P}(\Omega)$: $G \colon \mathcal{P}(\Omega) \ni \rho \mapsto G_\rho$.
     • Fisher-Rao metric: $G_\rho^{-1}[\Phi] = \rho\,(\Phi - \mathbb{E}_\rho[\Phi])$.
     • Otto's Wasserstein-2 metric: $[G^W_\rho]^{-1}[\Phi] := -\nabla \cdot (\rho \nabla \Phi)$.
     • Stein metric: $(G^{(K)}_\rho)^{-1}[\Phi] := -\nabla \cdot \big(\rho \int_\Omega K(\cdot, y)\,\rho(y)\,\nabla\Phi(y)\,\mathrm{d}y\big)$.
     [Photo: John Lafferty]
  7. Gradient flows in the density manifold

     Definition (First linear functional derivative). If it exists, the first functional derivative of $E \colon \mathcal{P}(\Omega) \to \mathbb{R}$ is $\delta E \colon \mathcal{P}(\Omega) \to C^\infty(\Omega)/\mathbb{R}$ with
     \[
       \langle \delta E(\rho), \phi \rangle_{L^2(\Omega)} = \partial_t \big|_{t=0} E(\rho + t\phi) \qquad \forall \phi \in C^\infty(\Omega) \text{ such that } \rho + t\phi \in \mathcal{P}(\Omega) \text{ for } |t| \text{ small}.
     \]
     Definition (Metric gradient flow on $\mathcal{P}(\Omega)$). A smooth curve $\rho \colon [0, \infty) \to \mathcal{P}(\Omega)$, $t \mapsto \rho_t$ is a $(\mathcal{P}(\Omega), G)$-gradient flow of $E$ starting at $\rho(0)$ if
     \[
       \partial_t \rho_t = -G_{\rho_t}^{-1}[\delta E(\rho_t)] \qquad \forall t > 0.
     \]
     [Figure credit: https://www.offconvex.org/2022/01/06/gf-gd/]

     The $(\mathcal{P}(\Omega), G^W)$-gradient flow of $E = \mathrm{KL}(\cdot \mid Z^{-1} e^{-V})$ is
     \[
       \partial_t \rho_t = \nabla \cdot \big(\rho_t (\nabla \log \rho_t + \nabla V)\big) = \Delta \rho_t + \nabla \cdot (\rho_t \nabla V),
     \]
     where we used $\rho_t \nabla \log \rho_t = \nabla \rho_t$ (Jordan et al. 1998).
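     A short sanity check (a sketch using only the two definitions above): for $E = \mathrm{KL}(\cdot \mid \pi)$ with $\pi = Z^{-1}e^{-V}$, computing the first functional derivative and inserting it into Otto's metric recovers the Fokker-Planck form stated above.
     \[
       E(\rho) = \int_\Omega \rho \log\frac{\rho}{\pi}\,\mathrm{d}x
       \;\Longrightarrow\;
       \delta E(\rho) = \log\rho - \log\pi + 1 = \log\rho + V + \mathrm{const} \ \text{in } C^\infty(\Omega)/\mathbb{R},
     \]
     \[
       \partial_t \rho_t = -[G^W_{\rho_t}]^{-1}[\delta E(\rho_t)]
       = \nabla\cdot\big(\rho_t\,\nabla(\log\rho_t + V)\big)
       = \Delta\rho_t + \nabla\cdot(\rho_t\nabla V).
     \]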
  8. Accelerated gradient flows: from $\mathbb{R}^d$ to $\mathcal{P}(\Omega)$ (Su et al. 2016)

     The continuous-time limit of accelerated gradient descent is the second-order ODE
     \[
       \ddot{x}_t + \alpha_t \dot{x}_t + \nabla f(x_t) = 0, \qquad \text{where } \alpha_t = \tfrac{3}{t} \text{ or } \alpha_t = 2\sqrt{\beta}.
     \]
     This can be rewritten as a damped Hamiltonian flow (Maddison et al. 2018):
     \[
       \begin{pmatrix} \dot{x}_t \\ \dot{p}_t \end{pmatrix}
       + \begin{pmatrix} 0 \\ \alpha_t p_t \end{pmatrix}
       - \begin{pmatrix} 0 & \mathrm{id} \\ -\mathrm{id} & 0 \end{pmatrix}
       \begin{pmatrix} \nabla_x H(x_t, p_t) \\ \nabla_p H(x_t, p_t) \end{pmatrix} = 0,
     \]
     where $x \in \mathbb{R}^d$ represents position, $p = \dot{x} \in \mathbb{R}^d$ represents momentum, and the Hamiltonian is
     \[
       H \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, \quad (x, p) \mapsto \tfrac{1}{2}\|p\|_2^2 + f(x) = \text{kinetic energy} + \text{potential energy}.
     \]
     [Photos: Weijie Su, Stephen Boyd, Emmanuel Candès]
     ⇝ The Hamiltonian formulation can be generalized from $\mathbb{R}^d$ to Riemannian manifolds (Alimisis et al. 2020) and to $\mathcal{P}(\Omega)$ (Y. Wang and Li 2022). The damping $\alpha_t$ is a friction coefficient that induces entropy dissipation.
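     A minimal sketch (illustrative; the step size dt, the starting time, and the example f are placeholders) of integrating the damped Hamiltonian system $\dot{x} = p$, $\dot{p} = -\alpha_t p - \nabla f(x)$ with $\alpha_t = 3/t$ by a semi-implicit Euler scheme:

     import numpy as np

     def damped_hamiltonian_flow(grad_f, x0, dt=1e-2, T=20.0):
         """Integrate dx/dt = p, dp/dt = -(3/t) p - grad f(x), starting at t = 1 to avoid the singular damping at t = 0."""
         x = np.array(x0, dtype=float)
         p = np.zeros_like(x)
         t = 1.0
         while t < T:
             alpha = 3.0 / t                        # Su-Boyd-Candes damping
             p = p - dt * (alpha * p + grad_f(x))   # momentum update
             x = x + dt * p                         # position update with the new momentum
             t += dt
         return x

     # Example: f(x) = 0.5 ||x||^2, whose minimizer is the origin.
     x_T = damped_hamiltonian_flow(lambda x: x, x0=np.array([5.0, -3.0]))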
  9. Accelerated Stein variational gradient flow on densities

     The Hamiltonian on $\mathcal{P}(\Omega)$ is
     \[
       \mathcal{H} \colon T\mathcal{P}(\Omega) \to \mathbb{R} \cup \{\infty\}, \quad (\rho, \Phi) \mapsto \tfrac{1}{2}\int_\Omega \Phi(x)\, G_\rho^{-1}[\Phi](x)\,\mathrm{d}x + E(\rho).
     \]
     As before: $\rho$ plays the role of the position, $\Phi$ the role of the momentum. The accelerated gradient flow in $\mathcal{P}(\Omega)$ is
     \[
       \begin{cases}
         \partial_t \rho_t = G_{\rho_t}^{-1}[\Phi_t], \\
         \partial_t \Phi_t + \alpha_t \Phi_t + \tfrac{1}{2}\,\delta\Big(\int_\Omega \Phi_t(x)\, G_\bullet^{-1}[\Phi_t](x)\,\mathrm{d}x\Big)(\rho_t) + \delta E(\rho_t) = 0, \qquad \Phi_0 = 0.
       \end{cases}
     \]
     ⇝ The accelerated Stein variational gradient flow is
     \[
       \begin{cases}
         \partial_t \rho_t + \nabla \cdot \Big(\rho_t \int_\Omega K(\cdot, y)\,\rho_t(y)\,\nabla\Phi_t(y)\,\mathrm{d}y\Big) = 0, \\
         \partial_t \Phi_t + \alpha_t \Phi_t + \int_\Omega K(y, \cdot)\,\langle \nabla\Phi_t(y), \nabla\Phi_t(\cdot) \rangle\,\rho_t(y)\,\mathrm{d}y + \delta E(\rho_t) = 0.
       \end{cases}
     \]
  10. Discretizing accelerated gradient flows in the density manifold

     Replace $\rho_t$ by its empirical estimate $\frac{1}{N}\sum_{j=1}^N \delta_{X_t^j}$ ⇝ $N$ particles $(X_t^j)_{j=1}^N \subset \mathbb{R}^d$ and their accelerations $(\nabla\Phi_t(X_t^{(j)}))_{j=1}^N \subset \mathbb{R}^d$ at time $t$ (Y. Wang and Li 2022), but this needs kernel density estimation (not robust).
     We introduce a score-free particle discretization using the particle momentum
     \[
       Y \colon (0, \infty) \to \mathbb{R}^d, \quad t \mapsto \dot{X}_t, \qquad \dot{X}_t = Y_t = \int_\Omega K(X_t, y)\,\nabla\Phi_t(y)\,\rho_t(y)\,\mathrm{d}y.
     \]
     Lemma (Accelerated Stein variational gradient flows with particles' momenta).
     \[
       \begin{cases}
         \dot{X}_t = Y_t, \\
         \dot{Y}_t = -\alpha_t Y_t - \int_\Omega \big(K(X_t, y)\,\nabla V(y) - \nabla_2 K(X_t, y)\big)\,\rho_t(y)\,\mathrm{d}y \\
         \qquad\quad + \int_{\Omega^2} \langle \nabla\Phi_t(z), \nabla\Phi_t(y) \rangle \cdot \big[ K(y, z)\,(\nabla_2 K)(X_t, y) + K(X_t, z)\,(\nabla_1 K)(X_t, y) - K(X_t, y)\,(\nabla_2 K)(z, y) \big]\,\rho_t(y)\,\mathrm{d}y\,\rho_t(z)\,\mathrm{d}z.
       \end{cases}
     \]
     ⇝ Replace expectations w.r.t. $\rho_t$ by sample averages, then discretize explicitly in time.
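     A simplified sketch of a score-free accelerated particle update in the spirit of the lemma (not the speaker's implementation: the double-integral term, which is quadratic in $\nabla\Phi$, is omitted, a Gaussian kernel is used, and all names and parameter values are illustrative; see the linked repository for the full method):

     import numpy as np

     def asvgd_step(X, Y, grad_V, step, alpha, bandwidth=1.0):
         """Momentum-based particle update: dX = Y, dY ≈ -alpha Y - E_rho[K grad V - grad_2 K]."""
         N = X.shape[0]
         diff = X[:, None, :] - X[None, :, :]                        # (N, N, d)
         K = np.exp(-np.sum(diff**2, axis=-1) / (2 * bandwidth**2))  # kernel matrix (N, N)
         grad2_K = K[:, :, None] * diff / bandwidth**2               # grad_2 K(x_i, x_j)
         force = (K @ grad_V(X) - grad2_K.sum(axis=1)) / N           # sample average of K grad V - grad_2 K
         Y = Y - step * (alpha * Y + force)                          # momentum update (damping + force)
         X = X + step * Y                                            # position update
         return X, Y

     # Example: Gaussian target V(x) = 0.5 x^T Q x, particles started away from the mean.
     rng = np.random.default_rng(0)
     Q = np.array([[3.0, -2.0], [-2.0, 3.0]])
     X = rng.normal(size=(100, 2)) + 1.0
     Y = np.zeros_like(X)
     for _ in range(500):
         X, Y = asvgd_step(X, Y, grad_V=lambda X: X @ Q, step=5e-2, alpha=0.95)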
  11. Gaussians stay Gaussians for the bilinear kernel (T. Liu et al. 2024) [S., Li 2025]

     Let $(\rho_t)_{t \ge 0}$ be the Stein gradient flow with the bilinear kernel $K(x, y) := x^{\mathsf{T}} A y + 1$ of $\mathrm{KL}(\cdot \mid \rho^*)$, $\rho^* \sim \mathcal{N}(b, Q)$, starting at $\rho_0 \sim \mathcal{N}(\mu_0, \Sigma_0)$.
     i) $\rho_t \sim \mathcal{N}(\mu_t, \Sigma_t)$ for all $t \ge 0$, where (this system has a unique solution on $[0, \infty)$)
     \[
       \begin{cases}
         \dot{\mu}_t = (\mathrm{id} - Q^{-1}\Sigma_t) A \mu_t - K(\mu_t, \mu_t)\, Q^{-1}(\mu_t - b), & \mu_t\big|_{t=0} = \mu_0, \\
         \dot{\Sigma}_t = 2\,\mathrm{Sym}(\Sigma_t A) - 2\,\mathrm{Sym}\big(\Sigma_t A \big(\Sigma_t + \mu_t(\mu_t - b)^{\mathsf{T}}\big) Q^{-1}\big), & \Sigma_t\big|_{t=0} = \Sigma_0.
       \end{cases}
     \]
     ii) For $t \to \infty$ we have $\rho_t \rightharpoonup \rho^*$ and $\|\mu_t - b\|, \|\Sigma_t - Q\| \in O\big(e^{-2(C(b, \lambda_{\min}(A)\lambda_{\max}(Q)) - \varepsilon)t}\big)$ for any $\varepsilon > 0$.
     iii) On covariance matrices only we have $\dot{\Sigma}_t = 2\,\mathrm{Sym}\big(\Sigma_t A(\mathrm{id} - \Sigma_t Q^{-1})\big)$. Under commutativity assumptions we have the closed-form solution $\Sigma_t^{-1} = Q^{-1} + e^{-2tA}\big(\Sigma_0^{-1} - Q^{-1}\big)$.

     For the accelerated flow $(\rho_t, \Phi_t)_{t > 0}$ we have $\rho_t \sim \mathcal{N}(\mu_t, \Sigma_t)$, where
     \[
       \begin{cases}
         \dot{\mu}_t = 2 S_t \Sigma_t A \mu_t + K(\mu_t, \mu_t)\, \nu_t, & \mu_t\big|_{t=0} = \mu_0, \\
         \dot{\Sigma}_t = \nu_t \mu_t^{\mathsf{T}} A \Sigma_t + \mathrm{Sym}\big(\Sigma_t A (2\Sigma_t S_t + \mu_t \nu_t^{\mathsf{T}})\big) + 2\,\mathrm{Sym}(\Sigma_t A \Sigma_t S_t), & \Sigma_t\big|_{t=0} = \Sigma_0, \\
         \dot{\nu}_t = -\alpha_t \nu_t - 2 A \Sigma_t S_t \nu_t - A \mu_t \|\nu_t\|_2^2 - Q^{-1}(\mu_t - b), & \nu_0 = 0, \\
         \dot{S}_t = -\alpha_t S_t - 2 S_t \nu_t \mu_t^{\mathsf{T}} A - 4\,\mathrm{Sym}(S_t^2 \Sigma_t A) - \tfrac{1}{2}\big(Q^{-1} - \Sigma_t^{-1}\big), & S_0 = 0.
       \end{cases}
     \]
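     A minimal numerical check (illustrative; the matrices are chosen diagonal so that the commutativity assumption holds, and all values are placeholders) of the covariance-only ODE in iii) against its closed-form solution:

     import numpy as np

     def sym(M):
         return 0.5 * (M + M.T)

     A = np.diag([1.0, 0.5])
     Q = np.diag([2.0, 3.0])          # target covariance
     Sigma = np.diag([0.5, 4.0])      # initial covariance Sigma_0
     Sigma0_inv = np.linalg.inv(Sigma)
     Q_inv = np.linalg.inv(Q)

     dt, T = 1e-3, 5.0
     for _ in range(int(T / dt)):     # explicit Euler on dSigma/dt = 2 Sym(Sigma A (I - Sigma Q^{-1}))
         Sigma = Sigma + dt * 2.0 * sym(Sigma @ A @ (np.eye(2) - Sigma @ Q_inv))

     # Closed form Sigma_T^{-1} = Q^{-1} + exp(-2 T A) (Sigma_0^{-1} - Q^{-1}) for commuting matrices.
     closed_form = np.linalg.inv(Q_inv + np.diag(np.exp(-2 * T * np.diag(A))) @ (Sigma0_inv - Q_inv))
     print(np.diag(Sigma), np.diag(closed_form))   # the two should nearly agree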
  12. Gaussian-Stein metric tensor (T. Liu et al. 2024) [S., Li 2025]

     The pulled-back Stein metric on the Gaussian submanifold is
     \[
       \tilde{g}_{(\mu, \Sigma)}\big((\tilde{\mu}_1, \tilde{\Sigma}_1), (\tilde{\mu}_2, \tilde{\Sigma}_2)\big)
       = \mathrm{tr}(S_1 S_2 \Sigma A \Sigma) + (b_1^{\mathsf{T}} S_2 + b_2^{\mathsf{T}} S_1)\,\Sigma A \mu + K(\mu, \mu)\, b_1^{\mathsf{T}} b_2,
       \qquad (\tilde{\mu}_i, \tilde{\Sigma}_i) \in T_{(\mu, \Sigma)},
     \]
     where $(b_i, \tfrac{1}{2} S_i) = \tilde{G}_{(\mu, \Sigma)}(\tilde{\mu}_i, \tilde{\Sigma}_i)$, $i \in \{1, 2\}$, and the associated tangent-cotangent automorphism is
     \[
       \tilde{G}^{-1}_{(\mu, \Sigma)} \colon \mathbb{R}^d \times \mathrm{Sym}(d) \to \mathbb{R}^d \times \mathrm{Sym}(d), \quad
       (\nu, S) \mapsto \begin{pmatrix} 2 S \Sigma A \mu + K(\mu, \mu)\,\nu \\ 2\,\mathrm{Sym}\big(\Sigma A (2\Sigma S + \mu \nu^{\mathsf{T}})\big) \end{pmatrix}. \tag{1}
     \]
     In the zero-mean case, (1) becomes $\tilde{G}^{-1}_\Sigma \colon \mathrm{Sym}(d) \to \mathrm{Sym}(d)$, $S \mapsto 4\,\mathrm{Sym}(\Sigma A \Sigma S)$.
     The time-dependent choice $A = \Sigma^{-1}$ ⇝ Wasserstein metric, and $A = \Sigma^{-2}$ ⇝ natural gradient descent.
     ⇝ Search for a fixed matrix $A \in \mathrm{Sym}_+(d)$ with the fastest convergence.
  13. Determining asymptotically best parameters by linearization (S., Li 2025)

     Linearizing the evolutions near equilibrium yields the following optimal parameters $A$, $\alpha$ and step sizes $h$:
     • $d = 1$: SVGD: $A = (2Q + b^2)^{-1}$, $h^* = Q$; ASVGD: $\alpha = \sqrt{8A}$, $A = \theta$.
     • $A, Q$ simultaneously diagonalizable, $\mu_0 = b = 0$: SVGD: $A = \tfrac{1}{2} Q^{-1}$; ASVGD: $\alpha = \sqrt{8 \lambda_{\min}(A)}$, $A = \theta\,\mathrm{id}$.
     • Continuous convergence rate: $\frac{\|(\mu_t, \Sigma_t)\|}{\|(\mu_0, \Sigma_0)\|} \le \exp\big(-\tfrac{1}{Q} t\big)$ (SVGD) vs. $\exp\big(-\sqrt{8A}\, t\big)$ (ASVGD).
     • Discrete step size: $h^* := 2\big(\max_{1 \le i, j \le d} \mu_{i,j} + 2\lambda_{\min}(A)\big)^{-1}$.
     • Discrete convergence rate: $\frac{\|(\Sigma_k, S_k)\|_2}{\kappa(V)\,\|(\Sigma_0, S_0)\|_2} \le \rho^k$ with $\rho := \frac{\sqrt{\kappa(Q)} - 1}{\sqrt{\kappa(Q)} + 1} < \frac{\kappa(Q) - 1}{\kappa(Q) + 1}$.
     Here $A = V D_A V^{\mathsf{T}}$, $\mu_{i,j} := \frac{q_i}{q_j} a_i + \frac{q_j}{q_i} a_j$, and $q_i$ resp. $a_i$ are the eigenvalues of $Q$ resp. $A$.
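     A tiny illustrative evaluation of the $d = 1$ entries of the list above (under the reading that $A = (2Q + b^2)^{-1}$ and $\alpha = \sqrt{8A}$; the numerical values are placeholders):

     import numpy as np

     Q, b = 2.0, 1.0                  # target variance and mean of N(b, Q)
     A = 1.0 / (2.0 * Q + b**2)       # asymptotically optimal bilinear-kernel parameter (d = 1 row)
     alpha = np.sqrt(8.0 * A)         # corresponding asymptotically optimal damping for ASVGD
     print(A, alpha)                  # 0.2 and about 1.265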
  14. Numerical examples - Generalized bilinear kernel and Gaussian target

     Fig. 1: Particle trajectories of ASVGD, SVGD (both with the generalized bilinear kernel), MALA, and ULD (from left to right), and the Monte-Carlo-estimated KL divergence for two different choices of $A$.
     The potential is $V(x) = \tfrac{1}{2} x^{\mathsf{T}} Q x$ with $Q = \begin{pmatrix} 3 & -2 \\ -2 & 3 \end{pmatrix}$. Particles are initialized from a Gaussian distribution with mean $[1, 1]^{\mathsf{T}}$ and covariance $\begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}$.
  15. Numerical examples - Gaussian kernel

     Particle trajectories of ASVGD, SVGD (both with the Gaussian kernel), MALA, and ULD (from left to right) for $V(x, y) = \tfrac{1}{4}(x^4 + y^4)$ (convex, non-Lipschitz) (top), the double-bananas target (middle), and an anisotropic Gaussian target (bottom).
     Double-bananas target: constant high damping $\beta = 0.985$. Other targets: speed restart and gradient restart.
  16. Numerical examples - Bayesian Neural Network (Hernández-Lobato and Adams 2015)

     Train a neural network to minimize the negative log-likelihood loss on UCI datasets.

                      RMSE                          Log-likelihood                  time (seconds)
     Dataset          ASVGD          SVGD           ASVGD           SVGD            ASVGD           SVGD
     Concrete         5.536±0.060    7.349±0.067    −3.135±0.016    −3.439±0.010    14.867±0.040    14.001±0.044
     Energy           0.899±0.057    1.950±0.028    −1.268±0.068    −2.088±0.016    14.870±0.051    13.942±0.035
     Housing          2.346±0.077    2.386±0.048    −2.305±0.020    −2.343±0.014    18.278±0.045    17.214±0.077
     Kin8mn           0.118±0.001    0.165±0.001    0.71±0.011      0.384±0.004     14.859±0.029    14.001±0.033
     Naval            0.005±0.000    0.007±0.000    3.801±0.01      3.504±0.003     20.912±0.57     19.704±0.05
     Power            3.951±0.005    4.035±0.008    −2.799±0.001    −2.825±0.002    12.142±0.053    11.435±0.054
     Protein          4.777±0.007    4.987±0.009    −2.983±0.001    −3.026±0.002    15.616±0.04     14.752±0.047
     Wine             0.185±0.013    0.191±0.017    0.201±0.039     0.146±0.046     18.293±0.097    17.244±0.077

     Table 1: 2000 training iterations, 1 hidden layer with 50 neurons, ReLU activation, Gaussian prior, 10 particles, 20 randomized runs, Gaussian kernel, median heuristic for the kernel width $\sigma^2$, AdaGrad for better gradients, step size $\eta = 10^{-4}$, batch size 100, training ratio 90%, damping $\alpha_k \equiv 0.95$.
  17. Outlook and open questions

     • Find the best kernel and damping parameters without linearization.
     • Conformal symplectic discretization instead of explicit Euler (retains the structure of the continuous dynamics).
     • Investigate the bias of the algorithm.
     • Incorporate an annealing strategy.
     • Finite-particle convergence guarantees.
  18. Thank you for your attention! I am happy to take any questions.

     Extensive follow-up preprint: https://arxiv.org/abs/2509.04008
     Code: https://github.com/ViktorAJStein/Accelerated_Stein_Variational_Gradient_Flows
     My website: https://viktorajstein.github.io
     Thanks to my advisor, Gabriele Steidl, and to the DAAD for funding this trip.
  19. References I

     [1] Alain Durmus and Éric Moulines, "High-dimensional Bayesian inference via the unadjusted Langevin algorithm," Bernoulli, vol. 25, no. 4A, pp. 2854–2882, 2019.
     [2] Paul Hagemann, Sophie Mildenberger, Lars Ruthotto, Gabriele Steidl, and Nicole Tianjiao Yang, "Multilevel diffusion: Infinite dimensional score-based diffusion models for image generation," SIAM Journal on Mathematics of Data Science, vol. 7, no. 3, pp. 1337–1366, 2025.
     [3] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan, "Underdamped Langevin MCMC: A non-asymptotic analysis," in Conference on Learning Theory, PMLR, 2018, pp. 300–323.
     [4] Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton, "Maximum mean discrepancy gradient flow," Advances in Neural Information Processing Systems, vol. 32, pp. 6484–6494, 2019.
     [5] Shi Chen, Qin Li, Oliver Tse, and Stephen J. Wright, "Accelerating optimization over the space of probability measures," Journal of Machine Learning Research, vol. 26, no. 31, pp. 1–40, 2025. [Online]. Available: http://jmlr.org/papers/v26/23-1288.html.
  20. References II

     [6] Simon Duane, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth, "Hybrid Monte Carlo," Physics Letters B, vol. 195, no. 2, pp. 216–222, 1987, ISSN: 0370-2693. DOI: 10.1016/0370-2693(87)91197-X.
     [7] R. D. Mills-Williams, B. D. Goddard, and A. J. Archer, "Dynamic density functional theory with inertia and background flow," The Journal of Chemical Physics, vol. 160, no. 17, p. 174901, May 2024, ISSN: 0021-9606. DOI: 10.1063/5.0208943.
     [8] Anna Korba, Pierre-Cyril Aubin-Frankowski, Szymon Majewski, and Pierre Ablin, "Kernel Stein discrepancy descent," in International Conference on Machine Learning, PMLR, 2021, pp. 5719–5730.
     [9] Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch, "On the geometry of Stein variational gradient descent," Journal of Machine Learning Research, vol. 24, no. 56, pp. 1–39, 2023. [Online]. Available: http://jmlr.org/papers/v24/20-602.html.
  21. References III

     [10] Qiang Liu and Dilin Wang, "Stein variational gradient descent: A general purpose Bayesian inference algorithm," in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS'16, Barcelona, Spain: Curran Associates Inc., 2016, pp. 2378–2386, ISBN: 9781510838819.
     [11] Yurii Nesterov, "A method for solving the convex programming problem with convergence rate O(k^{-2})," Russian, Doklady Akademii Nauk SSSR, vol. 269, p. 543, 1983.
     [12] Amir Beck and Marc Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
     [13] John D. Lafferty, "The density manifold and configuration space quantization," Transactions of the American Mathematical Society, vol. 305, no. 2, pp. 699–741, 1988.
     [14] Andreas Kriegl and Peter W. Michor, The Convenient Setting of Global Analysis. American Mathematical Society, 1997, vol. 53.
  22. References IV

     [15] Richard Jordan, David Kinderlehrer, and Felix Otto, "The variational formulation of the Fokker-Planck equation," SIAM Journal on Mathematical Analysis, vol. 29, no. 1, pp. 1–17, 1998.
     [16] Chris J. Maddison, Daniel Paulin, Yee Whye Teh, Brendan O'Donoghue, and Arnaud Doucet, "Hamiltonian descent methods," arXiv preprint arXiv:1809.05042, 2018.
     [17] Foivos Alimisis, Antonio Orvieto, Gary Bécigneul, and Aurelien Lucchi, "A continuous-time perspective for modeling acceleration in Riemannian optimization," in International Conference on Artificial Intelligence and Statistics, PMLR, 2020, pp. 1297–1307.
     [18] Yifei Wang and Wuchen Li, "Accelerated information gradient flow," Journal of Scientific Computing, vol. 90, pp. 1–47, 2022.
     [19] Weijie Su, Stephen Boyd, and Emmanuel J. Candès, "A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights," Journal of Machine Learning Research, vol. 17, no. 153, pp. 1–43, 2016.
  23. References V

     [20] Tianle Liu, Promit Ghosal, Krishnakumar Balasubramanian, and Natesh Pillai, "Towards understanding the dynamics of Gaussian-Stein variational gradient descent," Advances in Neural Information Processing Systems, vol. 36, 2024.
     [21] José Miguel Hernández-Lobato and Ryan Adams, "Probabilistic backpropagation for scalable learning of Bayesian neural networks," in International Conference on Machine Learning, PMLR, 2015, pp. 1861–1869.