
Avetik Karagulyan
(Université Paris-Saclay, CNRS, CentraleSupélec, L2S)

S³ Seminar
February 07, 2025

Title — ELF: Federated Langevin Algorithms with Primal, Dual and Bidirectional Compression.

Abstract — Federated sampling algorithms have recently gained great popularity in the machine learning and statistics communities. This paper studies variants of such algorithms called Error Feedback Langevin algorithms (ELF). In particular, we analyze the combinations of EF21 and EF21-P with federated Langevin Monte-Carlo. We propose three algorithms: P-ELF, D-ELF, and B-ELF, which use, respectively, primal, dual, and bidirectional compressors. We analyze the proposed methods under the Log-Sobolev inequality and provide non-asymptotic convergence guarantees.

Bio
I am a Research Scientist at CNRS/L2S. Previously, I was a postdoctoral fellow at KAUST in the team of Professor Peter Richtárik. I defended my thesis at the Center for Research in Economics and Statistics (CREST), Paris, under the supervision of Professor Arnak Dalalyan. In 2018, I received my MSc Mathematics, Vision, Learning (MVA) diploma at ENS Paris-Saclay with highest honors (mention “très bien”). I graduated from Yerevan State University’s Faculty of Mathematics and Mechanics in 2017 with excellence. My research focuses on the study of different methods of sampling and their connections to optimization.

Transcript

  1. Langevin sampling, federated learning and their interconnections.
     L2S, 2025. Avetik Karagulyan, CNRS / L2S / Université Paris-Saclay.
     Based on a joint paper with P. Richtárik [KR23].
  2. Approximate integration. Mathematical formulation:
     $\mathbb{E}_\pi[g(X)] = \int_{\mathbb{R}^d} g(x)\,\pi(x)\,dx$. (1)
     Classical solutions:
     • LLN-based methods (importance sampling): $\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} g(X_i)\,\pi(X_i)/\nu(X_i)$, where $X_i \overset{\text{iid}}{\sim} \nu$.
     • Markov chain based methods (MCMC): construct a chain $X_n$ such that $\mathcal{L}(X_n) = \nu_n \approx \pi$, then take $\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} g(X_i)$.
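To make the importance-sampling estimator $\hat{I}_n$ on slide 2 concrete, here is a minimal numpy sketch; the target $\pi = \mathcal{N}(0,1)$, proposal $\nu = \mathcal{N}(0,4)$ and test function $g(x) = x^2$ are illustrative choices, not taken from the talk.

```python
# Minimal sketch of the importance-sampling estimator from slide 2:
# I_hat = (1/n) sum g(X_i) pi(X_i)/nu(X_i), with X_i iid ~ nu.
# Illustrative setup: pi = N(0,1), nu = N(0,4), g(x) = x^2, so E_pi[g] = 1.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

g = lambda x: x ** 2
x = rng.normal(0.0, 2.0, size=n)                     # X_i iid ~ nu = N(0, 4)
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)  # pi(X_i) / nu(X_i)
I_hat = np.mean(g(x) * w)

print(f"importance sampling estimate of E_pi[X^2]: {I_hat:.4f} (exact: 1)")
```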
  4. Sampling. Sampling is widely used in modern machine learning:
     • Bayesian neural networks [IVHW21];
     • Diffusion models [CCL+22, CHIS23];
     • Computer vision [LZBG20];
     • Theoretical ML and generalization [DDB17, RZGS23];
     • Bayesian statistics [Rob07, RC13];
     • Non-convex optimization [RRT17, Lam21].
  5. Formulation.
     • Problem: sample from a given target distribution $\pi$ defined on $\mathbb{R}^d$ with a large value of $d$.
     • More precisely, for a given precision level $\varepsilon$, construct a probability distribution $\mu$ on $\mathbb{R}^d$ which is easy to sample from and satisfies $\mathrm{KL}(\mu \mid \pi) \le \varepsilon$.
     • Important particular case: $\pi$ has a density (w.r.t. the Lebesgue measure) given by $\pi(\theta) \propto \exp(-F(\theta))$, with a “potential” $F : \mathbb{R}^d \to \mathbb{R}$.
  6. KL and FI. We are going to use the following divergences between probability measures.
     • The Kullback-Leibler divergence, defined as
       $\mathrm{KL}(\mu \mid \nu) = \int_{\mathbb{R}^d} \log\frac{\mu(x)}{\nu(x)}\,\mu(dx)$ if $\mu \ll \nu$, and $+\infty$ otherwise. (3)
     • The Fisher information, defined as
       $J(\mu \mid \nu) = \int_{\mathbb{R}^d} \big\|\nabla \log\frac{\mu(x)}{\nu(x)}\big\|^2\,\mu(dx)$ if $\mu \ll \nu$, and $+\infty$ otherwise. (4)
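As a sanity check on definition (3), the following sketch estimates $\mathrm{KL}(\mu \mid \nu)$ for two 1-D Gaussians by Monte Carlo and compares it with the known closed form; the parameter values are illustrative, not from the talk.

```python
# Monte-Carlo estimate of KL(mu | nu) = E_mu[log(mu/nu)] for two 1-D
# Gaussians, checked against the closed form
# KL = log(s2/s1) + (s1^2 + (m1-m2)^2)/(2 s2^2) - 1/2.
import numpy as np

rng = np.random.default_rng(0)
m1, s1 = 0.0, 1.0   # mu = N(m1, s1^2)
m2, s2 = 1.0, 2.0   # nu = N(m2, s2^2)

def log_pdf(x, m, s):
    return -((x - m) ** 2) / (2 * s ** 2) - np.log(s * np.sqrt(2 * np.pi))

x = rng.normal(m1, s1, size=1_000_000)   # x ~ mu
kl_mc = np.mean(log_pdf(x, m1, s1) - log_pdf(x, m2, s2))
kl_exact = np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

print(f"KL Monte Carlo: {kl_mc:.4f}, closed form: {kl_exact:.4f}")
```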
  7. Approximate sampling as optimization. Denote by $\mathcal{P}_2(\mathbb{R}^d)$ the family of square-integrable measures (equipped with the Wasserstein distance):
     $\mathcal{P}_2(\mathbb{R}^d) = \{\mu : \int_{\mathbb{R}^d} \|\theta\|^2\,\mu(d\theta) < +\infty\}$. (5)
     Define the functional $\mathcal{F} : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}_+$ as $\mathcal{F}(\mu) = \mathrm{KL}(\mu \mid \pi)$. (6)
     Approximate sampling thus becomes the minimization of $\mathcal{F}$ over some class $\mathcal{C}$ whose elements are easier to sample from:
     $\hat{\mu} = \arg\min_{\mu \in \mathcal{C}} \mathcal{F}(\mu)$. (7)
  8. Langevin Diffusion.
     • Vanilla Langevin diffusion: $dL_t^{\mathrm{LD}} = -\nabla F(L_t^{\mathrm{LD}})\,dt + \sqrt{2}\,dW_t$. (LD)
       The solution of this equation is a Markov process having $\pi$ as an invariant distribution: if $L_0^{\mathrm{LD}} \sim \pi$, then $L_t^{\mathrm{LD}} \sim \pi$ for all $t > 0$. (8)
     • When the potential function $F$ is $\lambda$-strongly convex, the Markov process is ergodic and its distribution converges to $\pi$ at a linear rate [Bha78]:
       $\mathrm{KL}(\nu_t^{\mathrm{LD}} \mid \pi) \le e^{-\lambda t}\,\mathrm{KL}(\nu_0^{\mathrm{LD}} \mid \pi)$. (9)
  9. Langevin Monte-Carlo. Vanilla Langevin diffusion (integrated):
     $L_\gamma^{\mathrm{LD}} = L_0^{\mathrm{LD}} - \int_0^\gamma \nabla F(L_s^{\mathrm{LD}})\,ds + \sqrt{2}\,W_\gamma$ (10)
     $\approx L_0^{\mathrm{LD}} - \int_0^\gamma \nabla F(L_0^{\mathrm{LD}})\,ds + \sqrt{2}\,W_\gamma$ (11)
     $= L_0^{\mathrm{LD}} - \gamma\,\nabla F(L_0^{\mathrm{LD}}) + \sqrt{2}\,W_\gamma$, (12)
     for a small $\gamma$. Langevin Monte-Carlo (LMC) is defined as:
     $x^{k+1} = x^k - \gamma\,\nabla F(x^k) + \sqrt{2\gamma}\,\xi^{k+1}$, for $k = 0, 1, 2, \dots$ (LMC)
     Here $(\xi^k)_{k\in\mathbb{N}}$ are i.i.d. standard Gaussians, independent of $x^k$. This Markov chain does not preserve $\pi$: $x^k \sim \pi \not\Rightarrow x^{k+1} \sim \pi$. Adding a Metropolis-Hastings correction step, [RT96] proved asymptotic convergence of LMC in total variation.
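To make the (LMC) recursion concrete, here is a minimal numpy sketch on an illustrative strongly log-concave target, $\pi = \mathcal{N}(0, \Sigma)$ with potential $F(x) = \frac{1}{2}x^\top \Sigma^{-1} x$; nothing here beyond the update rule itself comes from the talk.

```python
# Minimal sketch of LMC: x^{k+1} = x^k - gamma*grad F(x^k) + sqrt(2*gamma)*xi,
# on the illustrative Gaussian target pi = N(0, Sigma),
# whose potential is F(x) = x^T Sigma^{-1} x / 2.
import numpy as np

rng = np.random.default_rng(0)
d = 2
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

def grad_F(x):
    return Sigma_inv @ x        # gradient of the quadratic potential

gamma = 0.05                    # step size; should satisfy gamma < 1/L
x = np.zeros(d)
samples = []
for k in range(20_000):
    xi = rng.standard_normal(d)
    x = x - gamma * grad_F(x) + np.sqrt(2 * gamma) * xi
    samples.append(x.copy())

samples = np.array(samples[5_000:])   # discard burn-in
print("empirical covariance:\n", np.cov(samples.T))
print("target covariance:\n", Sigma)
```

The empirical covariance approaches $\Sigma$ up to a discretization bias of order $\gamma$, which is exactly the $\pi_\gamma$-vs-$\pi$ gap discussed on the next slide.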
  10. Unadjusted Langevin Algorithm. [Dal17] bounded the error induced by the discretization: the sequence $\nu_n$ converges to $\pi_\gamma$, the invariant measure of LMC with step size $\gamma$, and the error between $\pi_\gamma$ and $\pi$ is then controlled by choosing $\gamma$ small. (Figure: sketch made with a Remarkable tablet.) This led to a series of works studying LMC in various settings: [DM17], [CB18], [DMM19], [DK19], [CDJB19], etc.
  11. Langevin Monte-Carlo: Theorem.
     Theorem. Suppose $F$ is $\lambda$-strongly convex and $L$-smooth, i.e. $\lambda I_d \preceq \nabla^2 F(x) \preceq L I_d$. If $\gamma < 1/L$, then the following upper bound is satisfied:
     $W_2(\nu_n, \pi) \le (1 - \lambda\gamma)^n\,W_2(\nu_0, \pi) + 1.65\,\kappa\,\sqrt{\gamma d}$, (13)
     where $\nu_n$ is the law of $x^n$ and $\kappa = L/\lambda$ is the condition number.
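Bound (13) is typically read by balancing its two terms. The following back-of-the-envelope tuning is a standard consequence of bounds of this form, written out here as an aid; it is not stated on the slide.

```latex
% Goal: W_2(\nu_n, \pi) \le \varepsilon, splitting the budget in (13).
% Step 1: make the discretization term small via the step size
% (assuming the resulting \gamma also satisfies \gamma < 1/L):
1.65\,\kappa\sqrt{\gamma d} \le \frac{\varepsilon}{2}
  \quad\Longleftrightarrow\quad
  \gamma \le \frac{\varepsilon^2}{4 \cdot 1.65^2\,\kappa^2 d}.
% Step 2: make the contraction term small via the number of iterations:
(1-\lambda\gamma)^n\,W_2(\nu_0,\pi)
  \le e^{-\lambda\gamma n}\,W_2(\nu_0,\pi) \le \frac{\varepsilon}{2}
  \quad\Longleftarrow\quad
  n \ge \frac{1}{\lambda\gamma}\,\log\frac{2\,W_2(\nu_0,\pi)}{\varepsilon}.
% Combining the two steps: n = \tilde{O}\big(\kappa^2 d / (\lambda\,\varepsilon^2)\big)
% iterations suffice.
```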
  12. PL inequality. Suppose that $g$ is $L$-smooth ($\nabla^2 g(x) \preceq L I_d$) and $\min_{x\in\mathbb{R}^d} g(x) = 0$. We say that $g$ satisfies the Polyak-Łojasiewicz inequality if for every $x \in \mathbb{R}^d$:
     $g(x) \le \frac{1}{2\lambda}\,\|\nabla g(x)\|^2$. (PL)
     • If $g$ is $\lambda$-strongly convex, it satisfies PL.
     • If $g$ satisfies the PL inequality and $\gamma < 1/L$, then gradient descent (GD), $x^{k+1} = x^k - \gamma\,\nabla g(x^k)$, satisfies
       $g(x^{k+1}) \le g(x^k) + (-\gamma + L\gamma^2/2)\,\|\nabla g(x^k)\|^2 \le (1 - \lambda\gamma/2)\,g(x^k)$.
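A quick numerical check of the GD contraction on this slide, on an illustrative $\lambda$-strongly convex quadratic (hence PL with the same $\lambda$); the matrix below is an assumption of this sketch, not from the talk.

```python
# Check that g(x^{k+1}) <= (1 - lambda*gamma/2) g(x^k) along GD iterates
# for a strongly convex quadratic g(x) = x^T A x / 2 (lambda = 0.5, L = 4).
import numpy as np

A = np.diag([0.5, 1.0, 4.0])
lam, L = 0.5, 4.0
gamma = 0.9 / L                 # step size gamma < 1/L

g = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.ones(3)
for k in range(5):
    x_new = x - gamma * grad(x)
    ratio = g(x_new) / g(x)
    print(f"k={k}: g ratio = {ratio:.4f}  (bound: {1 - lam * gamma / 2:.4f})")
    x = x_new
```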
  13. Log-Sobolev inequality. The Log-Sobolev inequality (LSI) is the analog of the PL inequality for the functional $\mathcal{F} : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}$.
     Definition. We say that $\pi$ satisfies $\lambda$-LSI if for every $\mu$:
     $\mathcal{F}(\mu) = \mathrm{KL}(\mu \mid \pi) \le \frac{1}{2\lambda}\,J(\mu \mid \pi)$, (14)
     where $J(\cdot \mid \cdot)$ is the Fisher information.
     • If $F$ is $\lambda$-strongly convex, then $\pi$ satisfies $\lambda$-LSI (Bakry-Émery ’85).
     • LSI is stable under Lipschitz maps and bounded perturbations (Holley-Stroock theorem).
     • Vempala and Wibisono proved the convergence of LMC under LSI.
  14. Federated learning. “Federated learning is a machine learning setting where multiple entities (clients) collaborate in solving a machine learning problem, under the coordination of a central server or service provider. Each client’s raw data is stored locally and not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective.” [KMA+21]
     Nowadays, cross-device FL mechanisms are widely used:
     • Medical research [CKLT18, BCM+18];
     • Distributed systems [XIZ+23];
     • The Gboard mobile keyboard, Android Messages, Apple’s Siri [EPK14, Pic19].
  15. Federated learning paradigm. We consider the case where the potential function is sum-decomposable:
     $F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$. (15)
     • 1 server, $n$ clients.
     • Each $f_i$ is stored on client $i$.
     • The clients compute local gradients in parallel.
     • Each client compresses its gradient and sends it to the server.
     • The server aggregates, compresses, and sends back the new iterate in parallel; a minimal sketch of one such round follows below.
     (Figure: federated learning protocol. Source: Wikipedia.)
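A minimal single-round sketch of this protocol, with rand-k sparsification standing in for the compressor on both the uplink and the downlink. This only illustrates the communication pattern; the ELF algorithms in the paper additionally combine compression with EF21/EF21-P error feedback, which is omitted here, and the quadratic local potentials are purely illustrative.

```python
# One round of compressed federated Langevin sampling (illustrative only):
# clients compress local gradients (dual/uplink), the server averages them,
# takes a Langevin step, and compresses the broadcast (primal/downlink).
import numpy as np

rng = np.random.default_rng(0)
d, n, k, gamma = 10, 4, 3, 0.1

def rand_k(v, k):
    """Keep k uniformly random coordinates of v; contractive with alpha = k/d."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx]
    return out

# illustrative quadratic local potentials f_i(x) = ||x - b_i||^2 / 2
b = rng.standard_normal((n, d))
grad_f = lambda i, x: x - b[i]

x = np.zeros(d)
# clients: compute local gradients, compress, send to the server (uplink)
msgs = [rand_k(grad_f(i, x), k) for i in range(n)]
# server: aggregate, take a Langevin step, compress the broadcast (downlink)
g_hat = np.mean(msgs, axis=0)
x_new = x - gamma * g_hat + np.sqrt(2 * gamma) * rng.standard_normal(d)
x = x + rand_k(x_new - x, k)
print("iterate after one round:", np.round(x, 3))
```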
  16. Federated learning paradigm. We consider the case where the potential function is sum-decomposable:
     $F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$. (16)
     See [KMY+16, KMA+21] for details.
  17. Optimization → sampling. Recall the LMC algorithm:
     $x^{k+1} = \underbrace{x^k - \gamma\,\nabla F(x^k)}_{\text{gradient descent}} + \underbrace{\sqrt{2\gamma}\,\xi^k}_{\text{noise}}$. (17)
     In particular, federated learning algorithms can be turned into sampling algorithms by adding Gaussian noise to the update:
     • LMC + generic SGD [DK19]
     • LMC + SVRG [CFM+18]
     • LMC + Proximal GD [BDMS19]
     • LMC + Mirror GD [HKRC18]
     • LMC + FedAvg [PMD23]
     • LMC + MARINA [SSR22]
     • LMC + QSGD [VPD+22]
     • LMC + EF21 + EF21-P [KR23]
  18. Compression.
     Definition (Contractive compressor). A stochastic mapping $Q : \mathbb{R}^d \to \mathbb{R}^d$ is a contractive compression operator with a coefficient $\alpha \in (0, 1]$ if for any $x \in \mathbb{R}^d$,
     $\mathbb{E}\,\|Q(x) - x\|^2 \le (1 - \alpha)\,\|x\|^2$.
     We denote this shortly as $Q \in \mathbb{B}(\alpha)$.
     • Top-k keeps the $k$ coordinates with the largest absolute values and zeroes out the rest. Example: for $x = (-4, 3, 10, -1)$ we have $Q_{\text{top-2}}(x) = (-4, 0, 10, 0)$.
     • The compressor may be biased.
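The Top-k compressor from this slide in a few lines of numpy, together with a numerical check of the contraction property; Top-k is deterministic, so no expectation is needed, and its coefficient is $\alpha = k/d$ in the worst case (hence $\alpha = 1/d$ for Top-1, as noted on slide 20).

```python
# Top-k compressor and a check of ||Q(x) - x||^2 <= (1 - alpha)||x||^2
# with alpha = k/d (worst case attained for equal-magnitude entries).
import numpy as np

def top_k(x, k):
    """Keep the k entries of x with largest absolute value, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

x = np.array([-4.0, 3.0, 10.0, -1.0])
print("Q_top-2(x) =", top_k(x, 2))     # (-4, 0, 10, 0), as on the slide

rng = np.random.default_rng(0)
d, k = 20, 5
alpha = k / d
for _ in range(3):
    v = rng.standard_normal(d)
    lhs = np.sum((top_k(v, k) - v) ** 2)
    rhs = (1 - alpha) * np.sum(v ** 2)
    print(f"||Q(v)-v||^2 = {lhs:.3f} <= (1-alpha)||v||^2 = {rhs:.3f}")
```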
  19. Main theorem.
     Theorem. Assume that LSI holds with constant $\lambda > 0$ and let $x^k$ be the iterates of the B-ELF algorithm. Denote $\rho_k := \mathcal{D}(x^k)$ (the law of $x^k$) for every $k \in \mathbb{N}$. If each function $f_i$ is $L_i$-smooth, then for a small enough step size $\gamma$ the following bound holds for the KL error of B-ELF:
     $\mathrm{KL}(\rho_K \mid \pi) \le e^{-\lambda K \gamma}\,\Psi + \frac{\tau(\gamma, d)}{\lambda}$,
     where $\Psi$ and $\tau$ depend explicitly on the parameters of the problem. Taking $\gamma$ small enough makes the second term small; taking $K$ large enough then reduces the first.
  20. Discussion.
     • We do not assume strong convexity of the potential; instead we assume the Log-Sobolev inequality, which is the analog of the PL inequality.
     • To obtain $\varepsilon$ KL error, D-ELF and P-ELF need $\tilde{O}\big(d/(\lambda^2\alpha^2\varepsilon)\big)$ iterations.
     • To obtain $\varepsilon$ KL error, B-ELF needs $\tilde{O}\big(d/(\lambda^2\alpha^4\varepsilon)\big)$ iterations.
     • The contraction coefficient for Top-1 is $\alpha = 1/d$.
     • In practice, the algorithms are significantly faster.
     (Figure: test accuracy vs. communicated bits for B-ELF, P-ELF, D-ELF and LMC with Top-10 compression, on the a9a and mushrooms datasets.)
  21. References
     [BCM+18] Theodora S. Brisimi, Ruidi Chen, Theofanie Mela, Alex Olshevsky, Ioannis Ch. Paschalidis, and Wei Shi. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112:59–67, 2018.
     [BDMS19] Nicolas Brosse, Alain Durmus, Éric Moulines, and Sotirios Sabanis. The tamed unadjusted Langevin algorithm. Stochastic Processes and their Applications, 129(10):3638–3663, 2019.
     [Bha78] R. N. Bhattacharya. Criteria for recurrence and existence of invariant measures for multidimensional diffusions. Ann. Probab., 6(4):541–553, 1978.
     [CB18] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of ALT 2018, 2018.
     [CCL+22] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.
     [CDJB19] Niladri S. Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter L. Bartlett. Langevin Monte Carlo without smoothness. arXiv preprint arXiv:1905.13285, 2019.
     [CFM+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In International Conference on Machine Learning, pages 764–773. PMLR, 2018.
     [CHIS23] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
     [CKLT18] Rachel Cummings, Sara Krehbiel, Kevin A. Lai, and Uthaipon Tantipongpipat. Differential privacy for growing databases. Advances in Neural Information Processing Systems, 31, 2018.
     [Dal17] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from a smooth and log-concave density. J. R. Stat. Soc. B, 79:651–676, 2017.
     [DDB17] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv preprint arXiv:1707.06386, 2017.
     [DK19] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.
     [DM17] Alain Durmus and Eric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017.
  22. [DMM19] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. J. Mach. Learn. Res., 20:73–1, 2019.
     [EPK14] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.
     [HKRC18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored Langevin dynamics. Advances in Neural Information Processing Systems, 31, 2018.
     [IVHW21] Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, and Andrew Gordon Gordon Wilson. What are Bayesian neural network posteriors really like? In International Conference on Machine Learning, pages 4629–4640. PMLR, 2021.
     [KMA+21] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
     [KMY+16] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
     [KR23] Avetik Karagulyan and Peter Richtárik. ELF: Federated Langevin algorithms with primal, dual and bidirectional compression. arXiv preprint arXiv:2303.04622, 2023.
     [Lam21] Andrew Lamperski. Projected stochastic gradient Langevin algorithms for constrained sampling and non-convex learning. In Conference on Learning Theory, pages 2891–2937. PMLR, 2021.
     [LZBG20] Fujun Luan, Shuang Zhao, Kavita Bala, and Ioannis Gkioulekas. Langevin Monte Carlo rendering with gradient-based adaptation. ACM Trans. Graph., 39(4):140, 2020.
     [Pic19] Sundar Pichai. Privacy should not be a luxury good. The New York Times, 8:25, 2019.
     [PMD23] Vincent Plassier, Eric Moulines, and Alain Durmus. Federated averaging Langevin dynamics: Toward a unified theory and new algorithms. In International Conference on Artificial Intelligence and Statistics, pages 5299–5356. PMLR, 2023.
     [RC13] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2013.
     [Rob07] Christian Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. New York: Springer, 2007.
     [RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1674–1703, 2017.
  23. [RT96] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
     [RZGS23] Anant Raj, Lingjiong Zhu, Mert Gurbuzbalaban, and Umut Simsekli. Algorithmic stability of heavy-tailed SGD with general loss functions. In International Conference on Machine Learning, pages 28578–28597. PMLR, 2023.
     [SSR22] Lukang Sun, Adil Salim, and Peter Richtárik. Federated learning with a sampling algorithm under isoperimetry. arXiv preprint arXiv:2206.00920, 2022.
     [VPD+22] Maxime Vono, Vincent Plassier, Alain Durmus, Aymeric Dieuleveut, and Eric Moulines. QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 6459–6500. PMLR, 2022.
     [XIZ+23] Jihao Xin, Ivan Ilin, Shunkang Zhang, Marco Canini, and Peter Richtárik. Kimad: Adaptive gradient compression with bandwidth awareness. In Proceedings of DistributedML ’23, December 2023.