Slide 1

Langevin sampling, federated learning and their interconnections
L2S, 2025
Avetik Karagulyan, CNRS / L2S / Université Paris-Saclay
Based on a joint paper with P. Richtárik.

Slide 2

Statistical learning
Figure: Made with Remarkable tablet.

Slide 3

Sampling

Slide 4

Approximate integration
Mathematical formulation:
E_π[g(X)] = ∫_{R^d} g(x)π(x) dx. (1)
Classical solutions:
• LLN-based methods (importance sampling): Î_n = (1/n) Σ_{i=1}^n g(X_i)π(X_i)/ν(X_i), where the X_i are i.i.d. ∼ ν.
• Markov chain based methods (MCMC): construct a chain X_n such that L(X_n) = ν_n ≈ π, then take Î_n = (1/n) Σ_{i=1}^n g(X_i).
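To make the LLN-based estimator concrete, here is a minimal sketch (our illustration; the target, test function, and proposal are arbitrary choices): π = N(0, 1), g(x) = x², so E_π[g(X)] = 1, with proposal ν = N(0, 4).

```python
import numpy as np

rng = np.random.default_rng(0)

# Target pi = N(0, 1) and test function g(x) = x^2, so E_pi[g(X)] = 1.
g = lambda x: x**2
pi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Proposal nu = N(0, sigma^2) with sigma = 2 (heavier tails than the target).
sigma = 2.0
nu = lambda x: np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

n = 100_000
X = rng.normal(0.0, sigma, size=n)       # X_i i.i.d. ~ nu
I_hat = np.mean(g(X) * pi(X) / nu(X))    # importance sampling estimator
print(I_hat)                             # ~ 1.0
```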

Slide 6

Sampling
Sampling is widely used in modern machine learning:
• Bayesian neural networks [IVHW21];
• diffusion models [CCL+22, CHIS23];
• computer vision [LZBG20];
• theoretical ML and generalization [DDB17, RZGS23];
• Bayesian statistics [Rob07, RC13];
• non-convex optimization [RRT17, Lam21].

Slide 7

Formulation
• Problem: sample from a given target distribution π defined on R^d, with a large value of d.
• More precisely, for a given precision level ε, construct a probability distribution µ on R^d that is easy to sample from and satisfies KL(µ | π) ≤ ε.
• Important particular case: π has a density (w.r.t. the Lebesgue measure) given by π(θ) ∝ exp(−F(θ)), with a “potential” F : R^d → R.

Slide 8

KL and FI
We are going to use the following distances between probability measures.
• The Kullback–Leibler divergence, defined as
KL(µ | ν) = ∫_{R^d} log(dµ/dν)(x) µ(dx), if µ ≪ ν; +∞ otherwise. (3)
• The Fisher information, defined as
J(µ | ν) = ∫_{R^d} ‖∇ log(dµ/dν)(x)‖² µ(dx), if µ ≪ ν; +∞ otherwise. (4)
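As a quick sanity check (a worked example of ours, not on the slide), both quantities are explicit for a Gaussian mean shift:

```latex
% For \mu = N(m, I_d) and \nu = N(0, I_d):
\log\frac{d\mu}{d\nu}(x) = \langle x, m \rangle - \tfrac12\|m\|^2,
\qquad
\mathrm{KL}(\mu \,|\, \nu) = \tfrac12\|m\|^2,
\qquad
J(\mu \,|\, \nu) = \|m\|^2 .
% Note KL = J/2: the standard Gaussian saturates the 1-LSI introduced later.
```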

Slide 9

Approximate sampling as optimization
Denote by P_2(R^d) the family of probability measures with finite second moment (equipped with the Wasserstein distance):
P_2(R^d) = {µ : ∫_{R^d} ‖θ‖² µ(dθ) < +∞}. (5)
Define the functional F : P_2(R^d) → R_+ as
F(µ) = KL(µ | π). (6)
Approximate sampling then becomes the minimization of F over some class C whose elements are easier to sample from:
µ̂ = argmin_{µ∈C} F(µ). (7)

Slide 10

Langevin sampling

Slide 11

Langevin Diffusion
• Vanilla Langevin diffusion:
dL_t^{LD} = −∇F(L_t^{LD}) dt + √2 dW_t. (LD)
The solution of this equation is a Markov process having π as an invariant distribution:
if L_0^{LD} ∼ π, then L_t^{LD} ∼ π for all t > 0. (8)
• When the potential F is λ-strongly convex, the Markov process is ergodic and its distribution converges to π at an exponential (“linear”, in optimization terminology) rate [Bha78]:
KL(ν_t^{LD} | π) ≤ e^{−λt} KL(ν_0^{LD} | π). (9)

Slide 12

Langevin Monte-Carlo
Vanilla Langevin diffusion (integrated):
L_γ^{LD} = L_0^{LD} − ∫_0^γ ∇F(L_s^{LD}) ds + √2 W_γ (10)
         ≈ L_0^{LD} − ∫_0^γ ∇F(L_0^{LD}) ds + √2 W_γ (11)
         = L_0^{LD} − γ∇F(L_0^{LD}) + √2 W_γ, (12)
for a small γ. Langevin Monte-Carlo (LMC) is defined as
x_{k+1} = x_k − γ∇F(x_k) + √(2γ) ξ_{k+1}; k = 0, 1, 2, . . . (LMC)
where (ξ_k)_{k∈N} are i.i.d. standard Gaussians, independent of x_k. This Markov chain does not preserve π: x_k ∼ π does not imply x_{k+1} ∼ π. [RT96] proved asymptotic convergence in total variation of LMC combined with a Metropolis–Hastings correction step.
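A minimal sketch of (LMC) on a toy target (our illustration; the potential, step size, horizon, and burn-in are arbitrary choices): F(x) = ‖x‖²/2, so that π = N(0, I_d).

```python
import numpy as np

def lmc(grad_F, x0, gamma, n_iters, rng=None):
    """Run the unadjusted Langevin Monte-Carlo chain
    x_{k+1} = x_k - gamma * grad_F(x_k) + sqrt(2*gamma) * xi_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_iters):
        xi = rng.standard_normal(x.shape)
        x = x - gamma * grad_F(x) + np.sqrt(2.0 * gamma) * xi
        samples.append(x.copy())
    return np.stack(samples)

# Toy target: F(x) = ||x||^2 / 2, i.e. pi = N(0, I_d).
d = 2
chain = lmc(grad_F=lambda x: x, x0=np.zeros(d), gamma=0.05, n_iters=10_000)
burned = chain[2_000:]                      # discard a burn-in phase
print(burned.mean(axis=0), burned.var(axis=0))  # ~0 and ~1 per coordinate
```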

Slide 13

Unadjusted Langevin Algorithm
[Dal17] bounded the error induced by the discretization. That is, they show that the sequence ν_n converges to π_γ, the invariant measure of LMC with step size γ, and then control the error between π_γ and π by choosing γ small.
Figure: Made with Remarkable tablet.
This led to a series of works studying LMC in various settings: [DM17], [CB18], [DMM19], [DK19], [CDJB19], etc.

Slide 14

Langevin Monte-Carlo: Theorem
Theorem. Suppose F is λ-strongly convex and L-smooth, i.e., λI_d ⪯ ∇²F(x) ⪯ LI_d. If γ < 1/L, then the following upper bound holds:
W_2(ν_n, π) ≤ (1 − λγ)^n W_2(ν_0, π) + 1.65 κ (γd)^{1/2}, (13)
where ν_n is the law of x_n and κ = L/λ is the condition number.
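A short consequence, worked out from (13) (our arithmetic; constants are indicative): to guarantee W_2(ν_n, π) ≤ ε, make each term at most ε/2.

```latex
1.65\,\kappa\,(\gamma d)^{1/2} \le \tfrac{\varepsilon}{2}
\;\Longleftrightarrow\;
\gamma \le \frac{\varepsilon^2}{(3.3)^2\,\kappa^2 d},
\qquad
(1-\lambda\gamma)^n\, W_2(\nu_0,\pi) \le \tfrac{\varepsilon}{2}
\;\Longleftarrow\;
n \ge \frac{1}{\lambda\gamma}\,\log\frac{2\,W_2(\nu_0,\pi)}{\varepsilon},
```

so n = Õ(κ²d/(λε²)) iterations suffice.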

Slide 15

Relaxing strong convexity

Slide 16

PL inequality
Suppose that g is L-smooth (∇²g(x) ⪯ LI_d) and min_{x∈R^d} g(x) = 0. We say that g satisfies the Polyak–Łojasiewicz (PL) inequality if, for every x ∈ R^d,
g(x) ≤ (1/(2λ)) ‖∇g(x)‖². (PL)
• If g is λ-strongly convex, it satisfies PL.
• If g satisfies the PL inequality and γ < 1/L, then gradient descent (GD), x_{k+1} = x_k − γ∇g(x_k), satisfies
g(x_{k+1}) ≤ g(x_k) + (−γ + Lγ²/2) ‖∇g(x_k)‖² ≤ (1 − λγ/2) g(x_k).
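A numerical illustration (ours): the classical non-convex PL example g(x) = x² + 3 sin²(x), reported in the PL literature to satisfy (PL) with λ = 1/32, together with L = 8; we check the per-step GD decrease from the slide.

```python
import numpy as np

# g(x) = x^2 + 3 sin(x)^2 is non-convex but satisfies PL; we take
# lambda = 1/32 (a constant reported in the PL literature for this g)
# and L = 8, since |g''(x)| = |2 + 6 cos(2x)| <= 8.
g = lambda x: x**2 + 3.0 * np.sin(x)**2
grad_g = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

lam, L = 1.0 / 32.0, 8.0
gamma = 0.9 / L                      # step size gamma < 1/L

x = 3.0
for k in range(200):
    x_new = x - gamma * grad_g(x)
    # Per-step guarantee from the slide: g(x_{k+1}) <= (1 - lam*gamma/2) g(x_k)
    assert g(x_new) <= (1.0 - lam * gamma / 2.0) * g(x) + 1e-12
    x = x_new
print(x, g(x))                       # -> near the global minimum g(0) = 0
```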

Slide 17

Log-Sobolev inequality
The log-Sobolev inequality (LSI) is the analog of the PL inequality for the functional F : P_2(R^d) → R.
Definition. We say that π satisfies the λ-LSI if, for every µ,
F(µ) = KL(µ | π) ≤ (1/(2λ)) J(µ | π), (14)
where J(· | ·) is the Fisher information.
• If F is λ-strongly convex, then π satisfies the λ-LSI (Bakry–Émery ’85). LSI is stable under Lipschitz maps and is preserved under bounded perturbations of the potential (Holley–Stroock theorem).
• Vempala and Wibisono proved the convergence of LMC under LSI.

Slide 18

Federated Learning

Slide 19

Federated learning
“Federated learning is a machine learning setting where multiple entities (clients) collaborate in solving a machine learning problem, under the coordination of a central server or service provider. Each client’s raw data is stored locally and not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective.” — [KMA+21]
Nowadays, cross-device FL mechanisms are widely used:
• medical research [CKLT18, BCM+18];
• distributed systems [XIZ+23];
• Gboard mobile keyboard, Android Messages, Apple’s Siri [EPK14, Pic19].

Slide 20

Federated learning paradigm
We consider the case when the potential function is sum-decomposable:
F(x) = (1/n) Σ_{i=1}^n f_i(x). (15)
• 1 server, n clients; each f_i is stored on client i.
• The clients compute their local gradients in parallel, compress them, and send them to the server.
• The server aggregates, compresses, and sends the new iterate back to all clients in parallel. A sketch of one such round is given below.
Figure: Federated learning protocol. Source: Wiki
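A minimal sketch of one round of this protocol (our toy code; all names and the toy problem are ours, and the compressor is left as a plug-in argument, here the identity):

```python
import numpy as np

def fl_round(x, client_grads, compress, gamma):
    """One illustrative round: each client computes and compresses its local
    gradient; the server averages the messages and takes a gradient step."""
    msgs = [compress(grad(x)) for grad in client_grads]  # uplink, in parallel
    avg = np.mean(msgs, axis=0)                          # server aggregation
    return x - gamma * avg                               # new iterate, broadcast

# Toy problem: f_i(x) = x^T A_i x / 2, so grad f_i(x) = A_i x.
n, d = 4, 3
As = [np.diag(np.arange(1.0, d + 1) + i) for i in range(n)]
grads = [lambda x, A=A: A @ x for A in As]

x = np.ones(d)
for _ in range(100):
    x = fl_round(x, grads, compress=lambda v: v, gamma=0.1)  # no compression
print(x)  # -> near 0, the minimizer of F = (1/n) sum_i f_i
```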

Slide 21

Federated learning paradigm
We consider the case when the potential function is sum-decomposable:
F(x) = (1/n) Σ_{i=1}^n f_i(x). (16)
See [KMY+16, KMA+21] for details.

Slide 22

Speed comparison

Slide 23

Optimization → sampling
Let us recall the LMC algorithm:
x_{k+1} = x_k − γ∇F(x_k) [gradient descent] + √(2γ) ξ_k [noise]. (17)
In particular, federated learning algorithms can be used for sampling by adding Gaussian noise (see the sketch after this list).
• LMC + generic SGD [DK19]
• LMC + SVRG [CFM+18]
• LMC + proximal GD [BDMS19]
• LMC + mirror GD [HKRC18]
• LMC + FedAvg [PMD23]
• LMC + MARINA [SSR22]
• LMC + QSGD [VPD+22]
• LMC + EF21 + EF21-P [KR23]
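Continuing the earlier federated sketch (our illustration, not any specific published algorithm): the same round, followed by the Langevin noise, turns the distributed optimizer into a distributed sampler.

```python
import numpy as np

def federated_lmc_round(x, client_grads, compress, gamma, rng):
    """One federated LMC round: the fl_round step from the earlier sketch,
    followed by the Langevin noise injection."""
    msgs = [compress(grad(x)) for grad in client_grads]   # compressed gradients
    x = x - gamma * np.mean(msgs, axis=0)                 # optimization part
    return x + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)  # noise
```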

Slide 24

Compression
Definition (Contractive compressor). A stochastic mapping Q : R^d → R^d is a contractive compression operator with coefficient α ∈ (0, 1] if, for any x ∈ R^d,
E ‖Q(x) − x‖² ≤ (1 − α) ‖x‖².
We denote this shortly as Q ∈ B(α).
• Top-k keeps the k coordinates with the largest absolute values and zeroes out the rest. Example: for x = (−4, 3, 10, −1) we have Q_top-2(x) = (−4, 0, 10, 0).
• The compressor can be biased.
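A sketch of Top-k (ours), together with a check of the contraction property, which Top-k in fact satisfies deterministically with α = k/d:

```python
import numpy as np

def top_k(x, k):
    """Top-k compressor: keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]   # indices of the k largest |entries|
    out[idx] = x[idx]
    return out

x = np.array([-4.0, 3.0, 10.0, -1.0])
q = top_k(x, 2)
print(q)  # [-4.  0. 10.  0.], matching the slide's example

# Contraction with alpha = k/d (holds deterministically for Top-k):
d, k = x.size, 2
assert np.sum((q - x) ** 2) <= (1 - k / d) * np.sum(x ** 2)
```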

Slide 25

ELF = Error Feedback + Langevin
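The slide carries only the title; as a rough illustration of the idea, here is a generic EF14-style error-feedback step combined with the Langevin update, in the single-node, primal-compression case (our sketch; this is not the exact D/P/B-ELF algorithms of [KR23]):

```python
import numpy as np

def ef_langevin_step(x, e, grad_F, compress, gamma, rng):
    """Generic error feedback + Langevin step (illustrative sketch only):
    compress the error-corrected gradient step, keep the residual,
    then inject the Langevin noise."""
    p = e + gamma * grad_F(x)        # error-corrected step
    m = compress(p)                  # what actually gets communicated
    e = p - m                        # residual, fed back next round
    x = x - m + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
    return x, e
```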

Slide 27

Main theorem
Theorem. Assume that LSI holds with constant λ > 0 and let x_k be the iterates of the B-ELF algorithm. Denote by ρ_k := D(x_k) the law of x_k, for every k ∈ N. If each function f_i is L_i-smooth, then for a small enough step size γ the following holds for the KL error of the B-ELF algorithm:
KL(ρ_K | π) ≤ e^{−λKγ} Ψ + τ(γ, d)/λ,
where Ψ and τ depend explicitly on the parameters of the problem.
Taking γ small enough, we can make the second term small; we then take K large enough to reduce the first one.

Slide 28

Discussion
• We do not assume strong convexity of the potential. Instead, we assume the log-Sobolev inequality, which is the analog of the PL inequality.
• To reach ε KL error, D-ELF and P-ELF need Õ(d/(λ²α²ε)) iterations.
• To reach ε KL error, B-ELF needs Õ(d/(λ²α⁴ε)) iterations.
• The contraction coefficient of top-1 is α = 1/d.
• In practice, the algorithms are significantly faster.
Figure: Test accuracy vs. number of communicated bits for B-ELF, P-ELF, D-ELF and LMC on the a9a and mushrooms datasets (Top-10 compressor).

Slide 29

This is the last slide. Thank you!

Slide 30

References
[BCM+18] Theodora S. Brisimi, Ruidi Chen, Theofanie Mela, Alex Olshevsky, Ioannis Ch. Paschalidis, and Wei Shi. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112:59–67, 2018.
[BDMS19] Nicolas Brosse, Alain Durmus, Éric Moulines, and Sotirios Sabanis. The tamed unadjusted Langevin algorithm. Stochastic Processes and their Applications, 129(10):3638–3663, 2019.
[Bha78] R. N. Bhattacharya. Criteria for recurrence and existence of invariant measures for multidimensional diffusions. Ann. Probab., 6(4):541–553, 1978.
[CB18] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of ALT 2018, 2018.
[CCL+22] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.
[CDJB19] Niladri S. Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter L. Bartlett. Langevin Monte Carlo without smoothness. arXiv preprint arXiv:1905.13285, 2019.
[CFM+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In International Conference on Machine Learning, pages 764–773. PMLR, 2018.
[CHIS23] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[CKLT18] Rachel Cummings, Sara Krehbiel, Kevin A. Lai, and Uthaipon Tantipongpipat. Differential privacy for growing databases. Advances in Neural Information Processing Systems, 31, 2018.
[Dal17] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from a smooth and log-concave density. J. R. Stat. Soc. B, 79:651–676, 2017.
[DDB17] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv preprint arXiv:1707.06386, 2017.
[DK19] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.
[DM17] Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017.

Slide 31

[DMM19] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. J. Mach. Learn. Res., 20(73):1–46, 2019.
[EPK14] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.
[HKRC18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored Langevin dynamics. Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 2018.
[IVHW21] Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, and Andrew Gordon Gordon Wilson. What are Bayesian neural network posteriors really like? In International Conference on Machine Learning, pages 4629–4640. PMLR, 2021.
[KMA+21] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
[KMY+16] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
[KR23] Avetik Karagulyan and Peter Richtárik. ELF: Federated Langevin algorithms with primal, dual and bidirectional compression. arXiv preprint arXiv:2303.04622, 2023.
[Lam21] Andrew Lamperski. Projected stochastic gradient Langevin algorithms for constrained sampling and non-convex learning. In Conference on Learning Theory, pages 2891–2937. PMLR, 2021.
[LZBG20] Fujun Luan, Shuang Zhao, Kavita Bala, and Ioannis Gkioulekas. Langevin Monte Carlo rendering with gradient-based adaptation. ACM Trans. Graph., 39(4):140, 2020.
[Pic19] Sundar Pichai. Privacy should not be a luxury good. The New York Times, 8:25, 2019.
[PMD23] Vincent Plassier, Éric Moulines, and Alain Durmus. Federated averaging Langevin dynamics: Toward a unified theory and new algorithms. In International Conference on Artificial Intelligence and Statistics, pages 5299–5356. PMLR, 2023.
[RC13] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2013.
[Rob07] Christian Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer, New York, 2007.
[RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1674–1703, 2017.

Slide 32

[RT96] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
[RZGS23] Anant Raj, Lingjiong Zhu, Mert Gurbuzbalaban, and Umut Simsekli. Algorithmic stability of heavy-tailed SGD with general loss functions. In International Conference on Machine Learning, pages 28578–28597. PMLR, 2023.
[SSR22] Lukang Sun, Adil Salim, and Peter Richtárik. Federated learning with a sampling algorithm under isoperimetry. arXiv preprint arXiv:2206.00920, 2022.
[VPD+22] Maxime Vono, Vincent Plassier, Alain Durmus, Aymeric Dieuleveut, and Éric Moulines. QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 6459–6500. PMLR, 2022.
[XIZ+23] Jihao Xin, Ivan Ilin, Shunkang Zhang, Marco Canini, and Peter Richtárik. Kimad: Adaptive gradient compression with bandwidth awareness. In Proceedings of DistributedML ’23, December 2023.