
Avetik Karagulyan
(Université Paris-Saclay, CNRS, CentraleSupélec, L2S)

S³ Seminar
February 07, 2025

Title — ELF: Federated Langevin Algorithms with Primal, Dual and Bidirectional Compression.

Abstract — Federated sampling algorithms have recently gained great popularity in the machine learning and statistics communities. This paper studies variants of such algorithms called Error Feedback Langevin algorithms (ELF). In particular, we analyze the combinations of EF21 and EF21-P with federated Langevin Monte-Carlo. We propose three algorithms: P-ELF, D-ELF, and B-ELF, which use, respectively, primal, dual, and bidirectional compressors. We analyze the proposed methods under the Log-Sobolev inequality and provide non-asymptotic convergence guarantees.

Bio
I am a Research Scientist at CNRS/L2S. Previously, I was a postdoctoral fellow at KAUST in the team of Professor Peter Richtárik. I defended my thesis at the Center for Research in Economics and Statistics (CREST), Paris, under the supervision of Professor Arnak Dalalyan. In 2018, I received my MSc Mathematics, Vision, Learning (MVA) diploma at ENS Paris-Saclay with highest honors (mention “très bien”). I graduated from Yerevan State University’s Faculty of Mathematics and Mechanics in 2017 with excellence. My research focuses on the study of different methods of sampling and their connections to optimization.

Transcript

  1. Langevin sampling, federated learning and their interconnections.
     L2S, 2025. Avetik Karagulyan, CNRS / L2S / Université Paris-Saclay.
     Based on a joint paper with P. Richtárik [KR23].
  2. Approximate integration. Mathematical formulation:
     $\mathbb{E}_\pi[g(X)] = \int_{\mathbb{R}^d} g(x)\,\pi(x)\,dx$. (1)
     Classical solutions:
     • LLN-based methods (importance sampling): $\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} g(X_i)\,\pi(X_i)/\nu(X_i)$, where $X_i \overset{\text{iid}}{\sim} \nu$.
     • Markov chain based methods (MCMC): construct a chain $X_n$ such that $\mathcal{L}(X_n) = \nu_n \approx \pi$, then take $\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} g(X_i)$.
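To make the importance-sampling estimator $\hat{I}_n$ on slide 2 concrete, here is a minimal numpy sketch; the target $\pi = \mathcal{N}(0,1)$, proposal $\nu = \mathcal{N}(0,4)$ and test function $g(x) = x^2$ are illustrative choices, not taken from the talk.

```python
# Minimal sketch of the importance-sampling estimator from slide 2:
# I_hat = (1/n) sum g(X_i) pi(X_i)/nu(X_i), with X_i iid ~ nu.
# Illustrative setup: pi = N(0,1), nu = N(0,4), g(x) = x^2, so E_pi[g] = 1.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

g = lambda x: x ** 2
x = rng.normal(0.0, 2.0, size=n)                     # X_i iid ~ nu = N(0, 4)
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)  # pi(X_i) / nu(X_i)
I_hat = np.mean(g(x) * w)

print(f"importance sampling estimate of E_pi[X^2]: {I_hat:.4f} (exact: 1)")
```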
  4. Sampling. Sampling is widely used in modern machine learning:
     • Bayesian neural networks [IVHW21];
     • Diffusion models [CCL+22, CHIS23];
     • Computer vision [LZBG20];
     • Theoretical ML and generalization [DDB17, RZGS23];
     • Bayesian statistics [Rob07, RC13];
     • Non-convex optimization [RRT17, Lam21].
  5. Formulation.
     • Problem: sample from a given target distribution $\pi$ defined on $\mathbb{R}^d$ with a large value of $d$.
     • More precisely, for a given precision level $\varepsilon$, construct a probability distribution $\mu$ on $\mathbb{R}^d$ which is easy to sample from and satisfies $\mathrm{KL}(\mu \mid \pi) \le \varepsilon$.
     • Important particular case: $\pi$ has a density (w.r.t. the Lebesgue measure) given by $\pi(\theta) \propto \exp(-F(\theta))$, with a “potential” $F : \mathbb{R}^d \to \mathbb{R}$.
  6. KL and FI. We are going to use the following divergences between probability measures.
     • The Kullback-Leibler divergence, defined as
       $\mathrm{KL}(\mu \mid \nu) = \int_{\mathbb{R}^d} \log\frac{\mu(x)}{\nu(x)}\,\mu(dx)$ if $\mu \ll \nu$, and $+\infty$ otherwise. (3)
     • The Fisher information, defined as
       $J(\mu \mid \nu) = \int_{\mathbb{R}^d} \big\|\nabla \log\frac{\mu(x)}{\nu(x)}\big\|^2\,\mu(dx)$ if $\mu \ll \nu$, and $+\infty$ otherwise. (4)
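As a sanity check on definition (3), the following sketch estimates $\mathrm{KL}(\mu \mid \nu)$ for two 1-D Gaussians by Monte Carlo and compares it with the known closed form; the parameter values are illustrative, not from the talk.

```python
# Monte-Carlo estimate of KL(mu | nu) = E_mu[log(mu/nu)] for two 1-D
# Gaussians, checked against the closed form
# KL = log(s2/s1) + (s1^2 + (m1-m2)^2)/(2 s2^2) - 1/2.
import numpy as np

rng = np.random.default_rng(0)
m1, s1 = 0.0, 1.0   # mu = N(m1, s1^2)
m2, s2 = 1.0, 2.0   # nu = N(m2, s2^2)

def log_pdf(x, m, s):
    return -((x - m) ** 2) / (2 * s ** 2) - np.log(s * np.sqrt(2 * np.pi))

x = rng.normal(m1, s1, size=1_000_000)   # x ~ mu
kl_mc = np.mean(log_pdf(x, m1, s1) - log_pdf(x, m2, s2))
kl_exact = np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

print(f"KL Monte Carlo: {kl_mc:.4f}, closed form: {kl_exact:.4f}")
```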
  7. Approximate sampling as optimization. Denote by $\mathcal{P}_2(\mathbb{R}^d)$ the family of square-integrable measures (equipped with the Wasserstein distance):
     $\mathcal{P}_2(\mathbb{R}^d) = \{\mu : \int_{\mathbb{R}^d} \|\theta\|^2\,\mu(d\theta) < +\infty\}$. (5)
     Define the functional $\mathcal{F} : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}_+$ as $\mathcal{F}(\mu) = \mathrm{KL}(\mu \mid \pi)$. (6)
     Approximate sampling thus becomes the minimization of $\mathcal{F}$ over some class $\mathcal{C}$ whose elements are easier to sample from:
     $\hat{\mu} = \arg\min_{\mu \in \mathcal{C}} \mathcal{F}(\mu)$. (7)
  8. Langevin Diffusion.
     • Vanilla Langevin diffusion: $dL_t^{\mathrm{LD}} = -\nabla F(L_t^{\mathrm{LD}})\,dt + \sqrt{2}\,dW_t$. (LD)
       The solution of this equation is a Markov process having $\pi$ as an invariant distribution: if $L_0^{\mathrm{LD}} \sim \pi$, then $L_t^{\mathrm{LD}} \sim \pi$ for all $t > 0$. (8)
     • When the potential function $F$ is $\lambda$-strongly convex, the Markov process is ergodic and its distribution converges to $\pi$ at a linear rate [Bha78]:
       $\mathrm{KL}(\nu_t^{\mathrm{LD}} \mid \pi) \le e^{-\lambda t}\,\mathrm{KL}(\nu_0^{\mathrm{LD}} \mid \pi)$. (9)
  9. Langevin Monte-Carlo. Vanilla Langevin diffusion (integrated):
     $L_\gamma^{\mathrm{LD}} = L_0^{\mathrm{LD}} - \int_0^\gamma \nabla F(L_s^{\mathrm{LD}})\,ds + \sqrt{2}\,W_\gamma$ (10)
     $\approx L_0^{\mathrm{LD}} - \int_0^\gamma \nabla F(L_0^{\mathrm{LD}})\,ds + \sqrt{2}\,W_\gamma$ (11)
     $= L_0^{\mathrm{LD}} - \gamma\,\nabla F(L_0^{\mathrm{LD}}) + \sqrt{2}\,W_\gamma$, (12)
     for a small $\gamma$. Langevin Monte-Carlo (LMC) is defined as:
     $x^{k+1} = x^k - \gamma\,\nabla F(x^k) + \sqrt{2\gamma}\,\xi^{k+1}$, for $k = 0, 1, 2, \dots$ (LMC)
     Here $(\xi^k)_{k\in\mathbb{N}}$ are i.i.d. standard Gaussians, independent of $x^k$. This Markov chain does not preserve $\pi$: $x^k \sim \pi \not\Rightarrow x^{k+1} \sim \pi$. Adding a Metropolis-Hastings correction step, [RT96] proved asymptotic convergence of LMC in total variation.
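To make the (LMC) recursion concrete, here is a minimal numpy sketch on an illustrative strongly log-concave target, $\pi = \mathcal{N}(0, \Sigma)$ with potential $F(x) = \frac{1}{2}x^\top \Sigma^{-1} x$; nothing here beyond the update rule itself comes from the talk.

```python
# Minimal sketch of LMC: x^{k+1} = x^k - gamma*grad F(x^k) + sqrt(2*gamma)*xi,
# on the illustrative Gaussian target pi = N(0, Sigma),
# whose potential is F(x) = x^T Sigma^{-1} x / 2.
import numpy as np

rng = np.random.default_rng(0)
d = 2
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

def grad_F(x):
    return Sigma_inv @ x        # gradient of the quadratic potential

gamma = 0.05                    # step size; should satisfy gamma < 1/L
x = np.zeros(d)
samples = []
for k in range(20_000):
    xi = rng.standard_normal(d)
    x = x - gamma * grad_F(x) + np.sqrt(2 * gamma) * xi
    samples.append(x.copy())

samples = np.array(samples[5_000:])   # discard burn-in
print("empirical covariance:\n", np.cov(samples.T))
print("target covariance:\n", Sigma)
```

The empirical covariance approaches $\Sigma$ up to a discretization bias of order $\gamma$, which is exactly the $\pi_\gamma$-vs-$\pi$ gap discussed on the next slide.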
  10. Unadjusted Langevin Algorithm. [Dal17] bounded the error induced by the discretization: the sequence $\nu_n$ converges to $\pi_\gamma$, the invariant measure of LMC with step size $\gamma$, and the error between $\pi_\gamma$ and $\pi$ is then controlled by choosing $\gamma$ small. (Figure: sketch made with a Remarkable tablet.) This led to a series of works studying LMC in various settings: [DM17], [CB18], [DMM19], [DK19], [CDJB19], etc.
  11. Langevin Monte-Carlo: Theorem.
     Theorem. Suppose $F$ is $\lambda$-strongly convex and $L$-smooth, i.e. $\lambda I_d \preceq \nabla^2 F(x) \preceq L I_d$. If $\gamma < 1/L$, then the following upper bound is satisfied:
     $W_2(\nu_n, \pi) \le (1 - \lambda\gamma)^n\,W_2(\nu_0, \pi) + 1.65\,\kappa\,\sqrt{\gamma d}$, (13)
     where $\nu_n$ is the law of $x^n$ and $\kappa = L/\lambda$ is the condition number.
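Bound (13) is typically read by balancing its two terms. The following back-of-the-envelope tuning is a standard consequence of bounds of this form, written out here as an aid; it is not stated on the slide.

```latex
% Goal: W_2(\nu_n, \pi) \le \varepsilon, splitting the budget in (13).
% Step 1: make the discretization term small via the step size
% (assuming the resulting \gamma also satisfies \gamma < 1/L):
1.65\,\kappa\sqrt{\gamma d} \le \frac{\varepsilon}{2}
  \quad\Longleftrightarrow\quad
  \gamma \le \frac{\varepsilon^2}{4 \cdot 1.65^2\,\kappa^2 d}.
% Step 2: make the contraction term small via the number of iterations:
(1-\lambda\gamma)^n\,W_2(\nu_0,\pi)
  \le e^{-\lambda\gamma n}\,W_2(\nu_0,\pi) \le \frac{\varepsilon}{2}
  \quad\Longleftarrow\quad
  n \ge \frac{1}{\lambda\gamma}\,\log\frac{2\,W_2(\nu_0,\pi)}{\varepsilon}.
% Combining the two steps: n = \tilde{O}\big(\kappa^2 d / (\lambda\,\varepsilon^2)\big)
% iterations suffice.
```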
  12. PL inequality. Suppose that $g$ is $L$-smooth ($\nabla^2 g(x) \preceq L I_d$) and $\min_{x\in\mathbb{R}^d} g(x) = 0$. We say that $g$ satisfies the Polyak-Łojasiewicz inequality if for every $x \in \mathbb{R}^d$:
     $g(x) \le \frac{1}{2\lambda}\,\|\nabla g(x)\|^2$. (PL)
     • If $g$ is $\lambda$-strongly convex, it satisfies PL.
     • If $g$ satisfies the PL inequality and $\gamma < 1/L$, then gradient descent (GD), $x^{k+1} = x^k - \gamma\,\nabla g(x^k)$, satisfies
       $g(x^{k+1}) \le g(x^k) + (-\gamma + L\gamma^2/2)\,\|\nabla g(x^k)\|^2 \le (1 - \lambda\gamma/2)\,g(x^k)$.
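A quick numerical check of the GD contraction on this slide, on an illustrative $\lambda$-strongly convex quadratic (hence PL with the same $\lambda$); the matrix below is an assumption of this sketch, not from the talk.

```python
# Check that g(x^{k+1}) <= (1 - lambda*gamma/2) g(x^k) along GD iterates
# for a strongly convex quadratic g(x) = x^T A x / 2 (lambda = 0.5, L = 4).
import numpy as np

A = np.diag([0.5, 1.0, 4.0])
lam, L = 0.5, 4.0
gamma = 0.9 / L                 # step size gamma < 1/L

g = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.ones(3)
for k in range(5):
    x_new = x - gamma * grad(x)
    ratio = g(x_new) / g(x)
    print(f"k={k}: g ratio = {ratio:.4f}  (bound: {1 - lam * gamma / 2:.4f})")
    x = x_new
```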
  13. Log-Sobolev inequality. The Log-Sobolev inequality (LSI) is the analog of the PL inequality for the functional $\mathcal{F} : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}$.
     Definition. We say that $\pi$ satisfies $\lambda$-LSI if for every $\mu$:
     $\mathcal{F}(\mu) = \mathrm{KL}(\mu \mid \pi) \le \frac{1}{2\lambda}\,J(\mu \mid \pi)$, (14)
     where $J(\cdot \mid \cdot)$ is the Fisher information.
     • If $F$ is $\lambda$-strongly convex, then $\pi$ satisfies $\lambda$-LSI (Bakry-Émery ’85).
     • LSI is stable under Lipschitz maps and bounded perturbations (Holley-Stroock theorem).
     • Vempala and Wibisono proved the convergence of LMC under LSI.
  14. Federated learning. “Federated learning is a machine learning setting where multiple entities (clients) collaborate in solving a machine learning problem, under the coordination of a central server or service provider. Each client’s raw data is stored locally and not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective.” [KMA+21]
     Nowadays, cross-device FL mechanisms are widely used:
     • Medical research [CKLT18, BCM+18];
     • Distributed systems [XIZ+23];
     • The Gboard mobile keyboard, Android Messages, Apple’s Siri [EPK14, Pic19].
  15. Federated learning paradigm. We consider the case where the potential function is sum-decomposable:
     $F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$. (15)
     • 1 server, $n$ clients.
     • Each $f_i$ is stored on client $i$.
     • The clients compute local gradients in parallel.
     • Each client compresses its gradient and sends it to the server.
     • The server aggregates, compresses, and sends back the new iterate in parallel; a minimal sketch of one such round follows below.
     (Figure: federated learning protocol. Source: Wikipedia.)
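A minimal single-round sketch of this protocol, with rand-k sparsification standing in for the compressor on both the uplink and the downlink. This only illustrates the communication pattern; the ELF algorithms in the paper additionally combine compression with EF21/EF21-P error feedback, which is omitted here, and the quadratic local potentials are purely illustrative.

```python
# One round of compressed federated Langevin sampling (illustrative only):
# clients compress local gradients (dual/uplink), the server averages them,
# takes a Langevin step, and compresses the broadcast (primal/downlink).
import numpy as np

rng = np.random.default_rng(0)
d, n, k, gamma = 10, 4, 3, 0.1

def rand_k(v, k):
    """Keep k uniformly random coordinates of v; contractive with alpha = k/d."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx]
    return out

# illustrative quadratic local potentials f_i(x) = ||x - b_i||^2 / 2
b = rng.standard_normal((n, d))
grad_f = lambda i, x: x - b[i]

x = np.zeros(d)
# clients: compute local gradients, compress, send to the server (uplink)
msgs = [rand_k(grad_f(i, x), k) for i in range(n)]
# server: aggregate, take a Langevin step, compress the broadcast (downlink)
g_hat = np.mean(msgs, axis=0)
x_new = x - gamma * g_hat + np.sqrt(2 * gamma) * rng.standard_normal(d)
x = x + rand_k(x_new - x, k)
print("iterate after one round:", np.round(x, 3))
```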
  16. Federated learning paradigm. We consider the case where the potential function is sum-decomposable:
     $F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$. (16)
     See [KMY+16, KMA+21] for details.
  17. Optimization → sampling. Recall the LMC algorithm:
     $x^{k+1} = \underbrace{x^k - \gamma\,\nabla F(x^k)}_{\text{gradient descent}} + \underbrace{\sqrt{2\gamma}\,\xi^k}_{\text{noise}}$. (17)
     In particular, federated learning algorithms can be turned into sampling algorithms by adding Gaussian noise to the update:
     • LMC + generic SGD [DK19]
     • LMC + SVRG [CFM+18]
     • LMC + Proximal GD [BDMS19]
     • LMC + Mirror GD [HKRC18]
     • LMC + FedAvg [PMD23]
     • LMC + MARINA [SSR22]
     • LMC + QSGD [VPD+22]
     • LMC + EF21 + EF21-P [KR23]
  18. Compression.
     Definition (Contractive compressor). A stochastic mapping $Q : \mathbb{R}^d \to \mathbb{R}^d$ is a contractive compression operator with a coefficient $\alpha \in (0, 1]$ if for any $x \in \mathbb{R}^d$,
     $\mathbb{E}\,\|Q(x) - x\|^2 \le (1 - \alpha)\,\|x\|^2$.
     We denote this shortly as $Q \in \mathbb{B}(\alpha)$.
     • Top-k keeps the $k$ coordinates with the largest absolute values and zeroes out the rest. Example: for $x = (-4, 3, 10, -1)$ we have $Q_{\text{top-2}}(x) = (-4, 0, 10, 0)$.
     • The compressor may be biased.
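The Top-k compressor from this slide in a few lines of numpy, together with a numerical check of the contraction property; Top-k is deterministic, so no expectation is needed, and its coefficient is $\alpha = k/d$ in the worst case (hence $\alpha = 1/d$ for Top-1, as noted on slide 20).

```python
# Top-k compressor and a check of ||Q(x) - x||^2 <= (1 - alpha)||x||^2
# with alpha = k/d (worst case attained for equal-magnitude entries).
import numpy as np

def top_k(x, k):
    """Keep the k entries of x with largest absolute value, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

x = np.array([-4.0, 3.0, 10.0, -1.0])
print("Q_top-2(x) =", top_k(x, 2))     # (-4, 0, 10, 0), as on the slide

rng = np.random.default_rng(0)
d, k = 20, 5
alpha = k / d
for _ in range(3):
    v = rng.standard_normal(d)
    lhs = np.sum((top_k(v, k) - v) ** 2)
    rhs = (1 - alpha) * np.sum(v ** 2)
    print(f"||Q(v)-v||^2 = {lhs:.3f} <= (1-alpha)||v||^2 = {rhs:.3f}")
```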
  19. Main theorem.
     Theorem. Assume that LSI holds with constant $\lambda > 0$ and let $x^k$ be the iterates of the B-ELF algorithm. Denote $\rho_k := \mathcal{D}(x^k)$ (the law of $x^k$) for every $k \in \mathbb{N}$. If each function $f_i$ is $L_i$-smooth, then for a small enough step size $\gamma$ the following bound holds for the KL error of B-ELF:
     $\mathrm{KL}(\rho_K \mid \pi) \le e^{-\lambda K \gamma}\,\Psi + \frac{\tau(\gamma, d)}{\lambda}$,
     where $\Psi$ and $\tau$ depend explicitly on the parameters of the problem. Taking $\gamma$ small enough makes the second term small; taking $K$ large enough then reduces the first.
  20. Discussion.
     • We do not assume strong convexity of the potential; instead we assume the Log-Sobolev inequality, which is the analog of the PL inequality.
     • To obtain $\varepsilon$ KL error, D-ELF and P-ELF need $\tilde{O}\big(d/(\lambda^2\alpha^2\varepsilon)\big)$ iterations.
     • To obtain $\varepsilon$ KL error, B-ELF needs $\tilde{O}\big(d/(\lambda^2\alpha^4\varepsilon)\big)$ iterations.
     • The contraction coefficient for Top-1 is $\alpha = 1/d$.
     • In practice, the algorithms are significantly faster.
     (Figure: test accuracy vs. communicated bits for B-ELF, P-ELF, D-ELF and LMC with Top-10 compression, on the a9a and mushrooms datasets.)
  21. References
     [BCM+18] Theodora S. Brisimi, Ruidi Chen, Theofanie Mela, Alex Olshevsky, Ioannis Ch. Paschalidis, and Wei Shi. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112:59–67, 2018.
     [BDMS19] Nicolas Brosse, Alain Durmus, Éric Moulines, and Sotirios Sabanis. The tamed unadjusted Langevin algorithm. Stochastic Processes and their Applications, 129(10):3638–3663, 2019.
     [Bha78] R. N. Bhattacharya. Criteria for recurrence and existence of invariant measures for multidimensional diffusions. Ann. Probab., 6(4):541–553, 1978.
     [CB18] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of ALT 2018, 2018.
     [CCL+22] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.
     [CDJB19] Niladri S. Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter L. Bartlett. Langevin Monte Carlo without smoothness. arXiv preprint arXiv:1905.13285, 2019.
     [CFM+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In International Conference on Machine Learning, pages 764–773. PMLR, 2018.
     [CHIS23] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
     [CKLT18] Rachel Cummings, Sara Krehbiel, Kevin A. Lai, and Uthaipon Tantipongpipat. Differential privacy for growing databases. Advances in Neural Information Processing Systems, 31, 2018.
     [Dal17] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from a smooth and log-concave density. J. R. Stat. Soc. B, 79:651–676, 2017.
     [DDB17] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv preprint arXiv:1707.06386, 2017.
     [DK19] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.
     [DM17] Alain Durmus and Eric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017.
  22. [DMM19] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. J. Mach. Learn. Res., 20:73–1, 2019.
     [EPK14] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.
     [HKRC18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored Langevin dynamics. Advances in Neural Information Processing Systems, 31, 2018.
     [IVHW21] Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, and Andrew Gordon Gordon Wilson. What are Bayesian neural network posteriors really like? In International Conference on Machine Learning, pages 4629–4640. PMLR, 2021.
     [KMA+21] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
     [KMY+16] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
     [KR23] Avetik Karagulyan and Peter Richtárik. ELF: Federated Langevin algorithms with primal, dual and bidirectional compression. arXiv preprint arXiv:2303.04622, 2023.
     [Lam21] Andrew Lamperski. Projected stochastic gradient Langevin algorithms for constrained sampling and non-convex learning. In Conference on Learning Theory, pages 2891–2937. PMLR, 2021.
     [LZBG20] Fujun Luan, Shuang Zhao, Kavita Bala, and Ioannis Gkioulekas. Langevin Monte Carlo rendering with gradient-based adaptation. ACM Trans. Graph., 39(4):140, 2020.
     [Pic19] Sundar Pichai. Privacy should not be a luxury good. The New York Times, 8:25, 2019.
     [PMD23] Vincent Plassier, Eric Moulines, and Alain Durmus. Federated averaging Langevin dynamics: Toward a unified theory and new algorithms. In International Conference on Artificial Intelligence and Statistics, pages 5299–5356. PMLR, 2023.
     [RC13] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2013.
     [Rob07] Christian Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. New York: Springer, 2007.
     [RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1674–1703, 2017.
  23. [RT96] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
     [RZGS23] Anant Raj, Lingjiong Zhu, Mert Gurbuzbalaban, and Umut Simsekli. Algorithmic stability of heavy-tailed SGD with general loss functions. In International Conference on Machine Learning, pages 28578–28597. PMLR, 2023.
     [SSR22] Lukang Sun, Adil Salim, and Peter Richtárik. Federated learning with a sampling algorithm under isoperimetry. arXiv preprint arXiv:2206.00920, 2022.
     [VPD+22] Maxime Vono, Vincent Plassier, Alain Durmus, Aymeric Dieuleveut, and Eric Moulines. QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 6459–6500. PMLR, 2022.
     [XIZ+23] Jihao Xin, Ivan Ilin, Shunkang Zhang, Marco Canini, and Peter Richtárik. Kimad: Adaptive gradient compression with bandwidth awareness. In Proceedings of DistributedML ’23, December 2023.