
Federated ADMM from Bayesian Duality

Thomas Möllenhoff

ICSP 2025 invited session

Uploaded by Jia-Jie Zhu, August 18, 2025

Transcript

  1. Federated ADMM from Bayesian Duality Thomas Möllenhoff ICSP 2025, Paris

    Session: Exploring the synergy between stochastic optimization, dynamics, sampling, inference, and optimal transport
  2. 2 Siddharth Swaroop, Emtiyaz Khan, Finale Doshi-Velez, Thomas Möllenhoff.
     Code available at: https://github.com/team-approx-bayes/bayes-admm
     1. T. Möllenhoff*, S. Swaroop*, F. Doshi-Velez, M. E. Khan, Federated ADMM from Bayesian Duality, arXiv:2506.13150, 2025.
  3. Why Federated Learning? 3
     • Collaborative training of large open models across institutions
     • Perhaps we don’t want to (or cannot) share all data
     • Specialized models which share their knowledge with each other
     • We may have highly heterogeneous data…
     • … or train on entirely different tasks (biology, math, physics, …)
     • What knowledge should the local models exchange with each other?
     • How should shared information be weighted?
     This talk: Improving Federated ADMM via Bayes
  4. Why is Uncertainty Important in Distributed Learning? 4
     $\min_\theta \sum_{i=1}^{K} \ell_i(\theta)$
     • $K$ clients; each minimizes a loss function $\ell_i$ to find their local solution $\theta_i$
     • What information should be exchanged between the clients?
     • Example: $\ell_i(\theta) = \frac{1}{2}\sum_{j=1}^{N_i} (\theta - x_{i,j})^2$. Client 1 holds {1, 2, 3} (local mean 2), client 2 holds {4, 6, 5, 5, 10} (local mean 6). Is the global solution the naive average, 4? A: 4.5, the precision-weighted average $(3\cdot 2 + 5\cdot 6)/8$, because $N_i = \nabla^2 \ell_i$ is the posterior precision (inverse covariance); this is checked numerically below.
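
A quick numeric check of this example, as a minimal plain-Python sketch (the data and the precision-weighted answer are taken from the slide):

```python
# Two clients with quadratic losses l_i(theta) = 1/2 * sum_j (theta - x_ij)^2.
# The second derivative of l_i is N_i (the number of local data points),
# i.e., the posterior precision, so the right way to combine the local
# minimizers is a precision-weighted average.
x1 = [1, 2, 3]          # client 1: local minimizer is the mean, 2
x2 = [4, 6, 5, 5, 10]   # client 2: local minimizer is the mean, 6

m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
naive = (m1 + m2) / 2                                           # 4.0, ignores uncertainty
weighted = (len(x1) * m1 + len(x2) * m2) / (len(x1) + len(x2))  # 4.5, the global minimizer

print(naive, weighted)  # 4.0 4.5
```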
  5. 5 [figure-only slide; no transcribable text]

  6. 6 [figure-only slide; no transcribable text]

  7. Existing Federated Learning Methods 7 (starting from $\bar\theta$)

     FedAvg [1]:
     $\theta_i \leftarrow \arg\min_\theta \ell_i(\theta)$
     $\bar\theta \leftarrow \frac{1}{K}\sum_{k=1}^{K} \theta_k$

     FedProx [2]:
     $\theta_i \leftarrow \arg\min_\theta \ell_i(\theta) + \frac{\rho}{2}\lVert\theta - \bar\theta\rVert^2$
     $\bar\theta \leftarrow \frac{1}{K}\sum_{k=1}^{K} \theta_k$

     Alternating Direction Method of Multipliers (ADMM) (e.g., [3]); a one-round sketch follows below:
     $\theta_i \leftarrow \arg\min_\theta \ell_i(\theta) + v_i^\top\theta + \frac{\rho}{2}\lVert\theta - \bar\theta\rVert^2$
     $v_i \leftarrow v_i + \rho(\theta_i - \bar\theta)$
     $\bar\theta \leftarrow (1-\alpha)\Big(\frac{1}{K}\sum_{k=1}^{K}\theta_k\Big) + \alpha\sum_{k=1}^{K} v_k$
     or, with a server regularizer $R$:
     $\bar\theta \leftarrow \arg\min_\theta R(\theta) + \sum_{i=1}^{K}\Big(\frac{\rho}{2}\lVert\theta - \theta_i\rVert^2 - v_i^\top\theta\Big)$

     1. McMahan, Brendan, et al. Communication-efficient learning of deep networks from decentralized data. AISTATS, 2017.
     2. Li, Tian, et al. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2020.
     3. Zhang, Xinwei, et al. FedPD: A federated learning framework with adaptivity to non-IID data. IEEE Transactions on Signal Processing 69 (2021): 6055-6070.
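
For concreteness, a minimal NumPy sketch of one round of the ADMM updates above. The gradient oracle `grads` and the inner gradient-descent solver are hypothetical stand-ins for whatever local solver a client actually runs:

```python
import numpy as np

def admm_round(grads, theta_bar, v, rho=1.0, alpha=0.5, inner_steps=100, lr=0.01):
    """One federated ADMM round following the slide's updates.
    grads[i](theta) returns the gradient of client i's loss l_i."""
    K = len(grads)
    thetas = []
    for i in range(K):
        theta = theta_bar.copy()
        # theta_i <- argmin_theta l_i(theta) + v_i^T theta + rho/2 ||theta - theta_bar||^2,
        # approximated here by a few gradient-descent steps
        for _ in range(inner_steps):
            theta -= lr * (grads[i](theta) + v[i] + rho * (theta - theta_bar))
        thetas.append(theta)
        v[i] = v[i] + rho * (theta - theta_bar)  # dual ascent on the consensus constraint
    # server step: damped average of client iterates plus the dual variables
    theta_bar = (1 - alpha) * np.mean(thetas, axis=0) + alpha * np.sum(v, axis=0)
    return theta_bar, v
```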
  8. Lagrangian View on ADMM 8
     Proposal: Lift the problem to probability distributions. Variational Bayesian formulation: replace $\min_\theta \sum_i \ell_i(\theta)$ by a minimization over distributions, $\min_q \sum_{i=1}^{K} \mathbb{E}_{\theta\sim q}[\ell_i(\theta)] - H(q)$, split across clients with local posteriors constrained to agree with a global $\bar q$ (cf. the updates on the next slides).
  9. Recovers and Extends Existing ADMM 10
     Exponential family: different choices of exponential family recover existing algorithms and give new ones!
     Examples: a novel "Adam"-like ADMM and a novel Newton-like ADMM.
  10. Convergence in a Single Round 11 (numeric check below)
      $\ell_i(\theta) = \frac{1}{2}\sum_{j=1}^{N_i} \lVert X_{i,j}\theta - y_{i,j}\rVert^2$
      1. Wang & Banerjee, Bregman Alternating Direction Method of Multipliers, NIPS 2013.
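
A small numeric sketch of why a single round can suffice for quadratic losses (assumptions: scalar θ, and each client communicates the natural parameters of its Gaussian posterior, i.e., precision and linear term):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three clients, each with a quadratic loss l_i(theta) = 1/2 * sum_j (X_ij * theta - y_ij)^2
X = [rng.normal(size=20) for _ in range(3)]
y = [rng.normal(size=20) for _ in range(3)]

# Natural parameters of each client's Gaussian posterior:
# precision h_i = X_i^T X_i, linear term b_i = X_i^T y_i
h = [xi @ xi for xi in X]
b = [xi @ yi for xi, yi in zip(X, y)]

one_round = sum(b) / sum(h)  # aggregate after a single communication round
X_all, y_all = np.concatenate(X), np.concatenate(y)
centralized = (X_all @ y_all) / (X_all @ X_all)
assert np.isclose(one_round, centralized)  # exact: quadratics converge in one round
```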
  11. Connections to Existing Algorithms 12 (server-side sketch below)

      BayesADMM:
      $\mu_i \leftarrow \arg\min_\mu \mathbb{E}_q[\ell_i(\theta)] + \mu^\top\hat\lambda_i + \rho\,\mathrm{KL}(q\,\|\,\bar q)$
      $\hat\lambda_i \leftarrow \hat\lambda_i + \rho(\lambda_i - \bar\lambda)$
      $\bar\lambda \leftarrow \frac{1-\alpha}{K}\sum_{k=1}^{K}\lambda_k + \alpha\sum_{k=0}^{K}\hat\lambda_k$, with $\alpha = 1/(1+\rho K)$

      Partitioned Variational Inference [2, 3]:
      $\mu_i \leftarrow \arg\min_\mu \mathbb{E}_q[\ell_i(\theta)] + \mu^\top\hat\lambda_i + \mathrm{KL}(q\,\|\,\bar q)$
      $\hat\lambda_i \leftarrow \hat\lambda_i + \rho(\lambda_i - \bar\lambda)$
      $\bar\lambda \leftarrow \sum_{k=0}^{K}\hat\lambda_k$

      Bayesian Learning Rule [1]:
      $\lambda \leftarrow (1-\alpha)\lambda + \alpha\sum_{k=0}^{K}\nabla_\mu \mathbb{E}_q[-\ell_k]\big|_{\mu=\nabla A(\lambda)}$, where $\hat\lambda_i = \nabla_\mu \mathbb{E}_q[-\ell_i]\big|_{\mu=\mu_k}$

      1. Khan and Rue, The Bayesian Learning Rule, JMLR, 2023.
      2. Swaroop, S., Khan, M. E., and Doshi-Velez, F. Connecting Federated ADMM to Bayes. ICLR 2025.
      3. Bui, Thang D., et al. Partitioned variational inference: A unified framework encompassing federated and continual learning. arXiv:1811.11206 (2018).
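
As a concrete sketch, the server-side steps of BayesADMM in natural-parameter space, directly transcribing the $\bar\lambda$ and $\hat\lambda_i$ updates above (works for any exponential family; variable names are hypothetical, and the client inner solve is abstracted away):

```python
import numpy as np

def bayes_admm_server(lams, lam_hats, rho):
    """Server update from the slide:
    lam_bar <- (1-alpha)/K * sum_{k=1..K} lam_k + alpha * sum_{k=0..K} lam_hat_k,
    with alpha = 1/(1 + rho*K). lam_hats has K+1 entries; index 0 is the prior term."""
    K = len(lams)
    alpha = 1.0 / (1.0 + rho * K)
    return (1 - alpha) / K * np.sum(lams, axis=0) + alpha * np.sum(lam_hats, axis=0)

def bayes_admm_dual(lam_hat_i, lam_i, lam_bar, rho):
    """Dual step from the slide: lam_hat_i <- lam_hat_i + rho * (lam_i - lam_bar)."""
    return lam_hat_i + rho * (lam_i - lam_bar)
```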
  12. Comparison to Partitioned Variational Inference 13
      ADMM uses $\rho = 1$; PVI uses $\rho = 1/K$ (with damping) or $\rho = 1$ (without damping).
      1. Bui, Thang D., et al. Partitioned variational inference: A unified framework encompassing federated and continual learning. arXiv:1811.11206 (2018).
      2. Wang & Banerjee, Bregman Alternating Direction Method of Multipliers, NIPS 2013.
  13. Scalable Implementation for Large Neural Nets 14
      • Main difficulty: the first step is a variational learning/inference problem
      • We restrict the family to Gaussians with diagonal covariance
      • We use the recent IVON (Improved Variational Online Newton) optimizer [1]
      • IVON is a scalable drop-in replacement for Adam/RMSprop which does mean-field VI
      1. Y. Shen*, N. Daheim*, …, T. Möllenhoff. Variational Learning is Effective for Large Deep Networks, ICML 2024 (Spotlight).
  14. 15 RMSprop/Adam vs. Variational Online Newton

      Variational objective (entropy weighted by $\tau$, regularizer $R(\theta) = \frac{\delta}{2}\lVert\theta\rVert^2$):
      $\min_{q(\theta)} \mathbb{E}_{\theta\sim q}\big[\ell(\theta) + R(\theta)\big] - \tau H(q)$
      Natural-gradient update: $\lambda' = \lambda - \rho F_\lambda^{-1}\nabla_\lambda\big(\mathbb{E}_q[\ell + R] - H(q)\big)$

      RMSprop/Adam:
      1. $\hat g \leftarrow \hat\nabla\ell(\theta)$
      2. $\hat h \leftarrow \hat g^2$
      3. $h \leftarrow (1-\rho)h + \rho\hat h$
      4. $\theta \leftarrow \theta - \alpha(\hat g + \delta\theta)/(\sqrt{h} + \delta)$

      Variational Online Newton (noisy weights!); see the sketch after this slide:
      1. $\hat g \leftarrow \hat\nabla\ell(\theta)$, where $\theta \sim \mathcal{N}(m, \sigma^2)$
      2. $\hat h \leftarrow \hat g \cdot (\theta - m)/\sigma^2$ (unbiased estimator of the expected diagonal Hessian)
      3. $h \leftarrow (1-\rho)h + \rho\hat h + \rho^2(h-\hat h)^2/(2(h+\delta))$
      4. $m \leftarrow m - \alpha(\hat g + \delta m)/(h + \delta)$
      5. $\sigma^2 \leftarrow 1/(N(h + \delta))$

      [Figure: language model predicting P(next word) from a prompt/context, trained with noisy weights]

      M. E. Khan, H. Rue, The Bayesian Learning Rule, J. Mach. Learn. Res., 2023.
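
A NumPy sketch of one Variational Online Newton step, mirroring the five numbered lines above (`grad_loss` is a hypothetical minibatch-gradient oracle; `N` is the training-set size):

```python
import numpy as np

def von_step(m, sigma2, h, grad_loss, alpha, rho, delta, N, rng):
    """One step of (Improved) Variational Online Newton."""
    theta = m + np.sqrt(sigma2) * rng.standard_normal(m.shape)  # sample theta ~ N(m, sigma^2)
    g_hat = grad_loss(theta)                                    # 1. gradient at the noisy weights
    h_hat = g_hat * (theta - m) / sigma2                        # 2. unbiased diagonal-Hessian estimate
    h = (1 - rho) * h + rho * h_hat \
        + rho**2 * (h - h_hat)**2 / (2 * (h + delta))           # 3. Hessian average (IVON correction term)
    m = m - alpha * (g_hat + delta * m) / (h + delta)           # 4. Newton-like mean update
    sigma2 = 1.0 / (N * (h + delta))                            # 5. posterior variance
    return m, sigma2, h
```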
  15. Bayesian-style Training of Large Deep Nets 17
      • State-of-the-art uncertainty estimation with a few lines of PyTorch code (usage sketch below)
      • Easily scales to GPT-2 sized models
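
A minimal PyTorch training-loop sketch in the style of the public IVON package (github.com/team-approx-bayes/ivon); the constructor arguments and the `sampled_params` context manager follow that repo's README and should be treated as assumptions here:

```python
import torch
import torch.nn.functional as F
import ivon  # pip install ivon-opt

model = torch.nn.Linear(784, 10)  # placeholder model
optimizer = ivon.IVON(model.parameters(), lr=0.1, ess=60000)  # ess ~ training-set size

for X, y in train_loader:  # train_loader assumed to exist
    with optimizer.sampled_params(train=True):  # draw noisy weights theta ~ N(m, sigma^2)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(X), y)
        loss.backward()
    optimizer.step()  # IVON mean/variance update
```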
  16. Practical Deep Learning Implementation with IVON 19
      Required changes to FederatedADMM are highlighted in red.
      1. T. Möllenhoff*, S. Swaroop*, F. Doshi-Velez, M. E. Khan, Federated ADMM from Bayesian Duality, arXiv:2506.13150, 2025.
  17. 20 BayesADMM works well for heterogeneous data!
      • 7% better accuracy than FedLap at the same cost/speed!
      • 1% improved accuracy, with no need for an expensive Laplace approximation.
  18. 22 Thank you for your attention! Questions? Code available at:

    https://github.com/team-approx-bayes/bayes-admm