
Federated ADMM from Bayesian Duality

Thomas Möllenhoff

ICSP 2025 invited session

Uploaded by Jia-Jie Zhu, August 18, 2025

Transcript

  1. Federated ADMM from Bayesian Duality Thomas Möllenhoff ICSP 2025, Paris

    Session: Exploring the synergy between stochastic optimization, dynamics, sampling, inference, and optimal transport
  2. 2 Siddharth Swaroop, Emtiyaz Khan, Finale Doshi-Velez, Thomas Möllenhoff.
     Code available at: https://github.com/team-approx-bayes/bayes-admm
     1. T. Möllenhoff*, S. Swaroop*, F. Doshi-Velez, M. E. Khan, Federated ADMM from Bayesian Duality, arXiv:2506.13150, 2025.
  3. Why Federated Learning? 3
     • Collaborative training of large open models across institutions
     • Perhaps we don’t want to (or cannot) share all data
     • Specialized models which share their knowledge with each other
     • We may have highly heterogeneous data…
     • … or train on entirely different tasks (biology, math, physics, …)
     • What knowledge should the local models exchange with each other?
     • How should shared information be weighted?
     This talk: Improving Federated ADMM via Bayes
  4. Why is Uncertainty Important in Distributed Learning? 4
     $\min_\theta \sum_{i=1}^{K} \ell_i(\theta)$
     • $K$ clients; each minimizes a loss function $\ell_i$ to find their local solution $\theta_i$
     • What information should be exchanged between the clients?
     • Example: $\ell_i(\theta) = \frac{1}{2}\sum_{j=1}^{N_i} (\theta - x_{i,j})^2$. Client 1 holds {1, 2, 3} (local mean 2), client 2 holds {4, 6, 5, 5, 10} (local mean 6). Is the global solution the naive average, 4? A: 4.5, the precision-weighted average $(3\cdot 2 + 5\cdot 6)/8$, because $N_i = \nabla^2 \ell_i$ is the posterior precision (inverse covariance); this is checked numerically below.
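
A quick numeric check of this example, as a minimal plain-Python sketch (the data and the precision-weighted answer are taken from the slide):

```python
# Two clients with quadratic losses l_i(theta) = 1/2 * sum_j (theta - x_ij)^2.
# The second derivative of l_i is N_i (the number of local data points),
# i.e., the posterior precision, so the right way to combine the local
# minimizers is a precision-weighted average.
x1 = [1, 2, 3]          # client 1: local minimizer is the mean, 2
x2 = [4, 6, 5, 5, 10]   # client 2: local minimizer is the mean, 6

m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
naive = (m1 + m2) / 2                                           # 4.0, ignores uncertainty
weighted = (len(x1) * m1 + len(x2) * m2) / (len(x1) + len(x2))  # 4.5, the global minimizer

print(naive, weighted)  # 4.0 4.5
```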
  5. 5 [figure-only slide; no transcribable text]

  6. 6 [figure-only slide; no transcribable text]

  7. Existing Federated Learning Methods 7 (starting from $\bar\theta$)

     FedAvg [1]:
     $\theta_i \leftarrow \arg\min_\theta \ell_i(\theta)$
     $\bar\theta \leftarrow \frac{1}{K}\sum_{k=1}^{K} \theta_k$

     FedProx [2]:
     $\theta_i \leftarrow \arg\min_\theta \ell_i(\theta) + \frac{\rho}{2}\lVert\theta - \bar\theta\rVert^2$
     $\bar\theta \leftarrow \frac{1}{K}\sum_{k=1}^{K} \theta_k$

     Alternating Direction Method of Multipliers (ADMM) (e.g., [3]); a one-round sketch follows below:
     $\theta_i \leftarrow \arg\min_\theta \ell_i(\theta) + v_i^\top\theta + \frac{\rho}{2}\lVert\theta - \bar\theta\rVert^2$
     $v_i \leftarrow v_i + \rho(\theta_i - \bar\theta)$
     $\bar\theta \leftarrow (1-\alpha)\Big(\frac{1}{K}\sum_{k=1}^{K}\theta_k\Big) + \alpha\sum_{k=1}^{K} v_k$
     or, with a server regularizer $R$:
     $\bar\theta \leftarrow \arg\min_\theta R(\theta) + \sum_{i=1}^{K}\Big(\frac{\rho}{2}\lVert\theta - \theta_i\rVert^2 - v_i^\top\theta\Big)$

     1. McMahan, Brendan, et al. Communication-efficient learning of deep networks from decentralized data. AISTATS, 2017.
     2. Li, Tian, et al. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2020.
     3. Zhang, Xinwei, et al. FedPD: A federated learning framework with adaptivity to non-IID data. IEEE Transactions on Signal Processing 69 (2021): 6055-6070.
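
For concreteness, a minimal NumPy sketch of one round of the ADMM updates above. The gradient oracle `grads` and the inner gradient-descent solver are hypothetical stand-ins for whatever local solver a client actually runs:

```python
import numpy as np

def admm_round(grads, theta_bar, v, rho=1.0, alpha=0.5, inner_steps=100, lr=0.01):
    """One federated ADMM round following the slide's updates.
    grads[i](theta) returns the gradient of client i's loss l_i."""
    K = len(grads)
    thetas = []
    for i in range(K):
        theta = theta_bar.copy()
        # theta_i <- argmin_theta l_i(theta) + v_i^T theta + rho/2 ||theta - theta_bar||^2,
        # approximated here by a few gradient-descent steps
        for _ in range(inner_steps):
            theta -= lr * (grads[i](theta) + v[i] + rho * (theta - theta_bar))
        thetas.append(theta)
        v[i] = v[i] + rho * (theta - theta_bar)  # dual ascent on the consensus constraint
    # server step: damped average of client iterates plus the dual variables
    theta_bar = (1 - alpha) * np.mean(thetas, axis=0) + alpha * np.sum(v, axis=0)
    return theta_bar, v
```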
  8. Lagrangian View on ADMM 8
     Proposal: Lift the problem to probability distributions. Variational Bayesian formulation: replace $\min_\theta \sum_i \ell_i(\theta)$ by a minimization over distributions, $\min_q \sum_{i=1}^{K} \mathbb{E}_{\theta\sim q}[\ell_i(\theta)] - H(q)$, split across clients with local posteriors constrained to agree with a global $\bar q$ (cf. the updates on the next slides).
  9. Recovers and Extends Existing ADMM 10
     Exponential family: different choices of exponential family recover existing algorithms and give new ones!
     Examples: a novel "Adam"-like ADMM and a novel Newton-like ADMM.
  10. Convergence in a Single Round 11 (numeric check below)
      $\ell_i(\theta) = \frac{1}{2}\sum_{j=1}^{N_i} \lVert X_{i,j}\theta - y_{i,j}\rVert^2$
      1. Wang & Banerjee, Bregman Alternating Direction Method of Multipliers, NIPS 2013.
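
A small numeric sketch of why a single round can suffice for quadratic losses (assumptions: scalar θ, and each client communicates the natural parameters of its Gaussian posterior, i.e., precision and linear term):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three clients, each with a quadratic loss l_i(theta) = 1/2 * sum_j (X_ij * theta - y_ij)^2
X = [rng.normal(size=20) for _ in range(3)]
y = [rng.normal(size=20) for _ in range(3)]

# Natural parameters of each client's Gaussian posterior:
# precision h_i = X_i^T X_i, linear term b_i = X_i^T y_i
h = [xi @ xi for xi in X]
b = [xi @ yi for xi, yi in zip(X, y)]

one_round = sum(b) / sum(h)  # aggregate after a single communication round
X_all, y_all = np.concatenate(X), np.concatenate(y)
centralized = (X_all @ y_all) / (X_all @ X_all)
assert np.isclose(one_round, centralized)  # exact: quadratics converge in one round
```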
  11. Connections to Existing Algorithms 12 (server-side sketch below)

      BayesADMM:
      $\mu_i \leftarrow \arg\min_\mu \mathbb{E}_q[\ell_i(\theta)] + \mu^\top\hat\lambda_i + \rho\,\mathrm{KL}(q\,\|\,\bar q)$
      $\hat\lambda_i \leftarrow \hat\lambda_i + \rho(\lambda_i - \bar\lambda)$
      $\bar\lambda \leftarrow \frac{1-\alpha}{K}\sum_{k=1}^{K}\lambda_k + \alpha\sum_{k=0}^{K}\hat\lambda_k$, with $\alpha = 1/(1+\rho K)$

      Partitioned Variational Inference [2, 3]:
      $\mu_i \leftarrow \arg\min_\mu \mathbb{E}_q[\ell_i(\theta)] + \mu^\top\hat\lambda_i + \mathrm{KL}(q\,\|\,\bar q)$
      $\hat\lambda_i \leftarrow \hat\lambda_i + \rho(\lambda_i - \bar\lambda)$
      $\bar\lambda \leftarrow \sum_{k=0}^{K}\hat\lambda_k$

      Bayesian Learning Rule [1]:
      $\lambda \leftarrow (1-\alpha)\lambda + \alpha\sum_{k=0}^{K}\nabla_\mu \mathbb{E}_q[-\ell_k]\big|_{\mu=\nabla A(\lambda)}$, where $\hat\lambda_i = \nabla_\mu \mathbb{E}_q[-\ell_i]\big|_{\mu=\mu_k}$

      1. Khan and Rue, The Bayesian Learning Rule, JMLR, 2023.
      2. Swaroop, S., Khan, M. E., and Doshi-Velez, F. Connecting Federated ADMM to Bayes. ICLR 2025.
      3. Bui, Thang D., et al. Partitioned variational inference: A unified framework encompassing federated and continual learning. arXiv:1811.11206 (2018).
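
As a concrete sketch, the server-side steps of BayesADMM in natural-parameter space, directly transcribing the $\bar\lambda$ and $\hat\lambda_i$ updates above (works for any exponential family; variable names are hypothetical, and the client inner solve is abstracted away):

```python
import numpy as np

def bayes_admm_server(lams, lam_hats, rho):
    """Server update from the slide:
    lam_bar <- (1-alpha)/K * sum_{k=1..K} lam_k + alpha * sum_{k=0..K} lam_hat_k,
    with alpha = 1/(1 + rho*K). lam_hats has K+1 entries; index 0 is the prior term."""
    K = len(lams)
    alpha = 1.0 / (1.0 + rho * K)
    return (1 - alpha) / K * np.sum(lams, axis=0) + alpha * np.sum(lam_hats, axis=0)

def bayes_admm_dual(lam_hat_i, lam_i, lam_bar, rho):
    """Dual step from the slide: lam_hat_i <- lam_hat_i + rho * (lam_i - lam_bar)."""
    return lam_hat_i + rho * (lam_i - lam_bar)
```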
  12. Comparison to Partitioned Variational Inference 13
      ADMM uses $\rho = 1$; PVI uses $\rho = 1/K$ (with damping) or $\rho = 1$ (without damping).
      1. Bui, Thang D., et al. Partitioned variational inference: A unified framework encompassing federated and continual learning. arXiv:1811.11206 (2018).
      2. Wang & Banerjee, Bregman Alternating Direction Method of Multipliers, NIPS 2013.
  13. Scalable Implementation for Large Neural Nets 14
      • Main difficulty: the first step is a variational learning/inference problem
      • We restrict the family to Gaussians with diagonal covariance
      • We use the recent IVON (Improved Variational Online Newton) optimizer [1]
      • IVON is a scalable drop-in replacement for Adam/RMSprop which does mean-field VI
      1. Y. Shen*, N. Daheim*, …, T. Möllenhoff. Variational Learning is Effective for Large Deep Networks, ICML 2024 (Spotlight).
  14. 15 RMSprop/Adam vs. Variational Online Newton

      Variational objective (entropy weighted by $\tau$, regularizer $R(\theta) = \frac{\delta}{2}\lVert\theta\rVert^2$):
      $\min_{q(\theta)} \mathbb{E}_{\theta\sim q}\big[\ell(\theta) + R(\theta)\big] - \tau H(q)$
      Natural-gradient update: $\lambda' = \lambda - \rho F_\lambda^{-1}\nabla_\lambda\big(\mathbb{E}_q[\ell + R] - H(q)\big)$

      RMSprop/Adam:
      1. $\hat g \leftarrow \hat\nabla\ell(\theta)$
      2. $\hat h \leftarrow \hat g^2$
      3. $h \leftarrow (1-\rho)h + \rho\hat h$
      4. $\theta \leftarrow \theta - \alpha(\hat g + \delta\theta)/(\sqrt{h} + \delta)$

      Variational Online Newton (noisy weights!); see the sketch after this slide:
      1. $\hat g \leftarrow \hat\nabla\ell(\theta)$, where $\theta \sim \mathcal{N}(m, \sigma^2)$
      2. $\hat h \leftarrow \hat g \cdot (\theta - m)/\sigma^2$ (unbiased estimator of the expected diagonal Hessian)
      3. $h \leftarrow (1-\rho)h + \rho\hat h + \rho^2(h-\hat h)^2/(2(h+\delta))$
      4. $m \leftarrow m - \alpha(\hat g + \delta m)/(h + \delta)$
      5. $\sigma^2 \leftarrow 1/(N(h + \delta))$

      [Figure: language model predicting P(next word) from a prompt/context, trained with noisy weights]

      M. E. Khan, H. Rue, The Bayesian Learning Rule, J. Mach. Learn. Res., 2023.
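
A NumPy sketch of one Variational Online Newton step, mirroring the five numbered lines above (`grad_loss` is a hypothetical minibatch-gradient oracle; `N` is the training-set size):

```python
import numpy as np

def von_step(m, sigma2, h, grad_loss, alpha, rho, delta, N, rng):
    """One step of (Improved) Variational Online Newton."""
    theta = m + np.sqrt(sigma2) * rng.standard_normal(m.shape)  # sample theta ~ N(m, sigma^2)
    g_hat = grad_loss(theta)                                    # 1. gradient at the noisy weights
    h_hat = g_hat * (theta - m) / sigma2                        # 2. unbiased diagonal-Hessian estimate
    h = (1 - rho) * h + rho * h_hat \
        + rho**2 * (h - h_hat)**2 / (2 * (h + delta))           # 3. Hessian average (IVON correction term)
    m = m - alpha * (g_hat + delta * m) / (h + delta)           # 4. Newton-like mean update
    sigma2 = 1.0 / (N * (h + delta))                            # 5. posterior variance
    return m, sigma2, h
```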
  15. Bayesian-style Training of Large Deep Nets 17
      • State-of-the-art uncertainty estimation with a few lines of PyTorch code (usage sketch below)
      • Easily scales to GPT-2 sized models
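
A minimal PyTorch training-loop sketch in the style of the public IVON package (github.com/team-approx-bayes/ivon); the constructor arguments and the `sampled_params` context manager follow that repo's README and should be treated as assumptions here:

```python
import torch
import torch.nn.functional as F
import ivon  # pip install ivon-opt

model = torch.nn.Linear(784, 10)  # placeholder model
optimizer = ivon.IVON(model.parameters(), lr=0.1, ess=60000)  # ess ~ training-set size

for X, y in train_loader:  # train_loader assumed to exist
    with optimizer.sampled_params(train=True):  # draw noisy weights theta ~ N(m, sigma^2)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(X), y)
        loss.backward()
    optimizer.step()  # IVON mean/variance update
```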
  16. Practical Deep Learning Implementation with IVON 19
      Required changes to FederatedADMM are highlighted in red.
      1. T. Möllenhoff*, S. Swaroop*, F. Doshi-Velez, M. E. Khan, Federated ADMM from Bayesian Duality, arXiv:2506.13150, 2025.
  17. 20 BayesADMM works well for heterogeneous data!
      • 7% better accuracy than FedLap at the same cost/speed!
      • 1% improved accuracy, with no need for an expensive Laplace approximation.
  18. 22 Thank you for your attention! Questions? Code available at:

    https://github.com/team-approx-bayes/bayes-admm