

Scalable Decentralized Learning with Teleportation (ICLRΒ 2025)

Yuki Takezawa

February 15, 2025

Transcript

  1. Scalable Decentralized Learning with Teleportation. Yuki Takezawa¹٫², Sebastian Stich³ (¹Kyoto Univ., ²OIST, ³CISPA).
  2. Background: Decentralized Learning — It is difficult to aggregate all training data in one place due to privacy. Each node is a server, which has its own (private) training dataset. (Figure: "Usual Training" vs. "Decentralized Learning".)
  3. Background: Decentralized Learning — It is difficult to aggregate all training data in one place due to privacy. Each node is a server, which has its own (private) training dataset. Only neural network parameters are exchanged between nodes.
  4. Background: Decentralized Learning — $n$: the number of nodes; $f_i$: loss function of node $i$; connected nodes can exchange parameters; $W_{ij}$: the weight of edge $(i, j)$. Goal: $\min_{\boldsymbol{x}} \frac{1}{n} \sum_{i=1}^{n} f_i(\boldsymbol{x})$. (Figure: a graph of five nodes with loss functions $f_1, \dots, f_5$.)
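For concreteness (my own illustration, not from the slides), one simple valid choice of edge weights is a uniform-weight ring, which gives a symmetric, doubly stochastic mixing matrix $W$:

```python
import numpy as np

def ring_mixing_matrix(n: int) -> np.ndarray:
    """Mixing matrix W for a ring of n >= 3 nodes.

    Each node averages itself and its two ring neighbors with weight 1/3.
    (Illustrative choice; any symmetric doubly stochastic W would do.)
    """
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

W = ring_mixing_matrix(5)
# Rows and columns sum to one, i.e., W is doubly stochastic.
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
```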
  5. Background: Decentralized SGD — (Figure: at round $r$, each node $i$ holds its own parameter $\boldsymbol{x}_i^{(r)}$.)
  6. Background: Decentralized SGD — Each node updates its parameter by SGD: $\boldsymbol{x}_1^{(r+\frac{1}{2})} = \boldsymbol{x}_1^{(r)} - \eta \nabla F_1(\boldsymbol{x}_1^{(r)}; \xi_1^{(r)})$.
  7. Background: Decentralized SGD — Each node exchanges its parameter $\boldsymbol{x}_i^{(r+\frac{1}{2})}$ with its neighbors (e.g., node 1 exchanges with nodes 2 and 5).
  8. Background: Decentralized SGD — Each node computes the weighted average with its neighbors: $\boldsymbol{x}_1^{(r+1)} = \sum_{j=1}^{n} W_{1j} \boldsymbol{x}_j^{(r+\frac{1}{2})}$.
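Putting the three steps (local SGD, exchange, weighted averaging) together, a minimal NumPy sketch of Decentralized SGD might look as follows; the quadratic toy losses, the ring weights, and the step size are illustrative assumptions, not the slides' setting.

```python
import numpy as np

def decentralized_sgd(grads, W, x0, eta=0.1, rounds=100):
    """Decentralized SGD: a local SGD step, then averaging with neighbors.

    grads: list of n callables; grads[i](x) returns a stochastic gradient
           of node i's loss at parameter x.
    W:     (n, n) doubly stochastic mixing matrix (row i = node i's weights).
    x0:    (n, d) array of initial parameters, one row per node.
    """
    n = len(grads)
    x = x0.copy()
    for _ in range(rounds):
        # Step 1: each node updates its parameter by SGD.
        x_half = np.stack([x[i] - eta * grads[i](x[i]) for i in range(n)])
        # Steps 2-3: exchange with neighbors and take the weighted average.
        x = W @ x_half
    return x

# Toy usage: node i holds the quadratic loss ||x - c_i||^2 / 2 on a 5-node ring
# (illustrative setup, not the slides' experiments).
rng = np.random.default_rng(0)
n, d = 5, 3
c = rng.normal(size=(n, d))
grads = [lambda x, ci=ci: (x - ci) + 0.01 * rng.normal(size=d) for ci in c]
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0
x_final = decentralized_sgd(grads, W, x0=np.zeros((n, d)))
print(np.linalg.norm(x_final.mean(axis=0) - c.mean(axis=0)))  # small: near the optimum
```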
  9. Background: Challenges in Decentralized Learning — Training becomes difficult when the number of nodes is substantial. (Figure: a graph with many nodes.)
  10. Background: Challenges in Decentralized Learning — When the number of nodes is extremely large, the parameters of the nodes drift away from each other: the parameter of a far-away node might be very different from the parameter of the white node. (Figure: a long chain of nodes.)
  11. Background: Challenge in Decentralized SGD — Update rule of node $i$: $\boldsymbol{x}_i^{(r+1)} = \sum_{j=1}^{n} W_{ij} \left( \boldsymbol{x}_j^{(r)} - \eta \nabla F_j(\boldsymbol{x}_j^{(r)}; \xi_j^{(r)}) \right)$. Update rule of the average parameter $\bar{\boldsymbol{x}} := \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_i$: $\bar{\boldsymbol{x}}^{(r+1)} = \bar{\boldsymbol{x}}^{(r)} - \frac{\eta}{n} \sum_{i=1}^{n} \nabla F_i(\boldsymbol{x}_i^{(r)}; \xi_i^{(r)})$.
  12. Background: Challenge in Decentralized SGD — Update rule of node $i$: $\boldsymbol{x}_i^{(r+1)} = \sum_{j=1}^{n} W_{ij} \left( \boldsymbol{x}_j^{(r)} - \eta \nabla F_j(\boldsymbol{x}_j^{(r)}; \xi_j^{(r)}) \right)$. Update rule of the average parameter $\bar{\boldsymbol{x}} := \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_i$: $\bar{\boldsymbol{x}}^{(r+1)} = \bar{\boldsymbol{x}}^{(r)} - \frac{\eta}{n} \sum_{i=1}^{n} \nabla F_i(\boldsymbol{x}_i^{(r)}; \xi_i^{(r)})$. ◼ If $\boldsymbol{x}_i = \bar{\boldsymbol{x}}$ for all $i$ (i.e., every node has the same parameters), this is exactly SGD. ◼ If $\boldsymbol{x}_i$ is far from $\bar{\boldsymbol{x}}$, training becomes difficult.
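The average-parameter rule follows from the node-wise rule because the mixing matrix is doubly stochastic (each column sums to one), as is standard for mixing matrices and as the spectral-gap assumption on the next slides presumes:

```latex
\[
\bar{\boldsymbol{x}}^{(r+1)}
  = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}
    \left(\boldsymbol{x}_j^{(r)} - \eta\,\nabla F_j(\boldsymbol{x}_j^{(r)};\xi_j^{(r)})\right)
  = \frac{1}{n}\sum_{j=1}^{n}\Big(\underbrace{\textstyle\sum_{i=1}^{n} W_{ij}}_{=\,1}\Big)
    \left(\boldsymbol{x}_j^{(r)} - \eta\,\nabla F_j(\boldsymbol{x}_j^{(r)};\xi_j^{(r)})\right)
  = \bar{\boldsymbol{x}}^{(r)} - \frac{\eta}{n}\sum_{j=1}^{n} \nabla F_j(\boldsymbol{x}_j^{(r)};\xi_j^{(r)}).
\]
```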
  13. Background: Challenges in Decentralized Learning — When the number of nodes is extremely large, the parameters of the nodes drift away from each other: the parameter of a far-away node might be very different from the parameter of the white node. (Figure: a long chain of nodes.)
  14. Background: Challenges in Decentralized Learning — When the number of nodes is extremely large, the parameters of the nodes drift away from each other: the parameter of a far-away node might be very different from the parameter of the white node. In theory, more iterations are required as the number of nodes increases.
  15. Background: Assumptions — Assumption 1 (Smoothness): $\|\nabla f_i(\boldsymbol{x}) - \nabla f_i(\boldsymbol{y})\| \le L \|\boldsymbol{x} - \boldsymbol{y}\|$ for any $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^d$. Assumption 2 (Stochastic Gradient Noise): $\mathbb{E}\|\nabla F_i(\boldsymbol{x}; \xi) - \nabla f_i(\boldsymbol{x})\|^2 \le \sigma^2$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Assumption 3 (Data Heterogeneity): $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(\boldsymbol{x}) - \nabla f(\boldsymbol{x})\|^2 \le \zeta^2$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Assumption 4 (Spectral Gap): the mixing matrix $W \in [0,1]^{n \times n}$ has a spectral gap $p_n \in (0, 1]$.
  16. Background: Assumptions — Assumption 1 (Smoothness): $\|\nabla f_i(\boldsymbol{x}) - \nabla f_i(\boldsymbol{y})\| \le L \|\boldsymbol{x} - \boldsymbol{y}\|$ for any $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^d$. Assumption 2 (Stochastic Gradient Noise): $\mathbb{E}\|\nabla F_i(\boldsymbol{x}; \xi) - \nabla f_i(\boldsymbol{x})\|^2 \le \sigma^2$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Assumption 3 (Data Heterogeneity): $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(\boldsymbol{x}) - \nabla f(\boldsymbol{x})\|^2 \le \zeta^2$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Assumption 4 (Spectral Gap): the mixing matrix $W \in [0,1]^{n \times n}$ has a spectral gap $p_n \in (0, 1]$. $p_n$ represents how dense the graph is; $p_n$ approaches zero as $n$ increases.
  17. Background: Decentralized SGD — Convergence Rate of Decentralized SGD: Decentralized SGD converges at the rate $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \sqrt{\frac{L r_0 \sigma^2}{nT}} + \left( \frac{L^2 r_0^2 \left( p_n \sigma^2 + (1 - p_n)\zeta^2 \right)}{T^2 p_n^2} \right)^{\frac{1}{3}} + \frac{L r_0}{T p_n} \right)$, where $r_0 := f(\boldsymbol{x}^{(0)}) - f^\star$. ($n$: #nodes, $T$: #iterations, $p_n$: spectral gap.)
  18. Background: Decentralized SGD — Convergence Rate of Decentralized SGD: Decentralized SGD converges at the rate $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{1}{T^2 p_n^2} \right)^{\frac{1}{3}} + \frac{1}{T p_n} \right)$. ◼ $p_n$ approaches zero as $n$ increases. ◼ Hence more iterations $T$ are required when $n$ is substantially large. ($n$: #nodes, $T$: #iterations, $p_n$: spectral gap.)
  19. Background: Decentralized SGD — Convergence Rate of Decentralized SGD: Decentralized SGD converges at the rate $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{1}{T^2 p_n^2} \right)^{\frac{1}{3}} + \frac{1}{T p_n} \right)$. ◼ $p_n$ approaches zero as $n$ increases. ◼ Hence more iterations $T$ are required when $n$ is substantially large. ($n$: #nodes, $T$: #iterations, $p_n$: spectral gap.)
  20. Background: Decentralized SGD with Ring — Convergence Rate of Decentralized SGD with a Ring: Decentralized SGD on a ring converges at the rate $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{n^4}{T^2} \right)^{\frac{1}{3}} + \frac{n^2}{T} \right)$. ◼ More iterations $T$ are required as $n$ becomes substantially large. ($n$: #nodes, $T$: #iterations.)
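Up to constants, this is the generic bound of the previous slides with a ring's spectral gap $p_n = \Theta(1/n^2)$ substituted in:

```latex
\[
p_n = \Theta\!\left(\frac{1}{n^2}\right)
\;\Longrightarrow\;
\left(\frac{1}{T^2 p_n^2}\right)^{\frac{1}{3}} = \left(\frac{n^4}{T^2}\right)^{\frac{1}{3}},
\qquad
\frac{1}{T p_n} = \frac{n^2}{T}.
\]
```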
  21. Background: Challenges in Decentralized Learning — When the number of nodes is extremely large, the parameters of the nodes drift away from each other: the parameter of a far-away node might be very different from the parameter of the white node. (Figure: a long chain of nodes.)
  22. Background: Topology Design for Scalable Decentralized Learning — If the graph is dense, we can prevent the parameters from drifting away.
  23. Background: Decentralized SGD with a SOTA Graph — Convergence Rate of Decentralized SGD with the Base-2 Graph: Decentralized SGD with the Base-2 Graph achieves $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{\log_2^2 n}{T^2} \right)^{\frac{1}{3}} + \frac{\log_2 n}{T} \right)$. ◼ The convergence rate still degrades as $n$ becomes substantially large. ($n$: #nodes, $T$: #iterations.)
  24. Background: Decentralized SGD with a SOTA Graph — Convergence Rate of Decentralized SGD with the Base-2 Graph: Decentralized SGD with the Base-2 Graph achieves $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{\log_2^2 n}{T^2} \right)^{\frac{1}{3}} + \frac{\log_2 n}{T} \right)$. ◼ The convergence rate still degrades as $n$ becomes substantially large. ($n$: #nodes, $T$: #iterations.)
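For comparison, this matches the generic bound with an effective spectral gap of order $1/\log_2 n$ plugged in (my reading; the Base-2 Graph analysis may state this differently):

```latex
\[
p_n = \Theta\!\left(\frac{1}{\log_2 n}\right)
\;\Longrightarrow\;
\left(\frac{1}{T^2 p_n^2}\right)^{\frac{1}{3}} = \left(\frac{\log_2^2 n}{T^2}\right)^{\frac{1}{3}},
\qquad
\frac{1}{T p_n} = \frac{\log_2 n}{T}.
\]
```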
  25. Background: Research Question — Can we develop a method that completely alleviates the degradation caused by a large number of nodes? Convergence Rate of the Proposed Method (Informal): we can achieve $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \frac{1}{T^{4/7}} + \frac{1}{T^{3/5}} + \frac{1}{T} \right)$.
  26. Proposed Method: Key Idea — Problem: the parameters of the nodes drift away because the number of nodes is large. Idea: even if the total number of nodes is large, we can simulate Decentralized SGD with a small number of nodes.
  27. Proposed Method — Example with $n = 9$: run Decentralized SGD on the active nodes only. ◼ $\boldsymbol{x}_i^{(r+\frac{1}{2})} = \boldsymbol{x}_i^{(r)} - \eta \nabla F_i(\boldsymbol{x}_i^{(r)}; \xi_i^{(r)})$. ◼ Exchange parameters $\boldsymbol{x}_i^{(r+\frac{1}{2})}$ with neighbors. ◼ $\boldsymbol{x}_i^{(r+1)} = \sum_{j=1}^{n} W_{ij} \boldsymbol{x}_j^{(r+\frac{1}{2})}$.
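A minimal NumPy sketch of this idea as I read it from the slides: each round, $k$ of the $n$ nodes are activated, the $k$ current model copies are handed ("teleported") to them, each active node takes a local SGD step on its own data, and the active nodes average over a small ring among themselves. The hand-off details, the sampling of active nodes, and the toy losses below are my assumptions and may differ from the paper's actual algorithm.

```python
import numpy as np

def teleportation_sketch(grads, n, k, d, eta=0.1, rounds=300, seed=0):
    """Simulate Decentralized SGD among only k active nodes per round.

    grads[i](x): stochastic gradient of node i's loss at x (n nodes in total).
    Only k model copies are kept; each round they are handed over to k freshly
    sampled active nodes, which take an SGD step and average over a small ring.
    """
    assert k >= 3, "this toy ring mixing matrix assumes k >= 3"
    rng = np.random.default_rng(seed)
    Wk = np.zeros((k, k))                      # mixing matrix of the active-node ring
    for a in range(k):
        for b in (a - 1, a, a + 1):
            Wk[a, b % k] = 1.0 / 3.0
    x = np.zeros((k, d))                       # the k model copies
    for _ in range(rounds):
        active = rng.choice(n, size=k, replace=False)   # sample the active nodes
        # Each active node takes an SGD step on the copy handed to it.
        x_half = np.stack([x[a] - eta * grads[i](x[a]) for a, i in enumerate(active)])
        x = Wk @ x_half                        # average over the small ring
    return x.mean(axis=0)

# Toy usage with quadratic losses ||x - c_i||^2 / 2 (illustrative constants).
rng = np.random.default_rng(1)
n, k, d = 100, 8, 3
c = rng.normal(size=(n, d))
grads = [lambda x, ci=ci: (x - ci) for ci in c]
print(np.linalg.norm(teleportation_sketch(grads, n, k, d) - c.mean(axis=0)))
```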
  28. Idea: even if the total number of nodes is large, we can simulate Decentralized SGD with a small number of nodes.
  29. Proposed Method: Convergence Analysis — Assumption: the mixing matrix $W_k \in [0,1]^{k \times k}$ has a spectral gap $p_k$. Convergence Rate of Teleportation: let $k$ be the number of active nodes; then we can achieve $O\left( \sqrt{\frac{L r_0 \left( \sigma^2 + \left( 1 - \frac{k-1}{n-1} \right) \zeta^2 \right)}{kT}} + \left( \frac{L^2 r_0^2 (\sigma^2 + \zeta^2)}{T^2 p_k} \right)^{\frac{1}{3}} + \frac{L r_0}{p_k T} \right)$, where $r_0 := f(\boldsymbol{x}^{(0)}) - f^\star$. $p_k$ depends only on the small graph consisting of the active nodes. ($n$: #nodes, $T$: #iterations.)
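To see how the next slide's choice of $k$ arises: with a ring among the $k$ active nodes, $p_k = \Theta(1/k^2)$, so (bounding the factor $1 - \frac{k-1}{n-1}$ by one and dropping constants) the bound behaves like

```latex
\[
\sqrt{\frac{L r_0\,(\sigma^2 + \zeta^2)}{kT}}
\;+\;
\left(\frac{L^2 r_0^2\,(\sigma^2+\zeta^2)\,k^2}{T^2}\right)^{\frac{1}{3}}
\;+\;
\frac{L r_0\, k^2}{T}.
\]
```

The first term shrinks as $k$ grows while the last two grow with $k$; balancing them is what produces the $(\cdot)^{1/5}$ and $(\cdot)^{1/7}$ choices of $k$ on the next slide.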
  30. Proposed Method — By carefully tuning the number of active nodes $k$, we get the following. Convergence Rate of the Proposed Method: if we use a ring as the topology and set $k = \max\left\{ 1, \min\left\{ \left( \frac{T(\sigma^2+\zeta^2)}{L r_0} \right)^{\frac{1}{5}}, \left( \frac{T(\sigma^2+\zeta^2)}{L r_0} \right)^{\frac{1}{7}}, n \right\} \right\}$, then we can achieve $O\left( \sqrt{\frac{L r_0 \sigma^2}{nT}} + \left( \frac{L^2 r_0^2 (\sigma^2+\zeta^2)^{3/4}}{T} \right)^{\frac{4}{7}} + \left( \frac{L^2 r_0^2 (\sigma^2+\zeta^2)^{2/3}}{T} \right)^{\frac{3}{5}} + \frac{L r_0}{T} \right)$, where $r_0 := f(\boldsymbol{x}^{(0)}) - f^\star$. ($n$: #nodes, $T$: #iterations.)
  31. Proposed Method — By carefully tuning the number of active nodes $k$, we get the following. Convergence Rate of the Proposed Method: if we use a ring as the topology and set $k = \max\left\{ 1, \min\left\{ \left( \frac{T(\sigma^2+\zeta^2)}{L r_0} \right)^{\frac{1}{5}}, \left( \frac{T(\sigma^2+\zeta^2)}{L r_0} \right)^{\frac{1}{7}}, n \right\} \right\}$, then we can achieve $O\left( \sqrt{\frac{L r_0 \sigma^2}{nT}} + \left( \frac{L^2 r_0^2 (\sigma^2+\zeta^2)^{3/4}}{T} \right)^{\frac{4}{7}} + \left( \frac{L^2 r_0^2 (\sigma^2+\zeta^2)^{2/3}}{T} \right)^{\frac{3}{5}} + \frac{L r_0}{T} \right)$, where $r_0 := f(\boldsymbol{x}^{(0)}) - f^\star$. The convergence rate consistently improves as $n$ increases!
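For illustration only, the rule for $k$ above can be computed as follows; the function name and the toy constants are mine, and in practice $\sigma^2$, $\zeta^2$, $L$, and $r_0$ would have to be estimated.

```python
def choose_num_active_nodes(T, n, sigma2, zeta2, L, r0):
    """k = max(1, min((T(s^2+z^2)/(L r0))^(1/5), (T(s^2+z^2)/(L r0))^(1/7), n))."""
    base = T * (sigma2 + zeta2) / (L * r0)
    k = max(1.0, min(base ** (1 / 5), base ** (1 / 7), float(n)))
    return max(1, int(k))  # round down to an integer number of active nodes

# Example with made-up constants.
print(choose_num_active_nodes(T=10_000, n=1_000, sigma2=1.0, zeta2=1.0, L=1.0, r0=1.0))
```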
  32. Proposed Method: Research Question 2 — Can we develop an efficient hyperparameter-tuning method? We must tune the number of active nodes $k$ over the range from 1 to $n$, and grid search would require $nT$ iterations. Yes! We can obtain a proper $k$ within $2T$ iterations.
  33. Proposed Method: Efficient Hyperparameter-Tuning Method — ◼ Grid search over $K$ can find a proper $k$. ◼ Moreover, this grid search can run in parallel. Key Lemma: define $K := \{1, 2, 4, 8, \dots, 2^{\lfloor \log_2(n+1) \rfloor - 1}\}$. For any $k^\star < n$, there exists $k \in K$ such that $\frac{k^\star}{4} < k \le k^\star$. Furthermore, it holds that $\sum_{k \in K} k \le n$.
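A quick numerical check of the lemma (my own illustration; `candidate_set` is a hypothetical helper name):

```python
import math

def candidate_set(n: int) -> list[int]:
    """K = {1, 2, 4, ..., 2^(floor(log2(n+1)) - 1)}."""
    return [2 ** j for j in range(math.floor(math.log2(n + 1)))]

for n in (5, 10, 100, 1000):
    K = candidate_set(n)
    assert sum(K) <= n  # all candidates fit on n nodes, so they can run in parallel
    # Every k* < n is approximated within a factor of 4 by some k in K.
    assert all(any(ks / 4 < k <= ks for k in K) for ks in range(1, n))
    print(n, K)
```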
  34. Proposed Method: Efficient Hyperparameter-Tuning Method — Theorem (Informal): this hyperparameter-tuning method achieves exactly the same convergence rate as that with the optimal $k^\star$. 1. Run Teleportation with every $k \in K$ in parallel. 2. Run Teleportation with $k = n$ (full activation). 3. Choose the best parameters. We can obtain a proper $k$ within $2T$ iterations.
  35. Experiment: Synthetic Function — $\sigma^2$: stochastic gradient noise; $\zeta^2$: heterogeneity of the loss functions. (Figure: results on the synthetic function.)
  36. Experiment: Synthetic Function — $\sigma^2$: stochastic gradient noise; $\zeta^2$: heterogeneity of the loss functions. Teleportation converges faster than Decentralized SGD.
  37. Conclusion — We proposed Teleportation: ◼ The convergence rate of Teleportation does not degrade as $n$ increases. ◼ We also proposed an efficient hyperparameter-tuning method. ◼ We numerically demonstrated its effectiveness. (Link: paper.)