

Scalable Decentralized Learning with Teleportation (ICLRΒ 2025)

Yuki Takezawa

February 15, 2025

Transcript

  1. Scalable Decentralized Learning with Teleportation. Yuki Takezawa¹٫², Sebastian Stich³ (¹Kyoto Univ., ²OIST, ³CISPA).
  2. Background: Decentralized Learning — It is difficult to aggregate all training data in one place due to privacy. Each node is a server, which has its own (private) training dataset. (Figure: "Usual Training" vs. "Decentralized Learning".)
  3. Background: Decentralized Learning — It is difficult to aggregate all training data in one place due to privacy. Each node is a server, which has its own (private) training dataset. Only neural network parameters are exchanged between nodes.
  4. Background: Decentralized Learning — $n$: the number of nodes; $f_i$: loss function of node $i$; connected nodes can exchange parameters; $W_{ij}$: the weight of edge $(i, j)$. Goal: $\min_{\boldsymbol{x}} \frac{1}{n} \sum_{i=1}^{n} f_i(\boldsymbol{x})$. (Figure: a graph of five nodes with loss functions $f_1, \dots, f_5$.)
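For concreteness (my own illustration, not from the slides), one simple valid choice of edge weights is a uniform-weight ring, which gives a symmetric, doubly stochastic mixing matrix $W$:

```python
import numpy as np

def ring_mixing_matrix(n: int) -> np.ndarray:
    """Mixing matrix W for a ring of n >= 3 nodes.

    Each node averages itself and its two ring neighbors with weight 1/3.
    (Illustrative choice; any symmetric doubly stochastic W would do.)
    """
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

W = ring_mixing_matrix(5)
# Rows and columns sum to one, i.e., W is doubly stochastic.
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
```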
  5. Background: Decentralized SGD — (Figure: at round $r$, each node $i$ holds its own parameter $\boldsymbol{x}_i^{(r)}$.)
  6. Background: Decentralized SGD — Each node updates its parameter by SGD: $\boldsymbol{x}_1^{(r+\frac{1}{2})} = \boldsymbol{x}_1^{(r)} - \eta \nabla F_1(\boldsymbol{x}_1^{(r)}; \xi_1^{(r)})$.
  7. Background: Decentralized SGD — Each node exchanges its parameter $\boldsymbol{x}_i^{(r+\frac{1}{2})}$ with its neighbors (e.g., node 1 exchanges with nodes 2 and 5).
  8. Background: Decentralized SGD — Each node computes the weighted average with its neighbors: $\boldsymbol{x}_1^{(r+1)} = \sum_{j=1}^{n} W_{1j} \boldsymbol{x}_j^{(r+\frac{1}{2})}$.
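Putting the three steps (local SGD, exchange, weighted averaging) together, a minimal NumPy sketch of Decentralized SGD might look as follows; the quadratic toy losses, the ring weights, and the step size are illustrative assumptions, not the slides' setting.

```python
import numpy as np

def decentralized_sgd(grads, W, x0, eta=0.1, rounds=100):
    """Decentralized SGD: a local SGD step, then averaging with neighbors.

    grads: list of n callables; grads[i](x) returns a stochastic gradient
           of node i's loss at parameter x.
    W:     (n, n) doubly stochastic mixing matrix (row i = node i's weights).
    x0:    (n, d) array of initial parameters, one row per node.
    """
    n = len(grads)
    x = x0.copy()
    for _ in range(rounds):
        # Step 1: each node updates its parameter by SGD.
        x_half = np.stack([x[i] - eta * grads[i](x[i]) for i in range(n)])
        # Steps 2-3: exchange with neighbors and take the weighted average.
        x = W @ x_half
    return x

# Toy usage: node i holds the quadratic loss ||x - c_i||^2 / 2 on a 5-node ring
# (illustrative setup, not the slides' experiments).
rng = np.random.default_rng(0)
n, d = 5, 3
c = rng.normal(size=(n, d))
grads = [lambda x, ci=ci: (x - ci) + 0.01 * rng.normal(size=d) for ci in c]
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0
x_final = decentralized_sgd(grads, W, x0=np.zeros((n, d)))
print(np.linalg.norm(x_final.mean(axis=0) - c.mean(axis=0)))  # small: near the optimum
```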
  9. Background: Challenges in Decentralized Learning — Training becomes difficult when the number of nodes is substantial. (Figure: a graph with many nodes.)
  10. Background: Challenges in Decentralized Learning — When the number of nodes is extremely large, the parameters of the nodes drift away from each other: the parameter of a far-away node might be very different from the parameter of the white node. (Figure: a long chain of nodes.)
  11. Background: Challenge in Decentralized SGD — Update rule of node $i$: $\boldsymbol{x}_i^{(r+1)} = \sum_{j=1}^{n} W_{ij} \left( \boldsymbol{x}_j^{(r)} - \eta \nabla F_j(\boldsymbol{x}_j^{(r)}; \xi_j^{(r)}) \right)$. Update rule of the average parameter $\bar{\boldsymbol{x}} := \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_i$: $\bar{\boldsymbol{x}}^{(r+1)} = \bar{\boldsymbol{x}}^{(r)} - \frac{\eta}{n} \sum_{i=1}^{n} \nabla F_i(\boldsymbol{x}_i^{(r)}; \xi_i^{(r)})$.
  12. Background: Challenge in Decentralized SGD — Update rule of node $i$: $\boldsymbol{x}_i^{(r+1)} = \sum_{j=1}^{n} W_{ij} \left( \boldsymbol{x}_j^{(r)} - \eta \nabla F_j(\boldsymbol{x}_j^{(r)}; \xi_j^{(r)}) \right)$. Update rule of the average parameter $\bar{\boldsymbol{x}} := \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_i$: $\bar{\boldsymbol{x}}^{(r+1)} = \bar{\boldsymbol{x}}^{(r)} - \frac{\eta}{n} \sum_{i=1}^{n} \nabla F_i(\boldsymbol{x}_i^{(r)}; \xi_i^{(r)})$. ◼ If $\boldsymbol{x}_i = \bar{\boldsymbol{x}}$ for all $i$ (i.e., every node has the same parameters), this is exactly SGD. ◼ If $\boldsymbol{x}_i$ is far from $\bar{\boldsymbol{x}}$, training becomes difficult.
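The average-parameter rule follows from the node-wise rule because the mixing matrix is doubly stochastic (each column sums to one), as is standard for mixing matrices and as the spectral-gap assumption on the next slides presumes:

```latex
\[
\bar{\boldsymbol{x}}^{(r+1)}
  = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}
    \left(\boldsymbol{x}_j^{(r)} - \eta\,\nabla F_j(\boldsymbol{x}_j^{(r)};\xi_j^{(r)})\right)
  = \frac{1}{n}\sum_{j=1}^{n}\Big(\underbrace{\textstyle\sum_{i=1}^{n} W_{ij}}_{=\,1}\Big)
    \left(\boldsymbol{x}_j^{(r)} - \eta\,\nabla F_j(\boldsymbol{x}_j^{(r)};\xi_j^{(r)})\right)
  = \bar{\boldsymbol{x}}^{(r)} - \frac{\eta}{n}\sum_{j=1}^{n} \nabla F_j(\boldsymbol{x}_j^{(r)};\xi_j^{(r)}).
\]
```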
  13. Background: Challenges in Decentralized Learning — When the number of nodes is extremely large, the parameters of the nodes drift away from each other: the parameter of a far-away node might be very different from the parameter of the white node. (Figure: a long chain of nodes.)
  14. Background: Challenges in Decentralized Learning — When the number of nodes is extremely large, the parameters of the nodes drift away from each other: the parameter of a far-away node might be very different from the parameter of the white node. In theory, more iterations are required as the number of nodes increases.
  15. Background: Assumptions — Assumption 1 (Smoothness): $\|\nabla f_i(\boldsymbol{x}) - \nabla f_i(\boldsymbol{y})\| \le L \|\boldsymbol{x} - \boldsymbol{y}\|$ for any $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^d$. Assumption 2 (Stochastic Gradient Noise): $\mathbb{E}\|\nabla F_i(\boldsymbol{x}; \xi) - \nabla f_i(\boldsymbol{x})\|^2 \le \sigma^2$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Assumption 3 (Data Heterogeneity): $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(\boldsymbol{x}) - \nabla f(\boldsymbol{x})\|^2 \le \zeta^2$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Assumption 4 (Spectral Gap): the mixing matrix $W \in [0,1]^{n \times n}$ has a spectral gap $p_n \in (0, 1]$.
  16. Background: Assumptions — Assumption 1 (Smoothness): $\|\nabla f_i(\boldsymbol{x}) - \nabla f_i(\boldsymbol{y})\| \le L \|\boldsymbol{x} - \boldsymbol{y}\|$ for any $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^d$. Assumption 2 (Stochastic Gradient Noise): $\mathbb{E}\|\nabla F_i(\boldsymbol{x}; \xi) - \nabla f_i(\boldsymbol{x})\|^2 \le \sigma^2$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Assumption 3 (Data Heterogeneity): $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(\boldsymbol{x}) - \nabla f(\boldsymbol{x})\|^2 \le \zeta^2$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Assumption 4 (Spectral Gap): the mixing matrix $W \in [0,1]^{n \times n}$ has a spectral gap $p_n \in (0, 1]$. $p_n$ represents how dense the graph is; $p_n$ approaches zero as $n$ increases.
  17. Background: Decentralized SGD — Convergence Rate of Decentralized SGD: Decentralized SGD converges at the rate $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \sqrt{\frac{L r_0 \sigma^2}{nT}} + \left( \frac{L^2 r_0^2 \left( p_n \sigma^2 + (1 - p_n)\zeta^2 \right)}{T^2 p_n^2} \right)^{\frac{1}{3}} + \frac{L r_0}{T p_n} \right)$, where $r_0 := f(\boldsymbol{x}^{(0)}) - f^\star$. ($n$: #nodes, $T$: #iterations, $p_n$: spectral gap.)
  18. Background: Decentralized SGD — Convergence Rate of Decentralized SGD: Decentralized SGD converges at the rate $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{1}{T^2 p_n^2} \right)^{\frac{1}{3}} + \frac{1}{T p_n} \right)$. ◼ $p_n$ approaches zero as $n$ increases. ◼ Hence more iterations $T$ are required when $n$ is substantially large. ($n$: #nodes, $T$: #iterations, $p_n$: spectral gap.)
  19. Background: Decentralized SGD — Convergence Rate of Decentralized SGD: Decentralized SGD converges at the rate $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{1}{T^2 p_n^2} \right)^{\frac{1}{3}} + \frac{1}{T p_n} \right)$. ◼ $p_n$ approaches zero as $n$ increases. ◼ Hence more iterations $T$ are required when $n$ is substantially large. ($n$: #nodes, $T$: #iterations, $p_n$: spectral gap.)
  20. Background: Decentralized SGD with Ring — Convergence Rate of Decentralized SGD with a Ring: Decentralized SGD on a ring converges at the rate $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{n^4}{T^2} \right)^{\frac{1}{3}} + \frac{n^2}{T} \right)$. ◼ More iterations $T$ are required as $n$ becomes substantially large. ($n$: #nodes, $T$: #iterations.)
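Up to constants, this is the generic bound of the previous slides with a ring's spectral gap $p_n = \Theta(1/n^2)$ substituted in:

```latex
\[
p_n = \Theta\!\left(\frac{1}{n^2}\right)
\;\Longrightarrow\;
\left(\frac{1}{T^2 p_n^2}\right)^{\frac{1}{3}} = \left(\frac{n^4}{T^2}\right)^{\frac{1}{3}},
\qquad
\frac{1}{T p_n} = \frac{n^2}{T}.
\]
```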
  21. Background: Challenges in Decentralized Learning — When the number of nodes is extremely large, the parameters of the nodes drift away from each other: the parameter of a far-away node might be very different from the parameter of the white node. (Figure: a long chain of nodes.)
  22. Background: Topology Design for Scalable Decentralized Learning — If the graph is dense, we can prevent the parameters from drifting away.
  23. Background: Decentralized SGD with a SOTA Graph — Convergence Rate of Decentralized SGD with the Base-2 Graph: Decentralized SGD with the Base-2 Graph achieves $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{\log_2^2 n}{T^2} \right)^{\frac{1}{3}} + \frac{\log_2 n}{T} \right)$. ◼ The convergence rate still degrades as $n$ becomes substantially large. ($n$: #nodes, $T$: #iterations.)
  24. Background: Decentralized SGD with a SOTA Graph — Convergence Rate of Decentralized SGD with the Base-2 Graph: Decentralized SGD with the Base-2 Graph achieves $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \left( \frac{\log_2^2 n}{T^2} \right)^{\frac{1}{3}} + \frac{\log_2 n}{T} \right)$. ◼ The convergence rate still degrades as $n$ becomes substantially large. ($n$: #nodes, $T$: #iterations.)
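For comparison, this matches the generic bound with an effective spectral gap of order $1/\log_2 n$ plugged in (my reading; the Base-2 Graph analysis may state this differently):

```latex
\[
p_n = \Theta\!\left(\frac{1}{\log_2 n}\right)
\;\Longrightarrow\;
\left(\frac{1}{T^2 p_n^2}\right)^{\frac{1}{3}} = \left(\frac{\log_2^2 n}{T^2}\right)^{\frac{1}{3}},
\qquad
\frac{1}{T p_n} = \frac{\log_2 n}{T}.
\]
```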
  25. Background: Research Question — Can we develop a method that completely alleviates the degradation caused by a large number of nodes? Convergence Rate of the Proposed Method (Informal): we can achieve $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{\boldsymbol{x}}^{(t)})\|^2 \le O\left( \frac{1}{\sqrt{nT}} + \frac{1}{T^{4/7}} + \frac{1}{T^{3/5}} + \frac{1}{T} \right)$.
  26. Proposed Method: Key Idea — Problem: the parameters of the nodes drift away because the number of nodes is large. Idea: even if the total number of nodes is large, we can simulate Decentralized SGD with a small number of nodes.
  27. Proposed Method — Example with $n = 9$: run Decentralized SGD on the active nodes only. ◼ $\boldsymbol{x}_i^{(r+\frac{1}{2})} = \boldsymbol{x}_i^{(r)} - \eta \nabla F_i(\boldsymbol{x}_i^{(r)}; \xi_i^{(r)})$. ◼ Exchange parameters $\boldsymbol{x}_i^{(r+\frac{1}{2})}$ with neighbors. ◼ $\boldsymbol{x}_i^{(r+1)} = \sum_{j=1}^{n} W_{ij} \boldsymbol{x}_j^{(r+\frac{1}{2})}$.
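A minimal NumPy sketch of this idea as I read it from the slides: each round, $k$ of the $n$ nodes are activated, the $k$ current model copies are handed ("teleported") to them, each active node takes a local SGD step on its own data, and the active nodes average over a small ring among themselves. The hand-off details, the sampling of active nodes, and the toy losses below are my assumptions and may differ from the paper's actual algorithm.

```python
import numpy as np

def teleportation_sketch(grads, n, k, d, eta=0.1, rounds=300, seed=0):
    """Simulate Decentralized SGD among only k active nodes per round.

    grads[i](x): stochastic gradient of node i's loss at x (n nodes in total).
    Only k model copies are kept; each round they are handed over to k freshly
    sampled active nodes, which take an SGD step and average over a small ring.
    """
    assert k >= 3, "this toy ring mixing matrix assumes k >= 3"
    rng = np.random.default_rng(seed)
    Wk = np.zeros((k, k))                      # mixing matrix of the active-node ring
    for a in range(k):
        for b in (a - 1, a, a + 1):
            Wk[a, b % k] = 1.0 / 3.0
    x = np.zeros((k, d))                       # the k model copies
    for _ in range(rounds):
        active = rng.choice(n, size=k, replace=False)   # sample the active nodes
        # Each active node takes an SGD step on the copy handed to it.
        x_half = np.stack([x[a] - eta * grads[i](x[a]) for a, i in enumerate(active)])
        x = Wk @ x_half                        # average over the small ring
    return x.mean(axis=0)

# Toy usage with quadratic losses ||x - c_i||^2 / 2 (illustrative constants).
rng = np.random.default_rng(1)
n, k, d = 100, 8, 3
c = rng.normal(size=(n, d))
grads = [lambda x, ci=ci: (x - ci) for ci in c]
print(np.linalg.norm(teleportation_sketch(grads, n, k, d) - c.mean(axis=0)))
```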
  28. Idea: even if the total number of nodes is large, we can simulate Decentralized SGD with a small number of nodes.
  29. Proposed Method: Convergence Analysis — Assumption: the mixing matrix $W_k \in [0,1]^{k \times k}$ has a spectral gap $p_k$. Convergence Rate of Teleportation: let $k$ be the number of active nodes; then we can achieve $O\left( \sqrt{\frac{L r_0 \left( \sigma^2 + \left( 1 - \frac{k-1}{n-1} \right) \zeta^2 \right)}{kT}} + \left( \frac{L^2 r_0^2 (\sigma^2 + \zeta^2)}{T^2 p_k} \right)^{\frac{1}{3}} + \frac{L r_0}{p_k T} \right)$, where $r_0 := f(\boldsymbol{x}^{(0)}) - f^\star$. $p_k$ depends only on the small graph consisting of the active nodes. ($n$: #nodes, $T$: #iterations.)
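To see how the next slide's choice of $k$ arises: with a ring among the $k$ active nodes, $p_k = \Theta(1/k^2)$, so (bounding the factor $1 - \frac{k-1}{n-1}$ by one and dropping constants) the bound behaves like

```latex
\[
\sqrt{\frac{L r_0\,(\sigma^2 + \zeta^2)}{kT}}
\;+\;
\left(\frac{L^2 r_0^2\,(\sigma^2+\zeta^2)\,k^2}{T^2}\right)^{\frac{1}{3}}
\;+\;
\frac{L r_0\, k^2}{T}.
\]
```

The first term shrinks as $k$ grows while the last two grow with $k$; balancing them is what produces the $(\cdot)^{1/5}$ and $(\cdot)^{1/7}$ choices of $k$ on the next slide.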
  30. Proposed Method — By carefully tuning the number of active nodes $k$, we get the following. Convergence Rate of the Proposed Method: if we use a ring as the topology and set $k = \max\left\{ 1, \min\left\{ \left( \frac{T(\sigma^2+\zeta^2)}{L r_0} \right)^{\frac{1}{5}}, \left( \frac{T(\sigma^2+\zeta^2)}{L r_0} \right)^{\frac{1}{7}}, n \right\} \right\}$, then we can achieve $O\left( \sqrt{\frac{L r_0 \sigma^2}{nT}} + \left( \frac{L^2 r_0^2 (\sigma^2+\zeta^2)^{3/4}}{T} \right)^{\frac{4}{7}} + \left( \frac{L^2 r_0^2 (\sigma^2+\zeta^2)^{2/3}}{T} \right)^{\frac{3}{5}} + \frac{L r_0}{T} \right)$, where $r_0 := f(\boldsymbol{x}^{(0)}) - f^\star$. ($n$: #nodes, $T$: #iterations.)
  31. Proposed Method — By carefully tuning the number of active nodes $k$, we get the following. Convergence Rate of the Proposed Method: if we use a ring as the topology and set $k = \max\left\{ 1, \min\left\{ \left( \frac{T(\sigma^2+\zeta^2)}{L r_0} \right)^{\frac{1}{5}}, \left( \frac{T(\sigma^2+\zeta^2)}{L r_0} \right)^{\frac{1}{7}}, n \right\} \right\}$, then we can achieve $O\left( \sqrt{\frac{L r_0 \sigma^2}{nT}} + \left( \frac{L^2 r_0^2 (\sigma^2+\zeta^2)^{3/4}}{T} \right)^{\frac{4}{7}} + \left( \frac{L^2 r_0^2 (\sigma^2+\zeta^2)^{2/3}}{T} \right)^{\frac{3}{5}} + \frac{L r_0}{T} \right)$, where $r_0 := f(\boldsymbol{x}^{(0)}) - f^\star$. The convergence rate consistently improves as $n$ increases!
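For illustration only, the rule for $k$ above can be computed as follows; the function name and the toy constants are mine, and in practice $\sigma^2$, $\zeta^2$, $L$, and $r_0$ would have to be estimated.

```python
def choose_num_active_nodes(T, n, sigma2, zeta2, L, r0):
    """k = max(1, min((T(s^2+z^2)/(L r0))^(1/5), (T(s^2+z^2)/(L r0))^(1/7), n))."""
    base = T * (sigma2 + zeta2) / (L * r0)
    k = max(1.0, min(base ** (1 / 5), base ** (1 / 7), float(n)))
    return max(1, int(k))  # round down to an integer number of active nodes

# Example with made-up constants.
print(choose_num_active_nodes(T=10_000, n=1_000, sigma2=1.0, zeta2=1.0, L=1.0, r0=1.0))
```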
  32. Proposed Method: Research Question 2 — Can we develop an efficient hyperparameter-tuning method? We must tune the number of active nodes $k$ over the range from 1 to $n$, and grid search would require $nT$ iterations. Yes! We can obtain a proper $k$ within $2T$ iterations.
  33. Proposed Method: Efficient Hyperparameter-Tuning Method — ◼ Grid search over $K$ can find a proper $k$. ◼ Moreover, this grid search can run in parallel. Key Lemma: define $K := \{1, 2, 4, 8, \dots, 2^{\lfloor \log_2(n+1) \rfloor - 1}\}$. For any $k^\star < n$, there exists $k \in K$ such that $\frac{k^\star}{4} < k \le k^\star$. Furthermore, it holds that $\sum_{k \in K} k \le n$.
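A quick numerical check of the lemma (my own illustration; `candidate_set` is a hypothetical helper name):

```python
import math

def candidate_set(n: int) -> list[int]:
    """K = {1, 2, 4, ..., 2^(floor(log2(n+1)) - 1)}."""
    return [2 ** j for j in range(math.floor(math.log2(n + 1)))]

for n in (5, 10, 100, 1000):
    K = candidate_set(n)
    assert sum(K) <= n  # all candidates fit on n nodes, so they can run in parallel
    # Every k* < n is approximated within a factor of 4 by some k in K.
    assert all(any(ks / 4 < k <= ks for k in K) for ks in range(1, n))
    print(n, K)
```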
  34. Proposed Method: Efficient Hyperparameter-Tuning Method — Theorem (Informal): this hyperparameter-tuning method achieves exactly the same convergence rate as that with the optimal $k^\star$. 1. Run Teleportation with every $k \in K$ in parallel. 2. Run Teleportation with $k = n$ (full activation). 3. Choose the best parameters. We can obtain a proper $k$ within $2T$ iterations.
  35. Experiment: Synthetic Function — $\sigma^2$: stochastic gradient noise; $\zeta^2$: heterogeneity of the loss functions. (Figure: results on the synthetic function.)
  36. Experiment: Synthetic Function — $\sigma^2$: stochastic gradient noise; $\zeta^2$: heterogeneity of the loss functions. Teleportation converges faster than Decentralized SGD.
  37. Conclusion — We proposed Teleportation: ◼ The convergence rate of Teleportation does not degrade as $n$ increases. ◼ We also proposed an efficient hyperparameter-tuning method. ◼ We numerically demonstrated its effectiveness. (Link: paper.)