Beyond Exponential Graph: Communication Efficient Topology for Decentralized Learning via Finite-time Convergence

1 KYOTO UNIVERSITY Beyond Exponential Graph: Communication Efficient Topology for Decentralized Learning via Finite-time Convergence (NeurIPS 2023)
Yuki Takezawa (1,2), Ryoma Sato (1,2), Han Bao (1,2), Kenta Niwa (3), Makoto Yamada (2)
1: Kyoto Univ., 2: OIST, 3: NTT CS Lab.

2 KYOTO UNIVERSITY Background Decentralized Learning
Training a neural network (NN) in parallel on multiple nodes (e.g., servers, GPUs).
◼ Large-scale machine learning.
◼ Privacy.
[Figure: Decentralized Learning. Each node is a server that has its own (private) training dataset.]

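3 KYOTO UNIVERSITY Background Decentralized Learning
[Figure: a network of five nodes with loss functions $f_1, \dots, f_5$. Each vertex is a node (server), which has a training dataset; each edge means that the two nodes can transmit parameters.]
$f_i(\boldsymbol{x}_i)$: loss function of node $i$. ($\boldsymbol{x}_i$ is the NN's parameter of node $i$.)
Goal of Decentralized Learning: $\inf_{\boldsymbol{x}} \frac{1}{n} \sum_{i=1}^{n} f_i(\boldsymbol{x})$, with the nodes agreeing on a common parameter, $\boldsymbol{x}_1 = \boldsymbol{x}_2 = \cdots$.
※ $n$ is the number of nodes.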

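4 KYOTO UNIVERSITY Background Decentralized SGD (DSGD)
Let $n$ be the number of nodes and let $\boldsymbol{W}$ be an adjacency matrix. $W_{ij}$ is an edge weight, and it is positive iff there exists an edge $(i, j)$ or $i = j$.
Update rule of DSGD: node $i$ updates its NN's parameter $\boldsymbol{x}_i$ as follows:
◼ $\boldsymbol{x}_i^{(r+\frac{1}{2})} = \boldsymbol{x}_i^{(r)} - \eta \nabla f_i(\boldsymbol{x}_i^{(r)})$
◼ Exchange parameters $\boldsymbol{x}_i^{(r+\frac{1}{2})}$ with neighbors.
◼ $\boldsymbol{x}_i^{(r+1)} = \sum_{j=1}^{n} W_{ij} \boldsymbol{x}_j^{(r+\frac{1}{2})}$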
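The update rule on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the scalar parameters, the quadratic losses $f_i(x) = (x - c_i)^2 / 2$, the uniform weight matrix, and the names `dsgd_round`, `targets`, and `grads` are all assumptions made for the example.

```python
from fractions import Fraction

def dsgd_round(x, grads, W, eta):
    """One DSGD round: local gradient step, then weighted averaging with neighbors."""
    n = len(x)
    # Local step: x_i^(r+1/2) = x_i^(r) - eta * grad f_i(x_i^(r))
    half = [x[i] - eta * grads[i](x[i]) for i in range(n)]
    # Mixing step: x_i^(r+1) = sum_j W_ij * x_j^(r+1/2)
    return [sum(W[i][j] * half[j] for j in range(n)) for i in range(n)]

# Hypothetical toy problem: f_i(x) = (x - c_i)^2 / 2, so grad f_i(x) = x - c_i.
targets = [Fraction(0), Fraction(3), Fraction(6)]
grads = [lambda x, c=c: x - c for c in targets]

# Assumed toy topology: complete graph on 3 nodes with uniform weights 1/3.
W = [[Fraction(1, 3)] * 3 for _ in range(3)]

x = [Fraction(0)] * 3
for _ in range(50):
    x = dsgd_round(x, grads, W, Fraction(1, 2))
print([float(v) for v in x])
```

With these toy losses the minimizer of the average loss is the mean of the targets, so every node should approach 3.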

5 KYOTO UNIVERSITY Background What is a “good” network structure?
◼ Communication efficiency (i.e., training speed)
◼ Accuracy
[Figure: Ring, Complete, and Grid topologies.]

6 KYOTO UNIVERSITY Background Communication Efficiency
Communication is the main bottleneck of distributed learning.
◼ The communication cost is determined by the maximum degree.
[Figure: Ring (2), Complete ($n-1$), Grid (4); the number in brackets is the maximum degree.]

7 KYOTO UNIVERSITY Background What is a “good” network structure?
◼ Communication efficiency (i.e., training speed)
◼ Accuracy
[Figure: Ring, Complete, and Grid topologies.]

8 KYOTO UNIVERSITY Background Consensus Rate
◼ How well connected the topology is matters.
◼ How fast information spreads across the nodes matters.
[Figure: Ring, Complete, and Grid topologies.]

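9 KYOTO UNIVERSITY Background Consensus Rate
Problem:
◼ There are $n$ nodes, and node $i$ has parameters $\boldsymbol{x}_i$.
◼ Let $W_{ij}$ be the edge weight. ($W_{ij} > 0$ iff there exists an edge $(i, j)$ or $i = j$.)
◼ Node $i$ updates $\boldsymbol{x}_i$ as $\boldsymbol{x}_i \leftarrow \sum_{j=1}^{n} W_{ij} \boldsymbol{x}_j$.
Question:
◼ How fast does $\boldsymbol{x}_i$ reach $\bar{\boldsymbol{x}} := \frac{1}{n} \sum_{j=1}^{n} \boldsymbol{x}_j$?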
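To make the question concrete, the averaging update can be simulated directly. The sketch below is an illustration under assumptions: the ring uses uniform weights 1/3 (self-loop plus two neighbors), the complete graph uses weights $1/n$, and the helper names `gossip_step`, `consensus_error`, `ring_matrix`, and `complete_matrix` are hypothetical.

```python
from fractions import Fraction

def gossip_step(x, W):
    """Apply x_i <- sum_j W_ij * x_j at every node."""
    n = len(x)
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]

def consensus_error(x):
    """Mean squared distance of the node values from their average."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def ring_matrix(n):
    """Assumed ring weights: 1/3 on the self-loop and on each of the two neighbors."""
    W = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += Fraction(1, 3)
    return W

def complete_matrix(n):
    """Assumed complete-graph weights: 1/n everywhere."""
    return [[Fraction(1, n)] * n for _ in range(n)]

x0 = [Fraction(i) for i in range(8)]
x_ring = x0
for _ in range(5):
    x_ring = gossip_step(x_ring, ring_matrix(8))
x_complete = gossip_step(x0, complete_matrix(8))
print(float(consensus_error(x0)), float(consensus_error(x_ring)), float(consensus_error(x_complete)))
```

In this toy run the complete graph reaches the average in a single step, while the ring only shrinks the error gradually, which is the trade-off the following slides quantify.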

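10 KYOTO UNIVERSITY Background Consensus Rate with $n$ Nodes.
Topology      Consensus Rate $\beta \in [0,1)$ ↓
Ring          $1 - O(1/n^2)$
Torus         $1 - O(1/n)$
Exp. Graph    $1 - O(1/\log_2 n)$
Complete      $0$
[Figure: consensus error $\frac{1}{n} \sum_{i=1}^{n} \| \boldsymbol{x}_i^{(r)} - \bar{\boldsymbol{x}} \|^2$ for each topology.]
※ The number in the bracket is the maximum degree.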

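11 KYOTO UNIVERSITY Background Consensus Rate with $n$ Nodes.
A fast consensus rate (i.e., small $\beta \in [0,1)$) enables Decentralized SGD to achieve high accuracy and a fast convergence rate.
Theorem: Convergence Rate
The parameter $\boldsymbol{x}_i$ generated by Decentralized SGD satisfies $\frac{1}{R+1} \sum_{r=0}^{R} \| \nabla f(\bar{\boldsymbol{x}}^{(r)}) \|^2 \le \epsilon$ after $R = O\left( \frac{1}{n \epsilon^2} + \frac{1}{(1-\beta) \epsilon^{3/2}} \right)$ iterations, where $\bar{\boldsymbol{x}} := \frac{1}{n} \sum_{i} \boldsymbol{x}_i$.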
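As a worked instance of the theorem, the ring's consensus rate from the earlier table can be substituted into the bound. The step below assumes the $O(\cdot)$ in the consensus rate is read as a tight order, so that $1 - \beta$ scales like $1/n^2$ up to constants.

```latex
\beta_{\mathrm{ring}} = 1 - O\!\left(\frac{1}{n^2}\right)
\;\Longrightarrow\;
\frac{1}{1 - \beta_{\mathrm{ring}}} = O(n^2)
\;\Longrightarrow\;
R = O\!\left(\frac{1}{n\epsilon^2} + \frac{n^2}{\epsilon^{3/2}}\right).
```

The second term grows quadratically with the number of nodes, which is why a sparse but slowly mixing topology can hurt the convergence rate.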

12 KYOTO UNIVERSITY Background Consensus Rate
◼ High communication efficiency: small maximum degree
◼ High accuracy/fast convergence rate: fast consensus rate
[Figure: Ring, Complete, and Grid topologies, with the labels "High accuracy/Fast convergence rate" and "High communication efficiency".]

13 KYOTO UNIVERSITY Background Contribution
We propose the Base-(k+1) Graph, which enables Decentralized SGD to achieve a reasonable balance between communication efficiency and accuracy/convergence rate.
※ The number in the bracket is the maximum degree.

14 KYOTO UNIVERSITY Proposed Method

15 KYOTO UNIVERSITY Proposed Method Core Idea: Finite-Time Convergence
The existing topologies converge asymptotically. The proposed topology, the Base-(k+1) Graph, converges in finite time.
[Figure: consensus error $\frac{1}{n} \sum_{i=1}^{n} \| \boldsymbol{x}_i^{(r)} - \bar{\boldsymbol{x}} \|^2$ for each topology.]
※ The number in the bracket is the maximum degree.

16 KYOTO UNIVERSITY Proposed Method Existing Finite-time Convergent Topologies
Topology             Max Degree   #Nodes $n$
1-peer Hypercube     1            A power of 2
1-peer Exp. Graph    1            A power of 2
Base-(k+1) Graph     $k$          Arbitrary number of nodes
◼ The 1-peer Hypercube cannot be constructed when $n$ is not a power of 2.
◼ The 1-peer Exp. Graph is not finite-time convergent when $n$ is not a power of 2.

17 KYOTO UNIVERSITY Existing Finite-Time Convergent Topologies 1-peer Hypercube
$n = 2$
◼ All edge weights are 0.5.
[Figure: a single graph on nodes 1 and 2.]
                     Node 1               Node 2
Initial parameter    $x_1$                $x_2$
After round 1        $\frac{x_1+x_2}{2}$  $\frac{x_1+x_2}{2}$

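18 KYOTO UNIVERSITY Existing Finite-Time Convergent Topologies 1-peer Hypercube
$n = 4$
◼ All edge weights are 0.5.
[Figure: two graphs on nodes 1 to 4, one per round.]
                 Node 1                       Node 2                       Node 3                       Node 4
Init. value      $x_1$                        $x_2$                        $x_3$                        $x_4$
After round 1    $\frac{x_1+x_2}{2}$          $\frac{x_1+x_2}{2}$          $\frac{x_3+x_4}{2}$          $\frac{x_3+x_4}{2}$
After round 2    $\frac{x_1+x_2+x_3+x_4}{4}$  $\frac{x_1+x_2+x_3+x_4}{4}$  $\frac{x_1+x_2+x_3+x_4}{4}$  $\frac{x_1+x_2+x_3+x_4}{4}$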

21 KYOTO UNIVERSITY Existing Finite-Time Convergent Topologies 1-peer Hypercube
$n = 8$
◼ All edge weights are 0.5.
The 1-peer Hypercube is finite-time convergent when $n$ is a power of 2, but it cannot be constructed when $n$ is not a power of 2.
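The finite-time behavior of the 1-peer Hypercube is easy to check numerically. The following sketch assumes the standard pairing in which, in each round, a node exchanges with the node whose index differs in exactly one bit; nodes are 0-indexed here, whereas the slides number them from 1, and the function names are hypothetical.

```python
from fractions import Fraction

def hypercube_rounds(n):
    """Yield, for each round, the partner of every node (0-indexed)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    bit = 1
    while bit < n:
        yield [i ^ bit for i in range(n)]  # flip one bit of the node index
        bit <<= 1

def run_hypercube(x):
    """Run all rounds; every edge weight and self-loop weight is 1/2."""
    for partner in hypercube_rounds(len(x)):
        x = [(x[i] + x[partner[i]]) / 2 for i in range(len(x))]
    return x

x = run_hypercube([Fraction(v) for v in (1, 2, 3, 4, 5, 6, 7, 8)])
print(x)
```

Each node talks to a single peer per round (maximum degree 1), and after $\log_2 n$ rounds all nodes hold the exact average.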

22 KYOTO UNIVERSITY Proposed Method Base-2 Graph
Next, we propose the Base-2 Graph:
◼ It is finite-time convergent for any $n$.
◼ Its maximum degree is 1.
Topology             Max Degree   #Nodes $n$
1-peer Hypercube     1            A power of 2
1-peer Exp. Graph    1            A power of 2
Base-(k+1) Graph     $k$          Arbitrary number of nodes

23 KYOTO UNIVERSITY Proposed Method Core Idea of Simple Base-2 Graph
The core idea is to split the set of nodes into disjoint subsets to which the 1-peer Hypercube is applicable.
◼ $n = 3 = 2 + 1$
◼ $n = 5 = 2^2 + 1$
◼ $n = 7 = 2^2 + 2 + 1$
[Figure: the node sets for the three cases.]

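25 KYOTO UNIVERSITY Proposed Method Simple Base-2 Graph with $n = 3$
[Figure: a sequence of graphs on nodes 1, 2, 3; the labeled edge weights are 2/3, 1/3, and 1/3.]
                 Node 1                   Node 2                    Node 3
Init. value      $x_1$                    $x_2$                     $x_3$
After round 1    $\frac{x_1+x_2}{2}$      $\frac{x_1+x_2}{2}$       $x_3$
After round 2    $\frac{x_1+x_2}{2}$      $\frac{x_1+x_2+4x_3}{6}$  $\frac{x_1+x_2+x_3}{3}$
After round 3    $\frac{x_1+x_2+x_3}{3}$  $\frac{x_1+x_2+x_3}{3}$   $\frac{x_1+x_2+x_3}{3}$
※ edge weight 0.5 is omitted.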

The final value at every node is the average $\frac{x_1+x_2+x_3}{3}$.
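The $n = 3$ table can be reproduced with exact arithmetic. The mixing matrices below are not given explicitly on the slide; they are inferred from the table values and the labeled weights (2/3 on the edge between nodes 2 and 3, 1/3 on their self-loops), so treat them as an assumption. The names `W1`, `W2`, `W3`, and `mix` are hypothetical.

```python
from fractions import Fraction as F

# Round 1 and round 3: nodes 1 and 2 average with weight 1/2; node 3 is idle.
W1 = [[F(1, 2), F(1, 2), F(0)],
      [F(1, 2), F(1, 2), F(0)],
      [F(0),    F(0),    F(1)]]
# Round 2: nodes 2 and 3 exchange with edge weight 2/3 and self-loop weight 1/3.
W2 = [[F(1), F(0),    F(0)],
      [F(0), F(1, 3), F(2, 3)],
      [F(0), F(2, 3), F(1, 3)]]
W3 = W1

def mix(W, x):
    """Apply x_i <- sum_j W_ij * x_j at every node."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

x = [F(3), F(6), F(12)]  # arbitrary initial values with average 7
history = [x]
for W in (W1, W2, W3):
    x = mix(W, x)
    history.append(x)
for row in history:
    print([str(v) for v in row])
```

The unequal weights in the second round compensate for the fact that node 2 already carries the information of two nodes while node 3 carries only its own.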

Simple Base-2 Graph with $n = 5 = 2^2 + 1$
[Figure: a sequence of graphs on nodes 1 to 5; the labeled edge weights are 4/5, 1/5, and 1/5.]
※ edge weight 0.5 is omitted.
Average is $\frac{x_1+x_2+x_3+x_4+x_5}{5}$.

35 KYOTO UNIVERSITY Proposed Method Brief Summary
We propose the Simple Base-2 Graph:
◼ Its maximum degree is only 1.
◼ It is finite-time convergent for any $n$.
Next, we propose the Simple Base-(k+1) Graph:
◼ Its maximum degree is $k$.
◼ It is finite-time convergent for any $n$.

36 KYOTO UNIVERSITY Proposed Method Review: Core Idea of Simple Base-2 Graph
The core idea is to split the set of nodes into disjoint subsets to which the 1-peer Hypercube is applicable.
◼ $n = 3 = 2 + 1$
◼ $n = 5 = 2^2 + 1$
◼ $n = 7 = 2^2 + 2 + 1$
[Figure: the node sets for the three cases.]
We need to extend the 1-peer Hypercube to the k-peer setting.

37 KYOTO UNIVERSITY Proposed Method k-peer Hyper-hypercube
The 1-peer Hypercube can be constructed when $n$ is a power of 2.
◼ $n$ is a power of 2 ⇔ no prime factor of $n$ is larger than 2.
[Figure: the 1-peer Hypercube on nodes 1 to 4.]
We propose the k-peer Hyper-hypercube, which can be constructed when no prime factor of $n$ is larger than $k + 1$.

38 KYOTO UNIVERSITY Proposed Method k-peer Hyper-hypercube
◼ Case with $k = 2$ and $n = 6 = 2 \times 3$
◼ Case with $k = 2$ and $n = 9 = 3 \times 3$
[Figure: the graphs for both cases, with the edge-weight labels 1/2, 1/3, and 1/3; self-loops are omitted.]
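One way to realize such a topology is to factor $n$ into primes and, in each round, let groups of $p$ nodes average with uniform weight $1/p$. The sketch below follows that idea using a mixed-radix numbering of 0-indexed nodes; it is an assumed construction written for illustration and may differ in detail from the one in the paper. All function names are hypothetical.

```python
from fractions import Fraction

def small_factors(n, k):
    """Factor n into primes, requiring every prime factor to be at most k + 1."""
    factors, p = [], 2
    while n > 1:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    assert all(f <= k + 1 for f in factors), "a prime factor exceeds k + 1"
    return factors

def hyper_hypercube_rounds(n, k):
    """For each round, list the groups of nodes (0-indexed) that average together."""
    rounds, stride = [], 1
    for p in small_factors(n, k):
        groups = []
        for i in range(n):
            if (i // stride) % p == 0:  # i is the group member whose digit is 0
                groups.append([i + d * stride for d in range(p)])
        rounds.append(groups)
        stride *= p
    return rounds

def run(x, k):
    """Run all rounds; inside a group of size p every weight is 1/p."""
    x = list(x)
    for groups in hyper_hypercube_rounds(len(x), k):
        for g in groups:
            avg = sum(x[i] for i in g) / len(g)
            for i in g:
                x[i] = avg
    return x

result = run([Fraction(v) for v in range(6)], k=2)
print(result)
```

For $n = 6$ this gives one round of pairs (weight 1/2) and one round of triples (weight 1/3), so no node ever has more than $k = 2$ neighbors.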

39 KYOTO UNIVERSITY Proposed Method Review: Core Idea of Simple Base-2 Graph
The core idea is to split the set of nodes into disjoint subsets to which the 1-peer Hypercube is applicable.
◼ $n = 3 = 2 + 1$
◼ $n = 5 = 2^2 + 1$
◼ $n = 7 = 2^2 + 2 + 1$
[Figure: the node sets for the three cases.]
The subset sizes follow the binary representation (base-2 number) of $n$.

The core idea is to split the set of nodes into disjoint subsets to which the 2-peer Hyper-hypercube is applicable.
◼ $n = 3$
◼ $n = 5 = 3 + 2$
◼ $n = 7 = 2 \times 3 + 1$
[Figure: the node sets for the three cases.]
The subset sizes follow the base-3 number of $n$.
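The subset sizes on these slides can be read off the base-$(k+1)$ digits of $n$. The helper below is a small sketch of that decomposition only (not of the full graph construction); the names `base_decomposition` and `subset_sizes` are hypothetical, and $k = 1$ and $k = 2$ correspond to the base-2 and base-3 cases shown above.

```python
def base_decomposition(n, k):
    """Split n into terms a * (k + 1)**p with 1 <= a <= k, largest term first."""
    terms, power = [], 1
    while n > 0:
        digit = n % (k + 1)
        if digit:
            terms.append((digit, power))
        n //= k + 1
        power *= k + 1
    return terms[::-1]

def subset_sizes(n, k):
    """Sizes of the disjoint node subsets implied by the base-(k+1) digits of n."""
    return [a * p for a, p in base_decomposition(n, k)]

print(subset_sizes(7, 1))  # base 2
print(subset_sizes(7, 2))  # base 3
```

Every size has the form $a (k+1)^p$ with $a \le k$, so all of its prime factors are at most $k + 1$, which is the condition under which the k-peer Hyper-hypercube can be constructed.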

[Figure: a sequence of three graphs on nodes 1 to 5 for $n = 5$.]
※ self-loops are omitted.
Average is $\frac{x_1+x_2+x_3+x_4+x_5}{5}$.

45 KYOTO UNIVERSITY Proposed Method Simple Base-(k+1) Graph vs Base-(k+1) Graph
◼ Using an additional technique, we can reduce the length (the number of graphs in the sequence) of the Simple Base-(k+1) Graph.
※ The number in the bracket is the maximum degree.

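46 KYOTO UNIVERSITY Proposed Method Experiments
[Figure: consensus error $\frac{1}{n} \sum_{i=1}^{n} \| \boldsymbol{x}_i^{(r)} - \bar{\boldsymbol{x}} \|^2$ for each topology.]
※ The number in the bracket is the maximum degree.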

47 KYOTO UNIVERSITY Decentralized SGD on Base-(k+1) Graph

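48 KYOTO UNIVERSITY Proposed Method Decentralized SGD on Base-(k+1) Graph
Let $\boldsymbol{W}^{(1)}, \cdots, \boldsymbol{W}^{(m)}$ be the adjacency matrices of the Base-(k+1) Graph. Node $i$ updates its parameter $\boldsymbol{x}_i$ as follows:
$\boldsymbol{x}_i^{(r+1)} = \sum_{j=1}^{n} W_{ij}^{(1 + \mathrm{mod}(r, m))} \left( \boldsymbol{x}_j^{(r)} - \eta \nabla f_j(\boldsymbol{x}_j^{(r)}) \right)$
[Figure: a sequence of three graphs on nodes 1, 2, 3.]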
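The only change from plain DSGD is that the mixing matrix cycles through the sequence. A minimal sketch follows, assuming scalar parameters, hypothetical quadratic losses $f_j(x) = (x - c_j)^2 / 2$, and a three-matrix cycle for $n = 3$ whose entries are inferred from the $n = 3$ example rather than taken from the paper.

```python
from fractions import Fraction as F

def dsgd_time_varying(x, grads, Ws, eta, rounds):
    """DSGD where round r mixes with Ws[r % m], i.e. W^(1 + mod(r, m)) in 1-indexed form."""
    n, m = len(x), len(Ws)
    for r in range(rounds):
        W = Ws[r % m]
        # Local gradient step at every node j.
        half = [x[j] - eta * grads[j](x[j]) for j in range(n)]
        # Mixing with this round's matrix.
        x = [sum(W[i][j] * half[j] for j in range(n)) for i in range(n)]
    return x

# Assumed cycle for n = 3: nodes 1-2 (weight 1/2), nodes 2-3 (weight 2/3), nodes 1-2 again.
W_a = [[F(1, 2), F(1, 2), F(0)], [F(1, 2), F(1, 2), F(0)], [F(0), F(0), F(1)]]
W_b = [[F(1), F(0), F(0)], [F(0), F(1, 3), F(2, 3)], [F(0), F(2, 3), F(1, 3)]]
Ws = [W_a, W_b, W_a]

# Hypothetical losses f_j(x) = (x - c_j)^2 / 2 with gradient x - c_j.
grads = [lambda x, c=c: x - c for c in (F(0), F(3), F(6))]

# With eta = 0 the update is pure averaging: one cycle reaches the exact mean.
averaged = dsgd_time_varying([F(3), F(6), F(12)], grads, Ws, F(0), rounds=3)
# With eta > 0 the node average performs gradient descent on the mean loss.
trained = dsgd_time_varying([F(0)] * 3, grads, Ws, F(1, 10), rounds=300)
print(averaged, float(sum(trained) / 3))
```

Because every matrix in the cycle is doubly stochastic, the average over nodes evolves like ordinary gradient descent, while the finite-time property keeps the individual nodes close to that average.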

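Topology                   Max Degree ↓   Order of $R$ ↓
Ring                       2              $O\left( \frac{1}{n\epsilon^2} + \frac{n^2}{\epsilon^{3/2}} \right)$
Torus                      4              $O\left( \frac{1}{n\epsilon^2} + \frac{n}{\epsilon^{3/2}} \right)$
Exp. Graph                 $\log_2 n$     $O\left( \frac{1}{n\epsilon^2} + \frac{\log_2 n}{\epsilon^{3/2}} \right)$
Base-(k+1) Graph (ours)    $k$            $O\left( \frac{1}{n\epsilon^2} + \frac{\log_{k+1} n}{\epsilon^{3/2}} \right)$
DSGD satisfies $\frac{1}{R+1} \sum_{r=0}^{R} \| \nabla f(\bar{\boldsymbol{x}}^{(r)}) \|^2 \le \epsilon$ after $R$ iterations.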

Topology                     Consensus Rate ↑       Max Degree ↓   Convergence Rate ↓
Exponential Graph            $1 - O(1/\log_2 n)$    $\log_2 n$     $O\left( \frac{1}{n\epsilon^2} + \frac{\log_2 n}{\epsilon^{3/2}} \right)$
Base-$(k+1)$ Graph (ours)    N/A                    $k$            $O\left( \frac{1}{n\epsilon^2} + \frac{\log_{k+1} n}{\epsilon^{3/2}} \right)$
Exp. Graph vs Base-2 Graph
◼ Same convergence rate and better communication efficiency
Exp. Graph vs Base-$(k+1)$ Graph with $2 \le k < \log_2 n$
◼ Faster convergence rate and better communication efficiency
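To get a feel for these orders, the topology-dependent factor in front of $1/\epsilon^{3/2}$ can be evaluated for a concrete $n$. The sketch below drops all constants hidden by the $O(\cdot)$ notation, so the numbers are only indicative; `topology_factor` is a hypothetical helper, and $n = 25$ is chosen to match the number of nodes used in the experiments.

```python
import math

def topology_factor(topology, n, k=None):
    """Topology-dependent factor in front of 1/eps^(3/2), with constants dropped."""
    if topology == "ring":
        return n ** 2
    if topology == "torus":
        return n
    if topology == "exp":
        return math.log2(n)
    if topology == "base":
        return math.log(n, k + 1)  # log_{k+1} n
    raise ValueError(topology)

n = 25
for name, k in [("ring", None), ("torus", None), ("exp", None), ("base", 1), ("base", 4)]:
    print(name, k, round(topology_factor(name, n, k), 2))
```

With $k = 1$ the factor equals that of the exponential graph at maximum degree 1, and with $k = 4$ it is smaller while the maximum degree stays below $\log_2 25$.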

51 KYOTO UNIVERSITY Experiments

52 KYOTO UNIVERSITY Experiments
Model: VGG
Datasets: Fashion MNIST, CIFAR-10, CIFAR-100
#Nodes: 25
We conduct experiments in both i.i.d. and non-i.i.d. settings.
[Figure: panels for the non-i.i.d. setting and the i.i.d. setting.]

53 KYOTO UNIVERSITY Experiments Results on non-i.i.d. setting
◼ $n = 25$
◼ The number in the bracket is the maximum degree.

54 KYOTO UNIVERSITY Experiments Results on i.i.d. setting
◼ $n = 25$
◼ The number in the bracket is the maximum degree.

55 KYOTO UNIVERSITY Experiments Results of CIFAR-10 with non-i.i.d. Setting
◼ The Base-2 Graph outperforms the 1-peer Exp. Graph.
◼ The Base-{3,4,5} Graphs outperform the 1-peer Exp. Graph and the Exp. Graph.

56 KYOTO UNIVERSITY Experiments Results with Other Decentralized Learning Methods
◼ The Base-2 Graph is comparable to the 1-peer exponential graph.
◼ The Base-5 Graph outperforms the exponential graph.

57 KYOTO UNIVERSITY Conclusion
We propose the Base-(k+1) Graph:
◼ Finite-time convergence for any $n$ and $k$.
◼ Theoretically: faster convergence rate and lower communication cost than the Exp. Graph.
◼ Experimentally: a reasonable balance between accuracy and communication efficiency.
