Slide 1

Slide 1 text

Scaling Distributed Machine Learning with the Parameter Server
Presented by: Liang Gong, CS294-110 Class Presentation, Fall 2015
Liang Gong, Electrical Engineering & Computer Science, University of California, Berkeley.
Paper authors: Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su

Slide 2

Slide 2 text

Machine Learning in Industry
• Large training datasets (1 TB to 1 PB)
• Complex models (10^9 to 10^12 parameters)
• → ML must be done in a distributed environment
• Challenges:
  • Many machine learning algorithms are designed for sequential execution
  • Machines can fail and jobs can be preempted

Slide 3

Slide 3 text

Motivation
Balance the need for performance, flexibility, and generality of machine learning algorithms against the simplicity of systems design.
How to:
• Distribute the workload
• Share the model among all machines
• Parallelize sequential algorithms
• Reduce communication cost

Slide 4

Slide 4 text

Main Idea of the Parameter Server
• Server nodes manage the parameters.
• Worker nodes compute updates (training) to the parameters based on their part of the training dataset.
• The parameter updates computed on each worker node are pushed to the server and aggregated there.
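To make this concrete, below is a minimal single-process sketch of the idea, assuming a least-squares model and toy names (ServerNode, worker_step) that are illustrative only, not the paper's actual interface: the server holds w and aggregates pushed updates, and each worker computes a gradient on its own shard of the data.

```python
import numpy as np

class ServerNode:
    """Toy parameter server: stores the model w and aggregates pushed updates."""

    def __init__(self, num_params, lr=0.5):
        self.w = np.zeros(num_params)
        self.lr = lr
        self.pending = []                      # updates pushed this iteration

    def pull(self):
        return self.w.copy()                   # workers fetch the current model

    def push(self, grad):
        self.pending.append(grad)              # workers send their local updates

    def apply_updates(self):
        if self.pending:                       # aggregate, then update w
            self.w -= self.lr * np.mean(self.pending, axis=0)
            self.pending.clear()

def worker_step(server, X_shard, y_shard):
    """One worker: pull w, compute a gradient on the local shard, push it back."""
    w = server.pull()
    grad = X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)
    server.push(grad)

# Synthetic least-squares problem whose data is split across three workers.
rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(300, 20)), rng.normal(size=20)
y = X @ w_true
server = ServerNode(num_params=20)
shards = list(zip(np.array_split(X, 3), np.array_split(y, 3)))
for _ in range(100):
    for X_s, y_s in shards:
        worker_step(server, X_s, y_s)
    server.apply_updates()
print(np.linalg.norm(server.w - w_true))       # close to zero after training
```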

Slide 5

Slide 5 text

A Simple Example
• Server node

Slide 6

Slide 6 text

A Simple Example
• Server node + worker nodes

Slide 7

Slide 7 text

A Simple Example
• Server node + worker nodes
• Server node: all parameters

Slide 8

Slide 8 text

A Simple Example
• Server node + worker nodes
• Server node: all parameters
• Worker node: owns part of the training data

Slide 9

Slide 9 text

A Simple Example
• Server node + worker nodes
• Server node: all parameters
• Worker node: owns part of the training data
• Operates in iterations

Slide 10

Slide 10 text

A Simple Example
• Server node + worker nodes
• Server node: all parameters
• Worker node: owns part of the training data
• Operates in iterations
• Worker nodes pull the updated w

Slide 11

Slide 11 text

A Simple Example
• Server node + worker nodes
• Server node: all parameters
• Worker node: owns part of the training data
• Operates in iterations
• Worker nodes pull the updated w
• Worker node computes updates to w (local training)

Slide 12

Slide 12 text

A Simple Example
• Server node + worker nodes
• Server node: all parameters
• Worker node: owns part of the training data
• Operates in iterations
• Worker nodes pull the updated w
• Worker node computes updates to w (local training)
• Worker node pushes updates to the server node

Slide 13

Slide 13 text

A Simple Example
• Server node + worker nodes
• Server node: all parameters
• Worker node: owns part of the training data
• Operates in iterations
• Worker nodes pull the updated w
• Worker node computes updates to w (local training)
• Worker node pushes updates to the server node
• Server node updates w

Slide 22

Slide 22 text

A Simple Example
• 100 nodes → 7.8% of w is used on one node (on average)
• 1,000 nodes → 0.15% of w is used on one node (on average)
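The exact percentages above come from the paper's workload, but the effect is easy to reproduce: when a sparse dataset is split across more workers, each worker's shard touches a smaller fraction of all features, so each worker only ever needs that slice of w. A toy sketch on synthetic sparse data (all sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
num_features = 10_000
total_examples = 10_000
nnz_per_example = 20                     # each example is sparse

# Feature indices used by each example in a synthetic sparse dataset.
examples = [rng.integers(0, num_features, size=nnz_per_example)
            for _ in range(total_examples)]

def avg_fraction_of_w_needed(num_workers):
    """Split examples evenly across workers; return the average fraction of
    all features (entries of w) that a single worker's shard touches."""
    fractions = []
    for shard in np.array_split(np.arange(total_examples), num_workers):
        touched = set()
        for i in shard:
            touched.update(examples[i].tolist())
        fractions.append(len(touched) / num_features)
    return float(np.mean(fractions))

for n in (10, 100, 1000):
    print(n, avg_fraction_of_w_needed(n))  # fraction shrinks as workers grow
```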

Slide 23

Slide 23 text

Architecture

Slide 24

Slide 24 text

Architecture
Server manager: maintains the liveness status and the parameter-key partition of the server nodes.

Slide 25

Slide 25 text

Architecture
The server nodes partition the parameter keys among themselves using consistent hashing.
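A minimal sketch of how such a partition might look (illustrative only; the real system works in key ranges with multiple virtual points per server, and names such as ConsistentHashRing are assumptions): each server is hashed onto a ring several times, and a parameter key is owned by the first server at or after the key's hash.

```python
import bisect
import hashlib

def stable_hash(value):
    """Deterministic 64-bit hash (Python's built-in hash is salted per process)."""
    return int.from_bytes(hashlib.md5(str(value).encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Map parameter keys to server nodes. Adding or removing a server only
    moves the keys that fall on the affected arcs of the ring."""

    def __init__(self, servers, points_per_server=64):
        self.ring = sorted(
            (stable_hash(f"{s}#{i}"), s)
            for s in servers
            for i in range(points_per_server)
        )
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, key):
        idx = bisect.bisect(self.positions, stable_hash(key)) % len(self.ring)
        return self.ring[idx][1]

# Partition parameter keys 0..9999 over three server nodes.
ring = ConsistentHashRing(["server-0", "server-1", "server-2"])
counts = {}
for key in range(10_000):
    counts[ring.owner(key)] = counts.get(ring.owner(key), 0) + 1
print(counts)   # roughly balanced load across the three servers
```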

Slide 26

Slide 26 text

Architecture
Worker nodes communicate only with the server nodes, not with other worker nodes.

Slide 27

Slide 27 text

Architecture
Updates are replicated to slave server nodes synchronously.

Slide 29

Slide 29 text

Architecture
Optimization: replicate after aggregation, so the slave nodes receive one aggregated update per iteration instead of one per worker push.

Slide 30

Slide 30 text

Data Transmission / Calling
• The shared parameters are represented as (key, value) vectors.
• Data is sent by pushing and pulling key ranges.
• Tasks are issued by RPC and executed asynchronously: the caller continues without waiting for a return from the callee.
• The caller can specify dependencies between callees.
• Supported consistency models: Sequential, Eventual, and 1-Bounded Delay.
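A rough sketch of what asynchronous push/pull with an explicit dependency could look like from the caller's side, using Python futures purely for illustration (the push, pull, and store names here are invented, not the system's API):

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
store = {}                                   # stand-in for a server's key-value state

def push(key_range, values):
    """Asynchronously send (key, value) pairs; the caller gets a future back."""
    return executor.submit(lambda: store.update(zip(key_range, values)))

def pull(key_range, after=None):
    """Asynchronously read a key range, optionally only after another task."""
    def task():
        if after is not None:
            after.result()                   # dependency: wait for `after` first
        return [store.get(k, 0.0) for k in key_range]
    return executor.submit(task)

push_future = push(range(4), [0.1, 0.2, 0.3, 0.4])   # caller does not block here
pull_future = pull(range(4), after=push_future)      # runs after the push finishes
print(pull_future.result())                          # [0.1, 0.2, 0.3, 0.4]
```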

Slide 31

Slide 31 text

Trade-off: Asynchronous Call
• 1000 machines: 800 workers, 200 parameter servers.
• 16 physical cores, 192 GB DRAM, 10 Gb Ethernet per machine.

Slide 32

Slide 32 text

Trade-off: Asynchronous Call
• 1000 machines: 800 workers, 200 parameter servers.
• 16 physical cores, 192 GB DRAM, 10 Gb Ethernet per machine.
Asynchronous updates require more iterations to reach the same objective value.

Slide 33

Slide 33 text

Assumptions
• It is OK to lose part of the training dataset.
  → It is not urgent to recover a failed worker node.
  → Recovering a failed server node is critical.
• An approximate solution is good enough.
  → Limited inaccuracy is tolerable.
  → Relaxed consistency is acceptable (as long as the algorithm converges).

Slide 34

Slide 34 text

Optimizations
• Message compression → saves bandwidth.
• Aggregate parameter changes before synchronous replication on the server node.
• The key list of a parameter update is likely to be the same as in the last iteration, e.g. <1, 3>, <2, 4>, <6, 7.5>, <7, 4.5> ... → cache the list and send only a hash of it.
• Filter before transmission, e.g. drop gradient updates smaller than a threshold.
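A small sketch of the last two optimizations combined, under the assumption of a sender-side cache keyed by a hash of the key list (function and field names are illustrative, not the system's API): near-zero gradient entries are dropped, and when the surviving key list matches one sent before, only its hash and the values go on the wire.

```python
import hashlib
import numpy as np

def prepare_push(keys, grads, sent_key_lists, threshold=1e-6):
    """Build a compact push message: filter out insignificant gradient entries,
    and omit the key list when the receiver has already cached an identical one."""
    keep = np.abs(grads) >= threshold                  # filter before transmission
    keys, grads = keys[keep], grads[keep]
    digest = hashlib.sha1(keys.tobytes()).hexdigest()  # fingerprint of the key list
    if digest in sent_key_lists:
        return {"key_hash": digest, "values": grads}   # keys omitted, hash suffices
    sent_key_lists.add(digest)
    return {"key_hash": digest, "keys": keys, "values": grads}

# Consecutive iterations usually touch the same keys, so only the first push
# carries the full key list; later pushes carry the hash and the values.
cache = set()
keys = np.arange(6, dtype=np.int64)
first = prepare_push(keys, np.array([0.5, 0.2, 0.1, 0.3, 0.4, 0.6]), cache)
second = prepare_push(keys, np.array([0.4, 0.1, 0.2, 0.2, 0.5, 0.7]), cache)
assert "keys" in first and "keys" not in second
```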

Slide 35

Slide 35 text

Network Saving
• 1000 machines: 800 workers, 200 parameter servers.
• 16 physical cores, 192 GB DRAM, 10 Gb Ethernet per machine.

Slide 36

Slide 36 text

Trade-offs
• Consistency model vs. computing time + waiting time
• Sequential Consistency (τ = 0)
• Eventual Consistency (τ = ∞)
• 1-Bounded Delay Consistency (τ = 1)
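One way to picture the τ knob is as the set of earlier iterations an iteration must wait for; the sketch below is illustrative (the function name is made up), but the dependency pattern matches the three models on the slide.

```python
import math

def finished_before_start(t, tau):
    """Iterations that must have completed before iteration t may start.
    tau = 0   -> every previous iteration (sequential / fully synchronous);
    tau = 1   -> everything except the immediately preceding iteration;
    tau = inf -> nothing (eventual consistency, fully asynchronous)."""
    if math.isinf(tau):
        return range(0)
    return range(max(0, t - tau))

# Dependencies of iteration 5 under the three consistency models:
print(list(finished_before_start(5, 0)))         # [0, 1, 2, 3, 4]
print(list(finished_before_start(5, 1)))         # [0, 1, 2, 3]
print(list(finished_before_start(5, math.inf)))  # []
```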

Slide 37

Slide 37 text

Discussions
• Feature selection? Sampling?
• Trillions of features and trillions of examples in the training dataset → overfitting?
• Should each worker do multiple iterations before a push?
• Diversify the labels assigned to each node → random partitioning?
• If a worker only pushes trivial parameter changes, its part of the training dataset is probably not very useful → remove it and re-partition.
• A hierarchy of server nodes?

Slide 38

Slide 38 text

Assumption / Problem
Fact: The total size of the parameters (features) may exceed the capacity of a single machine.

Slide 39

Slide 39 text

Assumption / Problem
Fact: The total size of the parameters (features) may exceed the capacity of a single machine.
Assumption: Each instance in the training set only contains a small portion of all features.

Slide 40

Slide 40 text

Assumption / Problem
Fact: The total size of the parameters (features) may exceed the capacity of a single machine.
Assumption: Each instance in the training set only contains a small portion of all features.
Problem: What if one example contains 90% of the features (with trillions of features in total)?

Slide 43

Slide 43 text

Sketch-Based Machine Learning Algorithms
• Sketches are a class of data stream summaries.
• Problem: an unbounded number of data items arrives continuously, whereas memory capacity is limited.
• Every item is seen only once.
• Approach: sketches are typically formed by linear projections of the source data with appropriate (pseudo-)random vectors.
• Goal: use a small amount of memory to answer interesting queries with strong precision guarantees.
http://web.engr.illinois.edu/~vvnktrm2/talks/sketch.pdf
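As one concrete example of such a summary (the count-min sketch, chosen here for illustration; the linked talk covers a broader family), here is a minimal implementation that keeps approximate counts of stream items in fixed memory, seeing each item only once:

```python
import random

class CountMinSketch:
    """Fixed-memory approximate counter for a data stream: depth hash rows of
    width counters each; estimates never undercount, and overcount only when
    hash collisions occur."""

    def __init__(self, width=2048, depth=5, seed=42):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _col(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):        # one pass: each item seen once
        for row in range(self.depth):
            self.table[row][self._col(row, item)] += count

    def query(self, item):
        return min(self.table[row][self._col(row, item)]
                   for row in range(self.depth))

# Count token frequencies from a stream without storing the stream itself.
cms = CountMinSketch()
for token in ["a", "b", "a", "c", "a", "b"]:
    cms.update(token)
print(cms.query("a"), cms.query("b"), cms.query("z"))   # 3 2 0 (upper bounds)
```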

Slide 44

Slide 44 text

x x Assumption / Problem Liang Gong, Electric Engineering & Computer Science, University of California, Berkeley. 44 Assumption: It is OK to calculate updates for models on each portion of data separately and aggregate the updates. Problem: What about clustering and other ML/DM algorithms? x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x