
Scaling Distributed Machine Learning with the Parameter Server

Liang Gong
April 01, 2017


Presented by Liang Gong in CS294-110 (Big Data System Research: Trends and Challenges) at UC Berkeley in Fall 2015. The instructor was Ion Stoica.


Transcript

  1. Scaling Distributed Machine Learning with the Parameter Server
     Presented by: Liang Gong, Electrical Engineering & Computer Science, University of California, Berkeley. CS294-110 Class Presentation, Fall 2015.
     Paper authors: Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su
  2. Machine Learning in Industry
     • Large training datasets (1 TB to 1 PB)
     • Complex models (10^9 to 10^12 parameters)
     → ML must be done in a distributed environment
     • Challenges:
       • Many machine learning algorithms are designed for sequential execution
       • Machines can fail and jobs can be preempted
  3. Motivation
     Balance the performance, flexibility, and generality of machine learning algorithms against the simplicity of systems design. How to:
     • Distribute the workload
     • Share the model among all machines
     • Parallelize sequential algorithms
     • Reduce communication cost
  4. Main Idea of the Parameter Server
     • Server nodes manage the parameters
     • Worker nodes are responsible for computing updates (training) for the parameters based on their part of the training dataset
     • Parameter updates derived on each worker node are pushed to and aggregated on the server
  5. A Simple Example
     • Server node
  6. A Simple Example
     • Server node + worker nodes
  7. A Simple Example
     • Server node + worker nodes
     • Server node: all parameters
  8. A Simple Example
     • Server node + worker nodes
     • Server node: all parameters
     • Worker node: owns part of the training data
  9. A Simple Example
     • Server node + worker nodes
     • Server node: all parameters
     • Worker node: owns part of the training data
     • Operates in iterations
  10. A Simple Example
     • Server node + worker nodes
     • Server node: all parameters
     • Worker node: owns part of the training data
     • Operates in iterations
     • Worker nodes pull the updated w
  11. A Simple Example
     • Server node + worker nodes
     • Server node: all parameters
     • Worker node: owns part of the training data
     • Operates in iterations
     • Worker nodes pull the updated w
     • Worker node computes updates to w (local training)
  12. A Simple Example
     • Server node + worker nodes
     • Server node: all parameters
     • Worker node: owns part of the training data
     • Operates in iterations
     • Worker nodes pull the updated w
     • Worker node computes updates to w (local training)
     • Worker node pushes updates to the server node
  13. A Simple Example
     • Server node + worker nodes
     • Server node: all parameters
     • Worker node: owns part of the training data
     • Operates in iterations
     • Worker nodes pull the updated w
     • Worker node computes updates to w (local training)
     • Worker node pushes updates to the server node
     • Server node updates w
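
The iteration listed above can be summarized in a few lines of code. The following is a minimal single-process Python sketch, not the paper's actual (C++) implementation; the class and method names (ServerNode, WorkerNode, pull, push_and_update) are invented for illustration. Workers pull the current w, compute a gradient on their local data shard, and push it; the server aggregates the pushes and updates w.

import numpy as np

class ServerNode:
    def __init__(self, num_features, lr=0.1):
        self.w = np.zeros(num_features)      # server node holds all parameters
        self.lr = lr

    def pull(self):
        return self.w.copy()                 # workers pull the updated w

    def push_and_update(self, updates):
        # aggregate the pushed updates, then update w once
        self.w -= self.lr * np.mean(updates, axis=0)

class WorkerNode:
    def __init__(self, X, y):
        self.X, self.y = X, y                # worker owns part of the training data

    def compute_update(self, w):
        # local training: gradient of the squared loss on the local shard
        return self.X.T @ (self.X @ w - self.y) / len(self.y)

rng = np.random.default_rng(0)
server = ServerNode(num_features=5)
workers = [WorkerNode(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(4)]

for _ in range(10):                          # operates in iterations
    w = server.pull()
    server.push_and_update([wk.compute_update(w) for wk in workers])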
  14.-21. A Simple Example (diagram animation slides; bullet text identical to slide 13)
  22. A Simple Example
     • 100 nodes → 7.8% of w is used on one node (avg)
     • 1000 nodes → 0.15% of w is used on one node (avg)
  23. Architecture
     • Server manager: liveness and parameter partition of the server nodes
  24. Architecture
     • All server nodes partition the parameter keys with consistent hashing.
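
A rough sketch of partitioning parameter keys with consistent hashing (illustrative only; the real system also places virtual server nodes on the ring and replicates each key range to neighboring servers):

import bisect
import hashlib

def h(value: str) -> int:
    # stable hash onto the ring
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        self.ring = sorted((h(s), s) for s in servers)
        self.points = [p for p, _ in self.ring]

    def owner(self, key: str) -> str:
        # the first server clockwise from the key's hash owns that key
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["server-0", "server-1", "server-2"])
print(ring.owner("w[42]"))    # server node responsible for parameter key 42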
  25. Architecture
     • Worker node: communicates only with its server node
  26.-27. Architecture
     • Updates are replicated to the slave server nodes synchronously.
  28. Architecture
     • Optimization: replication after aggregation
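
A hypothetical sketch of this optimization (all names invented for illustration): the master server collects the pushes from its workers for an iteration, aggregates them into a single update, and only then replicates to its slave servers, so k worker pushes cost one replication instead of k.

class MasterServer:
    def __init__(self, keys, replicas):
        self.params = {k: 0.0 for k in keys}
        self.replicas = replicas       # slave server nodes holding copies
        self.pending = []              # worker pushes received this iteration

    def push(self, update):            # called once per worker push
        self.pending.append(update)

    def finish_iteration(self):
        # aggregate all pending pushes into one combined update ...
        for update in self.pending:
            for key, delta in update.items():
                self.params[key] += delta
        self.pending.clear()
        # ... then replicate the aggregated state once, not once per push
        for replica in self.replicas:
            replica.update(self.params)

replica_a, replica_b = {}, {}
master = MasterServer(keys=[0, 1, 2], replicas=[replica_a, replica_b])
master.push({0: 0.1, 2: -0.2})         # worker 1
master.push({1: 0.3, 2: 0.1})          # worker 2
master.finish_iteration()              # two pushes, one replication round
print(replica_a)                       # the slave copy now holds the aggregated parameters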
  29. Data Transmission / Calling
     • The shared parameters are represented as <key, value> vectors.
     • Data is sent by pushing and pulling key ranges.
     • Tasks are issued by RPC.
     • Tasks are executed asynchronously.
       • The caller executes without waiting for a return from the callee.
       • The caller can specify dependencies between callees.
     (Diagram: Sequential Consistency, Eventual Consistency, 1-Bounded-Delay Consistency)
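
The asynchronous-call model can be sketched with ordinary Python futures (a stand-in; the system's own RPC layer and call names are not shown in the deck): the caller issues a task and keeps executing, and it can make one task depend on another by having the second wait on the first's result.

from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def pull(key_range):
    # stand-in for pulling a key range from the server nodes
    return {k: 0.0 for k in key_range}

def compute_and_push(weights):
    # stand-in for local training plus pushing the resulting update
    return {k: v + 0.1 for k, v in weights.items()}

# Issue the pull asynchronously; the caller does not block here.
pull_task = pool.submit(pull, range(0, 8))

# Dependency: the push task runs only after the pull task has finished.
push_task = pool.submit(lambda: compute_and_push(pull_task.result()))

print(push_task.result())   # the caller waits only when it needs the value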
  30. Trade-off: Asynchronous Call
     • 1000 machines, 800 workers, 200 parameter servers.
     • 16 physical cores, 192 GB DRAM, 10 Gb Ethernet.
  31. Trade-off: Asynchronous Call
     • 1000 machines, 800 workers, 200 parameter servers.
     • 16 physical cores, 192 GB DRAM, 10 Gb Ethernet.
     Asynchronous updates require more iterations to achieve the same objective value.
  32. Assumptions
     • It is OK to lose part of the training dataset
       → Not urgent to recover a failed worker node
       → Recovering a failed server node is critical
     • An approximate solution is good enough
       → Limited inaccuracy is tolerable
       → Relaxed consistency (as long as it converges)
  33. Optimizations
     • Message compression → saves bandwidth
     • Aggregate parameter changes before synchronous replication on the server node
     • Key lists for parameter updates are likely to be the same as in the last iteration
       → cache the key list and send a hash instead (e.g., <1, 3>, <2, 4>, <6, 7.5>, <7, 4.5> …)
     • Filter before transmission:
       • drop gradient updates that are smaller than a threshold
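
A hypothetical sketch of two of these optimizations (names and message format are invented for illustration): small gradient entries are filtered out before transmission, and if the remaining key list matches one the receiver has already cached, only its hash is sent.

import hashlib

def key_list_hash(keys):
    return hashlib.sha1(",".join(map(str, keys)).encode()).hexdigest()

def encode_push(keys, values, receiver_cache, threshold=1e-4):
    # Filter: drop entries whose gradient magnitude is below the threshold.
    kept = [(k, v) for k, v in zip(keys, values) if abs(v) >= threshold]
    keys = [k for k, _ in kept]
    values = [v for _, v in kept]

    # Key-list caching: if the receiver has seen this key list, send only its hash.
    digest = key_list_hash(keys)
    if digest in receiver_cache:
        return {"key_hash": digest, "values": values}
    receiver_cache.add(digest)
    return {"keys": keys, "key_hash": digest, "values": values}

cache = set()
msg1 = encode_push([1, 2, 6, 7], [3.0, 4.0, 7.5, 4.5], cache)    # sends full key list
msg2 = encode_push([1, 2, 6, 7], [2.9, 4.1, 7.4, 4.6], cache)    # sends hash only
print("keys" in msg1, "keys" in msg2)                            # True False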
  34. Network Saving
     • 1000 machines, 800 workers, 200 parameter servers.
     • 16 physical cores, 192 GB DRAM, 10 Gb Ethernet.
  35. Trade-offs
     • Consistency model vs. computing time + waiting time
       • Sequential Consistency (τ = 0)
       • Eventual Consistency (τ = ∞)
       • 1-Bounded-Delay Consistency (τ = 1)
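
All three models can be viewed as one scheduling rule parameterized by τ. An illustrative check (not the system's actual scheduler): iteration t may start only once every iteration up to t - τ - 1 has finished, so τ = 0 forces sequential consistency, a finite τ gives bounded delay, and τ = ∞ never blocks (eventual consistency).

import math

def may_start(t, finished, tau):
    # Bounded delay: every iteration <= t - tau - 1 must already be finished.
    #   tau = 0        -> sequential consistency (wait for all previous iterations)
    #   finite tau > 0 -> bounded-delay consistency
    #   tau = math.inf -> eventual consistency (never wait)
    if math.isinf(tau):
        return True
    return all(i in finished for i in range(t - int(tau)))

finished = {0, 1}                               # iterations already completed
print(may_start(3, finished, tau=0))            # False: iteration 2 not yet done
print(may_start(3, finished, tau=1))            # True: only 0 and 1 are required
print(may_start(3, finished, tau=math.inf))     # True: never blocks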
  36. Discussions
     • Feature selection? Sampling?
     • Trillions of features and trillions of examples in the training dataset → overfitting?
     • Should each worker do multiple iterations before pushing?
     • Diversify the labels each node is assigned (vs. random)?
     • If one worker only pushes trivial parameter changes, its training data is probably not very useful → remove it and re-partition.
     • A hierarchy of server nodes
  37. Assumption / Problem
     • Fact: The total size of the parameters (features) may exceed the capacity of a single machine.
  38. Assumption / Problem
     • Fact: The total size of the parameters (features) may exceed the capacity of a single machine.
     • Assumption: Each instance in the training set only contains a small portion of all features.
  39. Assumption / Problem
     • Fact: The total size of the parameters (features) may exceed the capacity of a single machine.
     • Assumption: Each instance in the training set only contains a small portion of all features.
     • Problem: What if one example contains 90% of the features (trillions of features in total)?
  40.-41. Assumption / Problem (repeated diagram slides; text identical to slide 39)
  42. Sketch-Based Machine Learning Algorithms
     • Sketches are a class of data stream summaries
     • Problem: an infinite number of data items arrive continuously, whereas the memory capacity is bounded by a small size
       • Every item is seen once
     • Approach: typically formed by linear projections of the source data with appropriate (pseudo-)random vectors
     • Goal: use small memory to answer interesting queries with strong precision guarantees
     http://web.engr.illinois.edu/~vvnktrm2/talks/sketch.pdf
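
As a concrete, textbook example of such a summary (not taken from this deck), a tiny Count-Min sketch estimates item frequencies in a stream using a fixed-size table of counters: each item is hashed into one counter per row, and the estimate is the minimum over its counters, which overestimates by a bounded amount.

import hashlib

class CountMinSketch:
    def __init__(self, rows=4, cols=256):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _index(self, row, item):
        digest = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.cols

    def add(self, item, count=1):
        for r in range(self.rows):
            self.table[r][self._index(r, item)] += count

    def estimate(self, item):
        # minimum over the item's counters: an upper bound on its true count
        return min(self.table[r][self._index(r, item)] for r in range(self.rows))

cms = CountMinSketch()
for token in ["a", "b", "a", "c", "a", "b"]:
    cms.add(token)
print(cms.estimate("a"), cms.estimate("b"), cms.estimate("c"))   # 3 2 1 (with high probability)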
  43. Assumption / Problem
     • Assumption: It is OK to calculate updates for models on each portion of the data separately and aggregate the updates.
     • Problem: What about clustering and other ML/DM algorithms?