Apache MXNet Distributed Training Explained In Depth by Viacheslav Kovalevskyi at Big Data Spain 2017

Distributed training is a complex process that can do more harm than good if it is not set up correctly.

https://www.bigdataspain.org/2017/talk/apache-mxnet-distributed-training-explained-in-depth

Big Data Spain 2017
November 16th–17th, Kinépolis Madrid

Big Data Spain

November 30, 2017

Transcript

  1. None
  2. Apache MXNet Distributed Training Explained In Depth Viacheslav Kovalevskyi @b0noi

    https://goo.gl/MaZFkE
  3. Why To Distribute?

  4. * https://github.com/apache/incubator-mxnet/tree/master/example/image-classification#distributed-training

  5. Multi Machine Vs Multi GPU Training

  6. x16

  7. * https://github.com/apache/incubator-mxnet/tree/master/example/image-classification#distributed-training

  8. * https://github.com/apache/incubator-mxnet/tree/master/example/image-classification#distributed-training 1 instance (16 GPUs)

  9. * https://github.com/apache/incubator-mxnet/tree/master/example/image-classification#distributed-training 1 instance (16 GPUs) Not achievable with 1 instance

  10. x16 MXNet cluster

  11. * https://github.com/apache/incubator-mxnet/tree/master/example/image-classification#distributed-training 1 instance (16 GPUs) Not achievable with 1 instance

  12. * https://github.com/apache/incubator-mxnet/tree/master/example/image-classification#distributed-training instances, 16 GPU each

  13. Training Example
      def f(x):
          # a = 5, b = 2
          return 5 * x + 2
      # Data
      X = np.arange(100, step=0.001)
      Y = f(X)
      # Split data for training and evaluation
      X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
  14. Actual training
      model.fit(train_iter, eval_iter,
                optimizer_params={'learning_rate': 0.000000002},
                num_epoch=20,
                eval_metric='mae',
                batch_end_callback=mx.callback.Speedometer(batch_size, 20),
                kvstore="device")
      * https://mxnet.incubator.apache.org/tutorials/python/linear-regression.html
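
      Slides 13 and 14 only show fragments of the example. Below is a sketch of how the pieces fit together, based on the linear-regression tutorial linked on slide 14: the iterators, the single-neuron network, the Module and the batch_size value are assumptions taken from that tutorial rather than from the deck itself, and train_test_split is presumed to be scikit-learn's.

      import numpy as np
      import mxnet as mx
      from sklearn.model_selection import train_test_split

      # The function the model should recover: a = 5, b = 2
      def f(x):
          return 5 * x + 2

      # Data, reshaped to (N, 1) so FullyConnected sees a 2-D batch
      X = np.arange(100, step=0.001).reshape(-1, 1)
      Y = f(X)

      # Split data for training and evaluation
      X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

      batch_size = 32  # assumption; the slide does not show the value
      train_iter = mx.io.NDArrayIter(X_train, Y_train, batch_size,
                                     shuffle=True, label_name='lin_reg_label')
      eval_iter = mx.io.NDArrayIter(X_test, Y_test, batch_size,
                                    shuffle=False, label_name='lin_reg_label')

      # A single fully connected unit is plain linear regression
      data = mx.sym.Variable('data')
      label = mx.sym.Variable('lin_reg_label')
      fc = mx.sym.FullyConnected(data=data, num_hidden=1, name='fc')
      net = mx.sym.LinearRegressionOutput(data=fc, label=label, name='lro')

      model = mx.mod.Module(symbol=net, data_names=['data'],
                            label_names=['lin_reg_label'])

      # Actual training (slide 14); kvstore="device" keeps everything on one machine
      model.fit(train_iter, eval_iter,
                optimizer_params={'learning_rate': 0.000000002},
                num_epoch=20,
                eval_metric='mae',
                batch_end_callback=mx.callback.Speedometer(batch_size, 20),
                kvstore="device")
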
  15. Lab #1 https://goo.gl/MaZFkE

  16. Let’s Distribute

  17. Main Components of a Cluster scheduler worker(s) server(s)

  18. How To Start a Component import os os.environ.update({ "DMLC_ROLE": "scheduler",

    "DMLC_PS_ROOT_URI": "127.0.0.1", "DMLC_PS_ROOT_PORT": "9000", "DMLC_NUM_SERVER": "1", "DMLC_NUM_WORKER": "2", "PS_VERBOSE": "0" }) import mxnet as mx
  19. How To Start a Component import os os.environ.update({ "DMLC_ROLE": "scheduler",

    # Could be "scheduler", "worker" or "server" "DMLC_PS_ROOT_URI": "127.0.0.1", "DMLC_PS_ROOT_PORT": "9000", "DMLC_NUM_SERVER": "1", "DMLC_NUM_WORKER": "2", "PS_VERBOSE": "0" }) import mxnet as mx
  20. How To Start a Component import os os.environ.update({ "DMLC_ROLE": "scheduler",

    # Could be "scheduler", "worker" or "server" "DMLC_PS_ROOT_URI": "127.0.0.1", # IP address of a scheduler "DMLC_PS_ROOT_PORT": "9000", "DMLC_NUM_SERVER": "1", "DMLC_NUM_WORKER": "2", "PS_VERBOSE": "0" }) import mxnet as mx
  21. How To Start a Component import os os.environ.update({ "DMLC_ROLE": "scheduler",

    # Could be "scheduler", "worker" or "server" "DMLC_PS_ROOT_URI": "127.0.0.1", # IP address of a scheduler "DMLC_PS_ROOT_PORT": "9000", # Port of a scheduler "DMLC_NUM_SERVER": "1", "DMLC_NUM_WORKER": "2", "PS_VERBOSE": "0" }) import mxnet as mx
  22. How To Start a Component import os os.environ.update({ "DMLC_ROLE": "scheduler",

    # Could be "scheduler", "worker" or "server" "DMLC_PS_ROOT_URI": "127.0.0.1", # IP address of a scheduler "DMLC_PS_ROOT_PORT": "9000", # Port of a scheduler "DMLC_NUM_SERVER": "1", # Number of servers in cluster "DMLC_NUM_WORKER": "2", "PS_VERBOSE": "0" }) import mxnet as mx
  23. How To Start a Component import os os.environ.update({ "DMLC_ROLE": "scheduler",

    # Could be "scheduler", "worker" or "server" "DMLC_PS_ROOT_URI": "127.0.0.1", # IP address of a scheduler "DMLC_PS_ROOT_PORT": "9000", # Port of a scheduler "DMLC_NUM_SERVER": "1", # Number of servers in cluster "DMLC_NUM_WORKER": "2", # Number of workers in cluster "PS_VERBOSE": "0" }) import mxnet as mx
  24. How To Start a Component import os os.environ.update({ "DMLC_ROLE": "scheduler",

    # Could be "scheduler", "worker" or "server" "DMLC_PS_ROOT_URI": "127.0.0.1", # IP address of a scheduler "DMLC_PS_ROOT_PORT": "9000", # Port of a scheduler "DMLC_NUM_SERVER": "1", # Number of servers in cluster "DMLC_NUM_WORKER": "2", # Number of workers in cluster "PS_VERBOSE": "0" # Could be 0, 1 or 2 }) import mxnet as mx
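
      Reassembled from slide 24, the same configuration as a runnable snippet. One point worth stressing (my reading of how the DMLC parameter server behaves, not something stated on the slide): the environment must be set before `import mxnet`, because the role is picked up at import time, and for the scheduler and server roles the import itself typically starts the parameter-server loop.

      import os

      # All of this has to happen BEFORE importing mxnet
      os.environ.update({
          "DMLC_ROLE": "scheduler",        # Could be "scheduler", "worker" or "server"
          "DMLC_PS_ROOT_URI": "127.0.0.1", # IP address of the scheduler
          "DMLC_PS_ROOT_PORT": "9000",     # Port of the scheduler
          "DMLC_NUM_SERVER": "1",          # Number of servers in the cluster
          "DMLC_NUM_WORKER": "2",          # Number of workers in the cluster
          "PS_VERBOSE": "0",               # Could be 0, 1 or 2
      })

      import mxnet as mx
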
  25. Physical instance Our Test Cluster 1x scheduler 1x worker 1x

    server
  26. Lab #2 (example_cluster)

  27. python start_scheduler.py & python start_server.py & python start_worker.py Let's Bootstrap Our First Cluster

  28. python start_scheduler.py & python start_server.py & python start_worker.py Let's Bootstrap Our First Cluster
  29. 1x scheduler (1) 1x worker (?) 1x server (?) Meta:

    request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=172.31.99.98, port=62 Hey scheduler, I’m server, I’m up, my rank is ? please add me to the cluster on server
  30. 1x scheduler (1) 1x worker (?) 1x server (?) Meta:

    request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=172.31.99.98, port=62 Hey scheduler, I’m server, I’m up, my rank is ? please add me to the cluster Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=172.31.99.98, port=62 I'm confirming that I got: “Hey scheduler, I’m server, I’m up, my rank is ? please add me to the cluster” on server on scheduler
  31. 1x scheduler (1) 1x worker (?) 1x server (?) Hey

    scheduler, I’m worker, I’m up, my rank is ? please add me to the cluster Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=172.31.99.98, port=6 on worker
  32. 1x scheduler (1) 1x worker (?) 1x server (?) Assigning

    rank 8 to the server src/van.cc:235: assign rank=8 to node role=server, ip=172.31.99.98, port=62263, is_recovery=0 on scheduler
  33. 1x scheduler (1) 1x worker (?) 1x server (?) Assigning

    rank 9 to the worker src/van.cc:235: assign rank=8 to node role=server, ip=172.31.99.98, port=62263, is_recovery=0 src/van.cc:235: assign rank=9 to node role=worker, ip=172.31.99.98, port=62427, is_recovery=0 on scheduler on scheduler
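
      Why rank 8 for the first server and rank 9 for the first worker? As far as I can tell this comes from ps-lite's internal node-id scheme (an assumption from reading the ps-lite source, not something stated on the slides): the scheduler is always node 1, servers get even ids starting at 8, workers get odd ids starting at 9.

      # Presumed ps-lite node-id convention (not from the slides)
      def server_id(rank):
          return 8 + 2 * rank   # first server -> 8, second -> 10, ...

      def worker_id(rank):
          return 9 + 2 * rank   # first worker -> 9, second -> 11, ...

      print(server_id(0), worker_id(0))  # 8 9, matching the log lines above
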
  34. 1x scheduler (1) 1x worker (?) 1x server (?) ={

    role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.98, por Hey, worker, you are now part of the cluster with rank 9 on scheduler
  35. 1x scheduler (1) 1x worker (?) 1x server (?) ={

    role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.98, por Hey, server, you are now part of the cluster with rank 8 ={ role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.98, por on scheduler on scheduler
  36. 1x scheduler (1) 1x worker (?) 1x server (?) src/van.cc:251:

    the scheduler is connected to 1 workers and 1 servers on scheduler
  37. 1x scheduler (1) 1x worker (?) 1x server (8) node={

    role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.9 src/van.cc:281: S[8] is connected to others Finally I’m connected and have rank 8 on server on server
  38. 1x scheduler (1) 1x worker (9) 1x server (8) Finally

    I’m connected and have rank 9 node={ role=server, id=8, ip=172.31.99.98, port=62572, is_recovery=0 role=worker, id=9, ip=172.31.99.9 src/van.cc:281: W[9] is connected to others on worker on worker
  39. 1x scheduler (1) 1x worker (9) 1x server (8) I

    have reached barrier on worker src/van.cc:136: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 } on server on scheduler I have reached barrier I have reached barrier
  40. 1x scheduler (1) 1x worker (9) 1x server (8) 3

    nodes have reached the barrier, looks like the whole gang is here src/van.cc:161: 1 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 } src/van.cc:291: Barrier count for 7 : 1 src/van.cc:161: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 } src/van.cc:291: Barrier count for 7 : 2 src/van.cc:161: 9 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 } src/van.cc:291: Barrier count for 7 : 3 on scheduler
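
      The barrier_group=7 in these log lines is a bitmask over node groups. Assuming ps-lite's usual constants (scheduler = 1, server group = 2, worker group = 4, an assumption on my part, not shown in the deck), 7 simply means "every node in the cluster must reach this barrier":

      # Presumed ps-lite group constants; 7 == 1 | 2 | 4, i.e. all node types
      SCHEDULER    = 1
      SERVER_GROUP = 2
      WORKER_GROUP = 4
      print(SCHEDULER | SERVER_GROUP | WORKER_GROUP)  # 7
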
  41. 1x scheduler (1) 1x worker (9) 1x server (8) Hey

    server and worker, you are free to go, barrier has been removed. on scheduler src/van.cc:136: ? => 9. Meta: request=0, timestamp=3, control={ cmd=BARRIER, barrier_group=0 } src/van.cc:136: ? => 8. Meta: request=0, timestamp=4, control={ cmd=BARRIER, barrier_group=0 }
  42. 1x scheduler (1) 1x worker (9) 1x server (8) I

    will wait for you all at the next barrier on scheduler src/van.cc:136: ? => 1. Meta: request=1, timestamp=6, control={ cmd=BARRIER, barrier_group=7 } src/van.cc:161: 1 => 1. Meta: request=1, timestamp=6, control={ cmd=BARRIER, barrier_group=7 } src/van.cc:291: Barrier count for 7 : 1
  43. 1x scheduler (1) 1x worker (9) 1x server (8)
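
      Once the handshake above completes, a worker exercises the server through a distributed key-value store. A minimal sketch of that step (not from the deck; it assumes the DMLC_* variables are already exported for the process and uses the standard MXNet kvstore API):

      import mxnet as mx

      # Creating a distributed kvstore connects this process to the scheduler
      kv = mx.kvstore.create('dist_sync')

      shape = (2, 3)
      kv.init(3, mx.nd.ones(shape))                   # key 3 now lives on the server
      kv.push(3, mx.nd.ones(shape) * (kv.rank + 1))   # each worker's push is aggregated on the server
      out = mx.nd.zeros(shape)
      kv.pull(3, out=out)                             # pull back the aggregated value
      print("worker", kv.rank, "of", kv.num_workers, out.asnumpy())
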

  44. Lab 3 (multi_worker_cluster)

  45. More Workers (multi_worker_cluster)

  46. 1x scheduler 1x server 2x workers
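
      To go from the single-machine example on slide 14 to this two-worker cluster, two things change on the worker side. The sketch below is not from the deck: the env values are assumptions for a local test, and the kvstore name is the standard distributed one ("dist_sync", or "dist_device_sync" to aggregate gradients on GPUs).

      import os
      os.environ.update({
          "DMLC_ROLE": "worker",
          "DMLC_PS_ROOT_URI": "127.0.0.1",
          "DMLC_PS_ROOT_PORT": "9000",
          "DMLC_NUM_SERVER": "1",
          "DMLC_NUM_WORKER": "2",   # two workers now
      })
      import mxnet as mx

      # ... build train_iter, eval_iter and model as before ...
      # The only change in the training call is the kvstore argument:
      # model.fit(train_iter, eval_iter, ..., kvstore="dist_sync")
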

  47. Common Misconception

  48. https://stackoverflow.com/questions/46460492

  49. None
  50. None
  51. None
  52. Thank You!