

Apache MXNet Distributed Training Explained In Depth by Viacheslav Kovalevskyi at Big Data Spain 2017

Distributed training is a complex process that does more harm than good if it is not set up correctly.

https://www.bigdataspain.org/2017/talk/apache-mxnet-distributed-training-explained-in-depth

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

November 30, 2017


Transcript

  1. x16

  2. Training Example

        def f(x):
            # a = 5
            # b = 2
            return 5 * x + 2

        # Data
        X = np.arange(100, step=0.001)
        Y = f(X)

        # Split data for training and evaluation
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
  3. Actual training

        model.fit(train_iter, eval_iter,
                  optimizer_params={'learning_rate': 0.000000002},
                  num_epoch=20,
                  eval_metric='mae',
                  batch_end_callback=mx.callback.Speedometer(batch_size, 20),
                  kvstore="device")

    * https://mxnet.incubator.apache.org/tutorials/python/linear-regression.html
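Slide 3 passes `kvstore="device"`, which keeps gradient aggregation on a single machine. The sketch below shows one way the kvstore string might be chosen when the same `model.fit` call is moved to a cluster. The helper name `pick_kvstore` and its parameters are hypothetical and not from the talk; the returned strings are the kvstore types MXNet documents, and should be checked against the MXNet version in use.

```python
def pick_kvstore(distributed: bool,
                 synchronous: bool = True,
                 aggregate_on_gpu: bool = False) -> str:
    """Return a kvstore type string for model.fit(..., kvstore=...).

    Hypothetical helper: the decision logic is an illustration only.
    """
    if not distributed:
        # Single machine: aggregate gradients on GPU ("device") or CPU ("local").
        return "device" if aggregate_on_gpu else "local"
    if not synchronous:
        # Each worker pushes and pulls without waiting for the others.
        return "dist_async"
    # Synchronous distributed training, optionally aggregating on GPU.
    return "dist_device_sync" if aggregate_on_gpu else "dist_sync"


if __name__ == "__main__":
    print(pick_kvstore(distributed=False, aggregate_on_gpu=True))
    print(pick_kvstore(distributed=True))
```

The distributed variants only make sense once a scheduler, servers and workers are running, which is what the following slides set up.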
  4. How To Start a Component

        import os
        os.environ.update({
            "DMLC_ROLE": "scheduler",
            "DMLC_PS_ROOT_URI": "127.0.0.1",
            "DMLC_PS_ROOT_PORT": "9000",
            "DMLC_NUM_SERVER": "1",
            "DMLC_NUM_WORKER": "2",
            "PS_VERBOSE": "0"
        })
        import mxnet as mx
  5. How To Start a Component

        import os
        os.environ.update({
            "DMLC_ROLE": "scheduler",         # Could be "scheduler", "worker" or "server"
            "DMLC_PS_ROOT_URI": "127.0.0.1",
            "DMLC_PS_ROOT_PORT": "9000",
            "DMLC_NUM_SERVER": "1",
            "DMLC_NUM_WORKER": "2",
            "PS_VERBOSE": "0"
        })
        import mxnet as mx
  6. How To Start a Component

        import os
        os.environ.update({
            "DMLC_ROLE": "scheduler",         # Could be "scheduler", "worker" or "server"
            "DMLC_PS_ROOT_URI": "127.0.0.1",  # IP address of a scheduler
            "DMLC_PS_ROOT_PORT": "9000",
            "DMLC_NUM_SERVER": "1",
            "DMLC_NUM_WORKER": "2",
            "PS_VERBOSE": "0"
        })
        import mxnet as mx
  7. How To Start a Component

        import os
        os.environ.update({
            "DMLC_ROLE": "scheduler",         # Could be "scheduler", "worker" or "server"
            "DMLC_PS_ROOT_URI": "127.0.0.1",  # IP address of a scheduler
            "DMLC_PS_ROOT_PORT": "9000",      # Port of a scheduler
            "DMLC_NUM_SERVER": "1",
            "DMLC_NUM_WORKER": "2",
            "PS_VERBOSE": "0"
        })
        import mxnet as mx
  8. How To Start a Component

        import os
        os.environ.update({
            "DMLC_ROLE": "scheduler",         # Could be "scheduler", "worker" or "server"
            "DMLC_PS_ROOT_URI": "127.0.0.1",  # IP address of a scheduler
            "DMLC_PS_ROOT_PORT": "9000",      # Port of a scheduler
            "DMLC_NUM_SERVER": "1",           # Number of servers in cluster
            "DMLC_NUM_WORKER": "2",
            "PS_VERBOSE": "0"
        })
        import mxnet as mx
  9. How To Start a Component

        import os
        os.environ.update({
            "DMLC_ROLE": "scheduler",         # Could be "scheduler", "worker" or "server"
            "DMLC_PS_ROOT_URI": "127.0.0.1",  # IP address of a scheduler
            "DMLC_PS_ROOT_PORT": "9000",      # Port of a scheduler
            "DMLC_NUM_SERVER": "1",           # Number of servers in cluster
            "DMLC_NUM_WORKER": "2",           # Number of workers in cluster
            "PS_VERBOSE": "0"
        })
        import mxnet as mx
  10. How To Start a Component

        import os
        os.environ.update({
            "DMLC_ROLE": "scheduler",         # Could be "scheduler", "worker" or "server"
            "DMLC_PS_ROOT_URI": "127.0.0.1",  # IP address of a scheduler
            "DMLC_PS_ROOT_PORT": "9000",      # Port of a scheduler
            "DMLC_NUM_SERVER": "1",           # Number of servers in cluster
            "DMLC_NUM_WORKER": "2",           # Number of workers in cluster
            "PS_VERBOSE": "0"                 # Could be 0, 1 or 2
        })
        import mxnet as mx
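Every component is started with the same set of variables and differs only in `DMLC_ROLE`, so the dictionary from the slides can be wrapped in a small helper. This is a minimal sketch: the function name `dmlc_env` and its defaults are hypothetical, and only the variable names and allowed values come from the slides.

```python
import os

VALID_ROLES = ("scheduler", "worker", "server")
VALID_VERBOSITY = (0, 1, 2)


def dmlc_env(role, root_uri="127.0.0.1", root_port=9000,
             num_server=1, num_worker=2, verbose=0):
    """Build the environment for one cluster component (hypothetical helper)."""
    if role not in VALID_ROLES:
        raise ValueError(f"role must be one of {VALID_ROLES}, got {role!r}")
    if verbose not in VALID_VERBOSITY:
        raise ValueError(f"verbose must be one of {VALID_VERBOSITY}, got {verbose!r}")
    # Environment variables must be strings, as on the slides.
    return {
        "DMLC_ROLE": role,
        "DMLC_PS_ROOT_URI": root_uri,         # IP address of the scheduler
        "DMLC_PS_ROOT_PORT": str(root_port),  # Port of the scheduler
        "DMLC_NUM_SERVER": str(num_server),   # Number of servers in cluster
        "DMLC_NUM_WORKER": str(num_worker),   # Number of workers in cluster
        "PS_VERBOSE": str(verbose),
    }


if __name__ == "__main__":
    # As on the slides, the environment is set before `import mxnet as mx`.
    os.environ.update(dmlc_env("worker", verbose=2))
    print(os.environ["DMLC_ROLE"])
```

Each process in the cluster would call this with its own role while keeping the scheduler address and the server and worker counts identical everywhere.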
  11. 1x scheduler (1) 1x worker (?) 1x server (?)

    On server: "Hey scheduler, I’m server, I’m up, my rank is ?, please add me to the cluster"

        Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=172.31.99.98, port=62
  12. 1x scheduler (1) 1x worker (?) 1x server (?)

    On server: "Hey scheduler, I’m server, I’m up, my rank is ?, please add me to the cluster"

        Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=172.31.99.98, port=62

    On scheduler: I'm confirming that I got: "Hey scheduler, I’m server, I’m up, my rank is ?, please add me to the cluster"

        Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=172.31.99.98, port=62
  13. 1x scheduler (1) 1x worker (?) 1x server (?)

    On worker: "Hey scheduler, I’m worker, I’m up, my rank is ?, please add me to the cluster"

        Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=172.31.99.98, port=6
  14. 1x scheduler (1) 1x worker (?) 1x server (?)

    On scheduler: Assigning rank 8 to the server

        src/van.cc:235: assign rank=8 to node role=server, ip=172.31.99.98, port=62263, is_recovery=0
  15. 1x scheduler (1) 1x worker (?) 1x server (?)

    On scheduler: Assigning rank 9 to the worker

        src/van.cc:235: assign rank=8 to node role=server, ip=172.31.99.98, port=62263, is_recovery=0
        src/van.cc:235: assign rank=9 to node role=worker, ip=172.31.99.98, port=62427, is_recovery=0
  16. 1x scheduler (1) 1x worker (?) 1x server (?)

    On scheduler: "Hey, worker, you are now part of the cluster with rank 9"

        ={ role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.98, por
  17. 1x scheduler (1) 1x worker (?) 1x server (?)

    On scheduler: "Hey, server, you are now part of the cluster with rank 8"

        ={ role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.98, por
        ={ role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.98, por
  18. 1x scheduler (1) 1x worker (?) 1x server (?)

    On scheduler:

        src/van.cc:251: the scheduler is connected to 1 workers and 1 servers
  19. 1x scheduler (1) 1x worker (?) 1x server (8)

    On server: "Finally I’m connected and have rank 8"

        node={ role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.9
        src/van.cc:281: S[8] is connected to others
  20. 1x scheduler (1) 1x worker (9) 1x server (8)

    On worker: "Finally I’m connected and have rank 9"

        node={ role=server, id=8, ip=172.31.99.98, port=62572, is_recovery=0 role=worker, id=9, ip=172.31.99.9
        src/van.cc:281: W[9] is connected to others
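The values 8 and 9 that the scheduler hands out in the logs above are consistent with the numbering scheme used by ps-lite, the parameter-server library underneath MXNet: small IDs are reserved (1 for the scheduler, 2 and 4 for the server and worker groups), servers get even IDs starting at 8, and workers get odd IDs starting at 9. The sketch below encodes that scheme under the assumption that it matches the ps-lite version in use; the function names are illustrative, not ps-lite's API.

```python
SCHEDULER_ID = 1       # assumed reserved ID of the scheduler
FIRST_SERVER_ID = 8    # assumed first server ID; servers take even IDs
FIRST_WORKER_ID = 9    # assumed first worker ID; workers take odd IDs


def server_rank_to_id(rank: int) -> int:
    """0-based server rank -> node ID (8, 10, 12, ...)."""
    return rank * 2 + FIRST_SERVER_ID


def worker_rank_to_id(rank: int) -> int:
    """0-based worker rank -> node ID (9, 11, 13, ...)."""
    return rank * 2 + FIRST_WORKER_ID


def role_of(node_id: int) -> str:
    """Recover the role from a node ID as printed in the logs."""
    if node_id == SCHEDULER_ID:
        return "scheduler"
    if node_id < FIRST_SERVER_ID:
        raise ValueError(f"{node_id} is a reserved ID, not a single node")
    return "server" if node_id % 2 == 0 else "worker"


if __name__ == "__main__":
    for node_id in (1, 8, 9):
        print(node_id, role_of(node_id))
```

Under this assumption, a cluster with two workers would log `W[9]` and `W[11]`, while its single server would still be `S[8]`.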
  21. 1x scheduler (1) 1x worker (9) 1x server (8)

    On worker, on server and on scheduler: "I have reached barrier"

        src/van.cc:136: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
  22. 1x scheduler (1) 1x worker (9) 1x server (8)

    On scheduler: 3 nodes have reached the barrier, looks like the whole gang is here

        src/van.cc:161: 1 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }
        src/van.cc:291: Barrier count for 7 : 1
        src/van.cc:161: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
        src/van.cc:291: Barrier count for 7 : 2
        src/van.cc:161: 9 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
        src/van.cc:291: Barrier count for 7 : 3
  23. 1x scheduler (1) 1x worker (9) 1x server (8)

    On scheduler: "Hey server and worker, you are free to go, barrier has been removed."

        src/van.cc:136: ? => 9. Meta: request=0, timestamp=3, control={ cmd=BARRIER, barrier_group=0 }
        src/van.cc:136: ? => 8. Meta: request=0, timestamp=4, control={ cmd=BARRIER, barrier_group=0 }
  24. 1x scheduler (1) 1x worker (9) 1x server (8)

    On scheduler: "I will wait for you all at the next barrier"

        src/van.cc:136: ? => 1. Meta: request=1, timestamp=6, control={ cmd=BARRIER, barrier_group=7 }
        src/van.cc:161: 1 => 1. Meta: request=1, timestamp=6, control={ cmd=BARRIER, barrier_group=7 }
        src/van.cc:291: Barrier count for 7 : 1
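The barrier exchange on slides 21 to 24 can be reproduced with a small counter. This is a simplified simulation, not the ps-lite implementation: the class `BarrierScheduler` is hypothetical, and it assumes that `barrier_group` is a bitmask in which 1 stands for the scheduler, 2 for the server group and 4 for the worker group, so that 7 means "everyone".

```python
from dataclasses import dataclass, field

SCHEDULER, SERVER_GROUP, WORKER_GROUP = 1, 2, 4  # assumed bitmask values


def expected_nodes(group: int, num_servers: int, num_workers: int) -> int:
    """Number of nodes that must arrive before the barrier opens."""
    total = 0
    if group & SCHEDULER:
        total += 1
    if group & SERVER_GROUP:
        total += num_servers
    if group & WORKER_GROUP:
        total += num_workers
    return total


@dataclass
class BarrierScheduler:
    num_servers: int
    num_workers: int
    counts: dict = field(default_factory=dict)   # barrier_group -> arrivals
    waiting: dict = field(default_factory=dict)  # barrier_group -> node IDs

    def on_barrier_request(self, sender_id: int, group: int) -> list:
        """Record one arrival; return the node IDs to release, if any."""
        self.counts[group] = self.counts.get(group, 0) + 1
        self.waiting.setdefault(group, []).append(sender_id)
        print(f"Barrier count for {group} : {self.counts[group]}")
        needed = expected_nodes(group, self.num_servers, self.num_workers)
        if self.counts[group] < needed:
            return []
        # Everyone is here: reset the counter and release the waiters.
        self.counts[group] = 0
        return self.waiting.pop(group)


if __name__ == "__main__":
    scheduler = BarrierScheduler(num_servers=1, num_workers=1)
    for node_id in (1, 8, 9):
        released = scheduler.on_barrier_request(node_id, group=7)
    print("released:", released)
```

With one server and one worker, the third request releases all three nodes, and the next request starts a fresh count at 1, matching the `Barrier count for 7 : 1` line on the last slide.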