Apache MXNet Distributed Training Explained In Depth by Viacheslav Kovalevskyi at Big Data Spain 2017

Distributed training is a complex process that does more harm than good if it is not set up correctly.

https://www.bigdataspain.org/2017/talk/apache-mxnet-distributed-training-explained-in-depth

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

November 30, 2017

Transcript

  1. x16

  2. Training Example

         def f(x):
             # a = 5
             # b = 2
             return 5 * x + 2

         # Data
         X = np.arange(100, step=0.001)
         Y = f(X)

         # Split data for training and evaluation
         X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
  3. Actual training

         model.fit(train_iter, eval_iter,
                   optimizer_params={'learning_rate': 0.000000002},
                   num_epoch=20,
                   eval_metric='mae',
                   batch_end_callback=mx.callback.Speedometer(batch_size, 20),
                   kvstore="device")

     * https://mxnet.incubator.apache.org/tutorials/python/linear-regression.html
  10. How To Start a Component

          import os

          os.environ.update({
              "DMLC_ROLE": "scheduler",         # Could be "scheduler", "worker" or "server"
              "DMLC_PS_ROOT_URI": "127.0.0.1",  # IP address of a scheduler
              "DMLC_PS_ROOT_PORT": "9000",      # Port of a scheduler
              "DMLC_NUM_SERVER": "1",           # Number of servers in cluster
              "DMLC_NUM_WORKER": "2",           # Number of workers in cluster
              "PS_VERBOSE": "0"                 # Could be 0, 1 or 2
          })

          import mxnet as mx
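
Every component in the cluster gets the same settings; only `DMLC_ROLE` changes. The sketch below wraps the variables from the slide in a small helper so that the three roles cannot drift apart. The helper name `dmlc_env` and its default values are assumptions made for illustration and are not part of MXNet.

```python
VALID_ROLES = ("scheduler", "worker", "server")


def dmlc_env(role, root_uri="127.0.0.1", root_port=9000,
             num_server=1, num_worker=2, verbose=0):
    """Build the environment for one component (hypothetical helper)."""
    if role not in VALID_ROLES:
        raise ValueError(f"role must be one of {VALID_ROLES}, got {role!r}")
    if verbose not in (0, 1, 2):
        raise ValueError("verbose must be 0, 1 or 2")
    # Environment variables are always strings, so convert the numbers here.
    return {
        "DMLC_ROLE": role,
        "DMLC_PS_ROOT_URI": root_uri,
        "DMLC_PS_ROOT_PORT": str(root_port),
        "DMLC_NUM_SERVER": str(num_server),
        "DMLC_NUM_WORKER": str(num_worker),
        "PS_VERBOSE": str(verbose),
    }


# One environment per role; everything except DMLC_ROLE is identical.
envs = {role: dmlc_env(role) for role in VALID_ROLES}
```

In each process you would pass the matching dictionary to `os.environ.update(...)` before `import mxnet`, in the same order as on the slide.
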
  11. 1x scheduler (1), 1x worker (?), 1x server (?)

      On server: Hey scheduler, I’m server, I’m up, my rank is ? please add me to the cluster

          Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=172.31.99.98, port=62
  12. 1x scheduler (1), 1x worker (?), 1x server (?)

      On server: Hey scheduler, I’m server, I’m up, my rank is ? please add me to the cluster

          Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=172.31.99.98, port=62

      On scheduler: I'm confirming that I got: “Hey scheduler, I’m server, I’m up, my rank is ? please add me to the cluster”

          Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=172.31.99.98, port=62
  13. 1x scheduler (1), 1x worker (?), 1x server (?)

      On worker: Hey scheduler, I’m worker, I’m up, my rank is ? please add me to the cluster

          Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=172.31.99.98, port=6
  14. 1x scheduler (1), 1x worker (?), 1x server (?)

      On scheduler: Assigning rank 8 to the server

          src/van.cc:235: assign rank=8 to node role=server, ip=172.31.99.98, port=62263, is_recovery=0
  15. 1x scheduler (1), 1x worker (?), 1x server (?)

      On scheduler: Assigning rank 9 to the worker

          src/van.cc:235: assign rank=8 to node role=server, ip=172.31.99.98, port=62263, is_recovery=0
          src/van.cc:235: assign rank=9 to node role=worker, ip=172.31.99.98, port=62427, is_recovery=0
  16. 1x scheduler (1), 1x worker (?), 1x server (?)

      On scheduler: Hey, worker, you are now part of the cluster with rank 9

          ={ role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.98, por
  17. 1x scheduler (1), 1x worker (?), 1x server (?)

      On scheduler: Hey, server, you are now part of the cluster with rank 8

          ={ role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.98, por
          ={ role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.98, por
  18. 1x scheduler (1), 1x worker (?), 1x server (?)

      On scheduler:

          src/van.cc:251: the scheduler is connected to 1 workers and 1 servers
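
The registration exchange above can be sketched as a small simulation of the scheduler's bookkeeping. It assumes the numbering scheme suggested by the log: the scheduler has id 1, servers receive even ids starting at 8, and workers receive odd ids starting at 9. The function `assign_ids` and the dictionary layout are hypothetical names for illustration, not the actual code in `src/van.cc`.

```python
SCHEDULER_ID = 1
# Assumed id scheme: first server gets 8, first worker gets 9,
# and each further node of the same role is two ids higher.
BASE_ID = {"server": 8, "worker": 9}


def assign_ids(registrations):
    """Assign node ids to ADD_NODE requests in arrival order.

    `registrations` is a list of (role, ip, port) tuples.
    """
    next_rank = {"server": 0, "worker": 0}
    nodes = []
    for role, ip, port in registrations:
        rank = next_rank[role]
        next_rank[role] += 1
        nodes.append({"role": role, "id": BASE_ID[role] + 2 * rank,
                      "ip": ip, "port": port})
    return nodes


# The two ADD_NODE requests seen in the scheduler log.
nodes = assign_ids([
    ("server", "172.31.99.98", 62263),
    ("worker", "172.31.99.98", 62427),
])
for node in nodes:
    print("assign rank={id} to node role={role}, ip={ip}, port={port}".format(**node))
```

Once every expected node has registered, the scheduler sends the full node list back to each of them, which is why the server and the worker both see the same `role=server, id=8 ... role=worker, id=9` payload.
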
  19. 1x scheduler (1), 1x worker (?), 1x server (8)

      On server: Finally I’m connected and have rank 8

          node={ role=server, id=8, ip=172.31.99.98, port=62263, is_recovery=0 role=worker, id=9, ip=172.31.99.9
          src/van.cc:281: S[8] is connected to others
  20. 1x scheduler (1), 1x worker (9), 1x server (8)

      On worker: Finally I’m connected and have rank 9

          node={ role=server, id=8, ip=172.31.99.98, port=62572, is_recovery=0 role=worker, id=9, ip=172.31.99.9
          src/van.cc:281: W[9] is connected to others
  21. 1x scheduler (1), 1x worker (9), 1x server (8)

      On worker: I have reached barrier
      On server: I have reached barrier
      On scheduler: I have reached barrier

          src/van.cc:136: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
  22. 1x scheduler (1), 1x worker (9), 1x server (8)

      On scheduler: 3 nodes have reached the barrier, looks like the whole gang is here

          src/van.cc:161: 1 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }
          src/van.cc:291: Barrier count for 7 : 1
          src/van.cc:161: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
          src/van.cc:291: Barrier count for 7 : 2
          src/van.cc:161: 9 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
          src/van.cc:291: Barrier count for 7 : 3
  23. 1x scheduler (1), 1x worker (9), 1x server (8)

      On scheduler: Hey server and worker, you are free to go, the barrier has been removed.

          src/van.cc:136: ? => 9. Meta: request=0, timestamp=3, control={ cmd=BARRIER, barrier_group=0 }
          src/van.cc:136: ? => 8. Meta: request=0, timestamp=4, control={ cmd=BARRIER, barrier_group=0 }
  24. 1x scheduler (1), 1x worker (9), 1x server (8)

      On scheduler: I will wait for you all at the next barrier

          src/van.cc:136: ? => 1. Meta: request=1, timestamp=6, control={ cmd=BARRIER, barrier_group=7 }
          src/van.cc:161: 1 => 1. Meta: request=1, timestamp=6, control={ cmd=BARRIER, barrier_group=7 }
          src/van.cc:291: Barrier count for 7 : 1
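
The barrier counting in the scheduler log can be sketched as follows. The sketch assumes that `barrier_group` is a bitmask in which 1 stands for the scheduler, 2 for the server group and 4 for the worker group, so that `barrier_group=7` means "everyone". The class `Barrier` is a hypothetical illustration of the bookkeeping, not the code behind `src/van.cc`.

```python
SCHEDULER_GROUP, SERVER_GROUP, WORKER_GROUP = 1, 2, 4  # assumed bit values


class Barrier:
    """Count BARRIER requests per group and release once all have arrived."""

    def __init__(self, num_servers, num_workers):
        self.sizes = {SCHEDULER_GROUP: 1,
                      SERVER_GROUP: num_servers,
                      WORKER_GROUP: num_workers}
        self.counts = {}

    def expected(self, group):
        # Sum the sizes of every group whose bit is set in the mask.
        return sum(size for bit, size in self.sizes.items() if group & bit)

    def arrive(self, group):
        """Record one request; return True when the barrier is released."""
        self.counts[group] = self.counts.get(group, 0) + 1
        print(f"Barrier count for {group} : {self.counts[group]}")
        if self.counts[group] == self.expected(group):
            self.counts[group] = 0  # reset so the same group can be reused
            return True
        return False


barrier = Barrier(num_servers=1, num_workers=1)
# Scheduler (1), server (8) and worker (9) each report to the scheduler.
results = [barrier.arrive(7) for _ in range(3)]
```

With one server and one worker, the third request releases the barrier. Because the count is reset on release, the next request for group 7 starts again at 1, as in the last log line of the transcript.
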