
Distributed Machine Learning - Challenges & Opportunities

Talk presented at Fifth Elephant 2017.


Anand Chitipothu

July 27, 2017


  1. Distributed Machine Learning - Challenges & Opportunities Anand Chitipothu @anandology

  2. Who is Speaking? Anand Chitipothu @anandology • Building a data science platform at @rorodata • Advanced programming courses at @pipalacademy • Worked at Strand Life Sciences and Internet Archive
  3. Motivation • Training ML models often takes a long time • A distributed approach is scalable and effective • The existing tools for distributed training are not simple to use
  4. Simple Interfaces Programming: Python print("Hello world!") Machine Learning: Scikit-learn model.fit(...) model.predict(...)
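The two-call interface on this slide can be seen end-to-end in a minimal sketch (the model, toy data, and values below are illustrative, not from the talk):

```python
from sklearn.linear_model import LinearRegression

# Tiny toy dataset following y = 2x.
X = [[0], [1], [2], [3]]
y = [0, 2, 4, 6]

model = LinearRegression()
model.fit(X, y)              # the entire training step
pred = model.predict([[4]])  # the entire inference step
print(pred[0])               # close to 8.0
```

The same fit/predict pair works unchanged across scikit-learn estimators, which is exactly the simplicity the slide is pointing at.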
  5. Simple Interfaces Deploying Functions: Firefly $ firefly credit_risk.predict Distributed Machine Learning: ???
  6. Machine Learning - Traditional Workflow Typical workflow for building an ML model: • Data Preparation • Feature Extraction • Model Training • Hyperparameter Optimization / Grid Search
  7. While that is going on... https://www.xkcd.com/303/

  8. Opportunities Grid search is one of the most time-consuming parts and has the potential to be parallelized.
  9. Parallelization Patterns • Data parallelism • Task parallelism

  10. Data Parallelism

  11. Data Parallelism - Examples • GPU computation • OpenMP, MPI • Spark ML algorithms • Map-Reduce
  12. Task Parallelism

  13. Task Parallelism - Examples • Grid Search • Task queues like Celery
  14. How to Parallelize Grid Search? The scikit-learn library for Python has an out-of-the-box solution to parallelize this: grid_search = GridSearchCV(model, parameters, n_jobs=4) But it is limited to one computer! Can we run it on multiple computers?
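The slide's one-liner can be fleshed out into a runnable example (the estimator, dataset, and parameter grid here are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

parameters = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}

# n_jobs=4 spreads the grid points over 4 local worker processes --
# parallel, but confined to a single machine.
grid_search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                           parameters, n_jobs=4, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Each (parameter setting, cross-validation fold) pair is an independent task, which is why this parallelizes so cleanly, and why the single-machine limit of `n_jobs` is the natural thing to attack next.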
  15. Distributed Machine Learning Run ML algorithms on a cluster. Advantages: • Horizontal Scaling
  16. Available Solutions for Distributed ML • Apache Spark • Dask

  17. Challenges • Requires setting up and managing a cluster of computers • A non-trivial task for a data scientist to manage • How to start it on demand and shut it down when unused? Is it possible to have a simple interface that a data scientist can manage on his/her own?
  18. Our Experiments

  19. Compute Platform We've built a compute platform for running jobs in the cloud. $ run-job python model_training.py created new job 9845a3bd4.
  20. Behind the Scenes • Picks an available instance in the cloud (or starts a new one) • Runs a Docker container with the appropriate image • Exposes the required ports and sets up a URL endpoint to access it • Manages a shared disk across all the jobs
  21. The Magic Running on a 16-core instance is just a flag away. $ run-job -i C16 python model_training.py created new job 8f40f02f.
  22.

  23. Distributed Machine Learning We've implemented a multiprocessing.Pool-like interface that runs on top of our compute platform. pool = DistributedPool(n=5) results = pool.map(square, range(100)) pool.close() Starts 5 distributed jobs to share the work.
  24. Scikit-learn Integration Extended the distributed interface to support scikit-learn. from distributed_scikit import GridSearchCV grid_search = GridSearchCV( GradientBoostingRegressor(), parameters, n_jobs=16) A distributed pool with n_jobs workers will be created to distribute the tasks.
  25. Advantages • Simplicity • No manual setup required • Works from the familiar notebook interface • Option to run on spot instances (without any additional setup)
  26. Future Work • Distributed training of ensemble models • Distributed deep learning
  27. Summary • With ever-increasing datasets, distributed training will be more effective than single-node approaches • Abstracting away the complexity of distributed learning can improve time-to-market
  28. Questions?