Slide 1

Slide 1 text

Distributed Machine Learning: Challenges & Opportunities
Anand Chittipothu @anandology

Slide 2

Slide 2 text

Who is Speaking?
Anand Chittipothu @anandology
• Building a data science platform at @rorodata
• Advanced programming courses at @pipalacademy
• Worked at Strand Life Sciences and Internet Archive

Slide 3

Slide 3 text

Motivation
• Training ML models often takes a long time
• A distributed approach is very scalable and effective
• The existing tools for distributed training are not simple to use

Slide 4

Slide 4 text

Simple Interfaces
Programming: Python
    print("Hello world!")
Machine Learning: Scikit-learn
    model.fit(...)
    model.predict(...)
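For illustration, a minimal runnable sketch of the fit/predict interface shown above; the dataset and estimator (iris, logistic regression) are editorial assumptions, not from the talk.

    # Minimal sketch of scikit-learn's fit/predict interface.
    # The dataset and estimator are illustrative choices, not from the talk.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    model = LogisticRegression(max_iter=200)
    model.fit(X, y)              # train the model
    print(model.predict(X[:5]))  # predict on the first five rows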

Slide 5

Slide 5 text

Simple Interfaces
Deploying Functions: Firefly
    $ firefly credit_risk.predict
    http://127.0.0.1:8080/
Distributed Machine Learning: ???

Slide 6

Slide 6 text

Machine Learning - Traditional Workflow
Typical workflow for building an ML model:
• Data Preparation
• Feature Extraction
• Model Training
• Hyperparameter Optimization / Grid Search

Slide 7

Slide 7 text

While that is going on...
https://www.xkcd.com/303/

Slide 8

Slide 8 text

Opportunities
Grid search is one of the most time-consuming parts and has the potential to be parallelized.

Slide 9

Slide 9 text

Parallelization Patterns
• Data parallelism
• Task parallelism
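A minimal sketch contrasting the two patterns using Python's standard multiprocessing module; the functions and data below are illustrative assumptions, not from the slides.

    # Illustrative sketch of the two patterns using the standard library.
    from multiprocessing import Pool

    def normalize(chunk):
        # Data parallelism: the same operation applied to different chunks of data.
        m = max(chunk)
        return [x / m for x in chunk]

    def train_with(params):
        # Task parallelism: independent tasks, e.g. one grid-search candidate each.
        return {"params": params, "score": sum(params.values())}  # stand-in for a real training run

    if __name__ == "__main__":
        with Pool(4) as pool:
            chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
            print(pool.map(normalize, chunks))   # data-parallel
            grid = [{"depth": d, "lr": l} for d in (2, 4) for l in (1, 2)]
            print(pool.map(train_with, grid))    # task-parallel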

Slide 10

Slide 10 text

Data Parallelism

Slide 11

Slide 11 text

Data Parallelism - Examples
• GPU computation
• OpenMP, MPI
• Spark ML algorithms
• Map-Reduce

Slide 12

Slide 12 text

Task Parallelism

Slide 13

Slide 13 text

Task Parallelism - Examples
• Grid Search
• Task queues like Celery

Slide 14

Slide 14 text

How to Parallelize Grid Search?
The scikit-learn library for Python has an out-of-the-box solution to parallelize this.
    grid_search = GridSearchCV(model, parameters, n_jobs=4)
But it is limited to a single computer! Can we run it on multiple computers?
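A runnable sketch of the single-machine parallel grid search referred to above; the estimator, parameter grid, and dataset are illustrative assumptions.

    # Minimal runnable sketch of single-machine grid search with scikit-learn.
    # The estimator, parameter grid, and dataset are illustrative assumptions.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)

    parameters = {"n_estimators": [50, 100], "max_depth": [2, 3]}
    grid_search = GridSearchCV(
        GradientBoostingClassifier(),
        parameters,
        n_jobs=4,   # parallelize across 4 local processes, still one machine
        cv=3,
    )
    grid_search.fit(X, y)
    print(grid_search.best_params_, grid_search.best_score_)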

Slide 15

Slide 15 text

Distributed Machine Learning
Run ML algorithms on a cluster.
Advantages:
• Horizontal scaling

Slide 16

Slide 16 text

Available Solutions for Distributed ML
• Apache Spark
• Dask

Slide 17

Slide 17 text

Challenges
• Requires setting up and managing a cluster of computers
• Non-trivial task for a data scientist to manage
• How to start the cluster on demand and shut it down when unused
Is it possible to have a simple interface that a data scientist can manage on his/her own?

Slide 18

Slide 18 text

Our Experiments

Slide 19

Slide 19 text

Compute Platform
We've built a compute platform for running jobs in the cloud.
    $ run-job python model_training.py
    created new job 9845a3bd4.

Slide 20

Slide 20 text

Behind the Scenes
• Picks an available instance in the cloud (or starts a new one)
• Runs a Docker container with the appropriate image
• Exposes the required ports and sets up a URL endpoint to access it
• Manages a shared disk across all the jobs

Slide 21

Slide 21 text

The Magic
Running on a 16-core instance is just a flag away.
    $ run-job -i C16 python model_training.py
    created new job 8f40f02f.

Slide 22

Slide 22 text


Slide 23

Slide 23 text

Distributed Machine Learning
We've implemented a multiprocessing.Pool-like interface that runs on top of our compute platform.
    pool = DistributedPool(n=5)
    results = pool.map(square, range(100))
    pool.close()
Starts 5 distributed jobs to share the work.
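For comparison, here is the standard-library multiprocessing.Pool usage that this interface mirrors; DistributedPool is the platform's own API as shown on the slide, and the snippet below only illustrates the familiar local equivalent.

    # Local equivalent using the standard library: the DistributedPool above
    # follows the same map/close interface, but spreads work across cloud jobs.
    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == "__main__":
        with Pool(5) as pool:                       # 5 local worker processes
            results = pool.map(square, range(100))  # same map() call as DistributedPool
        print(results[:10])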

Slide 24

Slide 24 text

Scikit-learn Integration
Extended the distributed interface to support scikit-learn.
    from distributed_scikit import GridSearchCV
    grid_search = GridSearchCV(
        GradientBoostingRegressor(),
        parameters,
        n_jobs=16)
A distributed pool with n_jobs workers will be created to distribute the tasks.
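A hedged usage sketch, assuming the drop-in class follows scikit-learn's GridSearchCV interface (fit, best_params_); distributed_scikit is the talk's own module rather than a public package, and the dataset and parameter grid below are illustrative.

    # Sketch only: assumes distributed_scikit.GridSearchCV mirrors the
    # scikit-learn GridSearchCV API (fit / best_params_).
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from distributed_scikit import GridSearchCV  # the platform's module from the slide

    X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
    parameters = {"n_estimators": [100, 200], "max_depth": [3, 5], "learning_rate": [0.05, 0.1]}

    grid_search = GridSearchCV(GradientBoostingRegressor(), parameters, n_jobs=16)
    grid_search.fit(X, y)                 # work fans out to 16 distributed jobs
    print(grid_search.best_params_)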

Slide 25

Slide 25 text

Advantages
• Simplicity
• No manual setup required
• Works from the familiar notebook interface
• Option to run on spot instances (without any additional setup)

Slide 26

Slide 26 text

Future Work
• Distributing training of ensemble models
• Distributed deep learning

Slide 27

Slide 27 text

Summary
• With ever-increasing datasets, distributed training will be more effective than single-node approaches
• Abstracting away the complexity of distributed learning can improve time-to-market

Slide 28

Slide 28 text

Questions?