
Distributed Machine Learning - Challenges & Opportunities

Talk presented at Fifth Elephant 2017.


Anand Chitipothu

July 27, 2017


  1. Distributed Machine Learning - Challenges & Opportunities Anand Chitipothu @anandology

  2. Who is Speaking? Anand Chitipothu @anandology • Building a data science platform at @rorodata • Advanced programming courses at @pipalacademy • Worked at Strand Life Sciences and Internet Archive
  3. Motivation • Training ML models often takes a long time • A distributed approach is scalable and effective • The existing tools for distributed training are not simple to use
  4. Simple Interfaces Programming: Python print("Hello world!") Machine Learning: Scikit-learn model.fit(...) model.predict(...)
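The two-call interface on this slide can be seen end-to-end in a minimal sketch (the model, toy data, and values below are illustrative, not from the talk):

```python
from sklearn.linear_model import LinearRegression

# Tiny toy dataset following y = 2x.
X = [[0], [1], [2], [3]]
y = [0, 2, 4, 6]

model = LinearRegression()
model.fit(X, y)              # the entire training step
pred = model.predict([[4]])  # the entire inference step
print(pred[0])               # close to 8.0
```

The same fit/predict pair works unchanged across scikit-learn estimators, which is exactly the simplicity the slide is pointing at.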
  5. Simple Interfaces Deploying Functions: Firefly $ firefly credit_risk.predict Distributed Machine Learning: ???
  6. Machine Learning - Traditional Workflow Typical workflow for building an ML model: • Data Preparation • Feature Extraction • Model Training • Hyperparameter Optimization / Grid Search
  7. While that is going on... https://www.xkcd.com/303/

  8. Opportunities Grid search is one of the most time-consuming parts and has the potential to be parallelized.
  9. Parallelization Patterns • Data parallelism • Task parallelism

  10. Data Parallelism

  11. Data Parallelism - Examples • GPU computation • OpenMP, MPI • Spark ML algorithms • Map-Reduce
  12. Task Parallelism

  13. Task Parallelism - Examples • Grid Search • Task queues like Celery
  14. How to Parallelize Grid Search? The scikit-learn library for Python has an out-of-the-box solution to parallelize this: grid_search = GridSearchCV(model, parameters, n_jobs=4) But it is limited to one computer! Can we run it on multiple computers?
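The slide's one-liner can be fleshed out into a runnable example (the estimator, dataset, and parameter grid here are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

parameters = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}

# n_jobs=4 spreads the grid points over 4 local worker processes --
# parallel, but confined to a single machine.
grid_search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                           parameters, n_jobs=4, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Each (parameter setting, cross-validation fold) pair is an independent task, which is why this parallelizes so cleanly, and why the single-machine limit of `n_jobs` is the natural thing to attack next.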
  15. Distributed Machine Learning Run ML algorithms on a cluster. Advantages: • Horizontal Scaling
  16. Available Solutions for Distributed ML • Apache Spark • Dask

  17. Challenges • Requires setting up and managing a cluster of computers • A non-trivial task for a data scientist to manage • How to start it on demand and shut it down when unused? Is it possible to have a simple interface that a data scientist can manage on his/her own?
  18. Our Experiments

  19. Compute Platform We've built a compute platform for running jobs in the cloud. $ run-job python model_training.py created new job 9845a3bd4.
  20. Behind the Scenes • Picks an available instance in the cloud (or starts a new one) • Runs a Docker container with the appropriate image • Exposes the required ports and sets up a URL endpoint to access it • Manages a shared disk across all the jobs
  21. The Magic Running on a 16-core instance is just a flag away. $ run-job -i C16 python model_training.py created new job 8f40f02f.
  22.

  23. Distributed Machine Learning We've implemented a multiprocessing.Pool-like interface that runs on top of our compute platform. pool = DistributedPool(n=5) results = pool.map(square, range(100)) pool.close() Starts 5 distributed jobs to share the work.
  24. Scikit-learn Integration Extended the distributed interface to support scikit-learn. from distributed_scikit import GridSearchCV grid_search = GridSearchCV( GradientBoostingRegressor(), parameters, n_jobs=16) A distributed pool with n_jobs workers will be created to distribute the tasks.
  25. Advantages • Simplicity • No manual setup required • Works from the familiar notebook interface • Option to run on spot instances (without any additional setup)
  26. Future Work • Distributed training of ensemble models • Distributed deep learning
  27. Summary • With ever-increasing datasets, distributed training will be more effective than single-node approaches • Abstracting away the complexity of distributed learning can improve time-to-market
  28. Questions?