
Thomas Peters
September 22, 2016

Write what you mean: Scaling up machine learning algorithms directly from source code

Slides from the Machine Learning Conference (MLConf), Atlanta, GA September 2016


Transcript

  1. Write What You Mean. Thomas Peters, engineer. Scaling up machine learning algorithms directly from source code.
  2. Example: Nearest Neighbor

     def sq_distance(p1, p2):
         return sum((c[0]-c[1])**2 for c in zip(p1, p2))

     def index_of_nearest(q, points):
         return min((sq_distance(q, p), i) for i, p in enumerate(points))[1]

     def nearest_center(points, centers):
         return [index_of_nearest(p, centers) for p in points]
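To make the slide's helpers concrete, here is a minimal usage sketch; the sample points and centers are invented for illustration.

```python
def sq_distance(p1, p2):
    # Squared Euclidean distance between two coordinate sequences.
    return sum((c[0] - c[1]) ** 2 for c in zip(p1, p2))

def index_of_nearest(q, points):
    # Index of the point closest to q (min over (distance, index) pairs).
    return min((sq_distance(q, p), i) for i, p in enumerate(points))[1]

def nearest_center(points, centers):
    return [index_of_nearest(p, centers) for p in points]

# Hypothetical sample data: two centers, three query points.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, -0.5)]
print(nearest_center(points, centers))  # → [0, 1, 0]
```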
  3. Q: Why should I have to rewrite my program as my dataset gets larger? A: You shouldn’t have to!
  4. Pyfora: Automatically scalable Python for large-scale machine learning and data science. http://github.com/ufora/ufora http://docs.pyfora.com/
  5. Goals of Pyfora
     • Provide identical semantics to regular Python
     • Easily use hundreds of CPUs / GPUs and TBs of RAM
     • Scale by analyzing source code, not by calling libraries. No more complex frameworks or APIs.
  6. Approaches to Scaling
     APIs and Frameworks
     • Library of functions for specific patterns of parallelism
     • Programmer (re)writes program to fit the pattern.
  7. Approaches to Scaling
     APIs and Frameworks
     • Library of functions for specific patterns of parallelism
     • Programmer (re)writes program to fit the pattern.
     Programming Language
     • Semantics of calculation entirely defined by source code
     • Compiler and runtime are responsible for efficient execution.
  8. Approaches to Scaling
     APIs and Frameworks: MPI, Hadoop, Spark
     Programming Languages: SQL, CUDA, CILK, Python with Pyfora
  9. API vs. Language
     API
     Pros: more control over performance; easy to integrate lots of different systems.
     Cons: more code; program meaning obscured by implementation details; hard to debug when something goes wrong.
     Language
     Pros: simpler code; much more expressive; programs are easier to understand; cleaner failure modes; much deeper optimizations are possible.
     Cons: very hard to implement.
  10. With a strong implementation, the “language approach” should win
     • Any pattern that can be implemented in an API can be recognized in a language.
     • Language-based systems have the entire source code, so they have more to work with than API-based systems.
     • Can measure behavior at runtime and use this to optimize.
  11. Example: Nearest Neighbors

     def sq_distance(p1, p2):
         return sum((c[0]-c[1])**2 for c in zip(p1, p2))

     def index_of_nearest(q, points):
         return min((sq_distance(q, p), i) for i, p in enumerate(points))[1]

     def nearest_center(points, centers):
         return [index_of_nearest(p, centers) for p in points]
  12. How can we make this fast?
     • JIT compile to make single-threaded code fast
     • Parallelize to use multiple CPUs
     • Distribute data to use multiple machines
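The "parallelize across CPUs" lever can be sketched with the standard library alone; note that the chunk boundaries below are hand-picked, which is exactly the decision Pyfora aims to automate. The sample data and the two-chunk split are invented for illustration.

```python
from multiprocessing import Pool

def sq_distance(p1, p2):
    return sum((c[0] - c[1]) ** 2 for c in zip(p1, p2))

def index_of_nearest(q, points):
    return min((sq_distance(q, p), i) for i, p in enumerate(points))[1]

# Small "centers" set, replicated to every worker (cf. slide 14).
CENTERS = [(0.0, 0.0), (10.0, 10.0)]

def nearest_center_chunk(chunk):
    # Each worker labels one slice of the points.
    return [index_of_nearest(p, CENTERS) for p in chunk]

if __name__ == "__main__":
    points = [(1.0, 2.0), (9.0, 8.0), (0.5, -0.5), (11.0, 12.0)]
    chunks = [points[:2], points[2:]]  # static, hand-chosen split
    with Pool(2) as pool:
        labels = [l for part in pool.map(nearest_center_chunk, chunks) for l in part]
    print(labels)  # → [0, 1, 0, 1]
```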
  13. Why is this tricky? Optimal behavior depends on the sizes and shapes of the data. [Diagram: “Centers” and “Points” sets] If both sets are small, don’t bother to distribute.
  14. Why is this tricky? [Diagram: tall, thin “Points” matrix beside a small “Centers” set] If “points” is tall and thin, it’s natural to split it across many machines and replicate “centers”.
  15. Why is this tricky? [Diagram: wide “Points” and “Centers” matrices] If “points” and “centers” are really wide (say, they’re images), it would be better to split them vertically, compute distances between all pairs in slices, and merge them.
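The vertical-split strategy works because a squared distance over very wide rows decomposes into per-slice partial sums that can be computed independently and merged. A minimal sketch (the six-element vectors and slice boundaries are made up):

```python
def sq_distance(p1, p2):
    return sum((a - b) ** 2 for a, b in zip(p1, p2))

def partial_sq_distance(p1, p2, lo, hi):
    # Distance contribution from one vertical slice of coordinates.
    return sum((a - b) ** 2 for a, b in zip(p1[lo:hi], p2[lo:hi]))

p, q = [1.0] * 6, [0.0] * 6
slices = [(0, 2), (2, 4), (4, 6)]  # each slice could live on a different machine
merged = sum(partial_sq_distance(p, q, lo, hi) for lo, hi in slices)
assert merged == sq_distance(p, q)  # partial sums merge to the full distance
print(merged)  # → 6.0
```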
  16. Why is this tricky? You will end up writing totally different code for each of these situations. The source code contains the necessary structure. The key is to defer decisions to runtime, when the system can actually see how big the datasets are.
  17. Getting it right is valuable
     • Much less work for the programmer
     • Code is more readable
     • Code becomes more reusable
     • Use the language the way it was intended: for instance, in Python, the “row” objects can be anything that looks like a list.
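The last point is ordinary Python duck typing: `sq_distance` only needs rows that `zip` can iterate, so lists, tuples, and even ranges all work unchanged. A small illustration:

```python
def sq_distance(p1, p2):
    return sum((a - b) ** 2 for a, b in zip(p1, p2))

# Rows only need to be iterable: lists, tuples, and ranges all work.
print(sq_distance([0, 0, 0], (1, 1, 1)))  # → 3
print(sq_distance(range(3), [0, 0, 0]))   # → 5  (0² + 1² + 2²)
```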
  18. Problem: Wrong-sized chunking
     • API-based frameworks require you to explicitly partition your data into chunks.
     • If you are running a complex task, the runtime may be really long for a small subset of chunks. You’ll end up waiting a long time for that last mapper.
     • If your tasks allocate memory, you can run out of RAM and crash.
  19. Solution: Dynamically rebalance. [Diagram: adaptive parallelism, splitting sum(f(x) for x in v) across CORE #1 through CORE #4]
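The splitting idea behind `sum(f(x) for x in v)` can be sketched in a single process: a task is recursively halved until the pieces are small, so a scheduler could hand each half to an idle core instead of committing to fixed chunks up front. The threshold value here is made up.

```python
def split_sum(f, v, threshold=4):
    # Below the threshold, just do the work; otherwise split in half.
    # In a real runtime each half would be a separately schedulable task
    # that an idle core could steal, enabling dynamic rebalancing.
    if len(v) <= threshold:
        return sum(f(x) for x in v)
    mid = len(v) // 2
    return split_sum(f, v[:mid], threshold) + split_sum(f, v[mid:], threshold)

print(split_sum(lambda x: x * x, list(range(10))))  # → 285
```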
  20. Solution: Dynamically rebalance
     • This requires being able to interrupt running tasks as they’re executing.
     • Adding support for this to an API makes it much more complicated to use.
     • This is much easier to do with compiler support.
  21. Problem: Nested parallelism. Example:
     • You have an iterative model.
     • There is lots of parallelism in each iteration.
     • But you also want to search over many hyperparameters.
     With API-based approaches, you have to manage this yourself, either by constructing a graph of subtasks, or by figuring out how to flatten your workload into something that can be map-reduced.
  22. Solution: infer parallelism from source

     def fit_model(learning_rate, model, params):
         while not model.finished(params):
             params = model.update_params(learning_rate, params)
         return params

     # sources of parallelism:
     fits = [[fit_model(rate, model, params) for rate in learning_rates]
             for model in models]
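To make the nested-parallelism sketch runnable, here is a toy stand-in for the model object; the `ToyModel` class, its convergence rule, and the starting value of 1.0 are all invented, and only serve to show the two independent loop levels a source-analyzing runtime could parallelize.

```python
class ToyModel:
    # Hypothetical stand-in: "fitting" just shrinks params toward zero.
    def finished(self, params):
        return abs(params) < 1e-3

    def update_params(self, learning_rate, params):
        return params * (1.0 - learning_rate)

def fit_model(learning_rate, model, params):
    while not model.finished(params):
        params = model.update_params(learning_rate, params)
    return params

learning_rates = [0.1, 0.5]
models = [ToyModel(), ToyModel()]

# The outer grid (models x learning rates) and the work inside each
# fit_model call are independent -- nested parallelism a runtime can infer.
fits = [[fit_model(rate, model, 1.0) for rate in learning_rates]
        for model in models]
print(len(fits), len(fits[0]))  # → 2 2
```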
  23. So how does Pyfora work?
     • Operate on a subset of Python that restricts mutability (but we’re relaxing this).
     • Built a JIT compiler that can “pop” code back into the interpreter.
     • Can move sets of stack frames from one machine to another.
     • Can rewrite selected stack frames to use futures if there is parallelism to exploit.
     • Carefully track what data a thread is using.
     • Dynamically schedule threads and data on machines to optimize for cache locality.
  24. import pyfora

     executor = pyfora.connect("http://...")
     data = executor.importS3Dataset("myBucket", "myData.csv")

     def calibrate(dataframe, params):
         # some complex model with loops and parallelism
         ...

     with executor.remotely:
         dataframe = pyfora.read_csv(data)
         models = [calibrate(dataframe, p) for p in params]

     print(models.toLocal().result())
  25. What are we working on?
     • Relaxing immutability assumptions.
     • Compiler optimizations (immutable Python is a rich source of these).
     • Automatic compilation and scheduling of data and compute on GPUs.
  26. Thanks!
     • Check out the repo: github.com/ufora/ufora
     • Read the docs: docs.pyfora.com
     • Subscribe to “This Week in Data” (see top of ufora.com)
     • Email me: [email protected]