Dask - Parallelizing NumPy/Pandas Through Task Scheduling

Jim Crist
Continuum Analytics // github.com/jcrist // @jiminy_crist
PyData NYC 2015

A Motivating Example

Ocean Temperature Data
• Daily mean ocean temperature every 1/4 degree
• 720 x 1440 array every day
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

One year’s worth

    from netCDF4 import Dataset
    import matplotlib.pyplot as plt
    from numpy import flipud

    data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
    year_mean = data[:].mean(axis=0)
    plt.imshow(flipud(year_mean), cmap="RdBu_r")
    plt.title("Average Global Ocean Temperature, 2015")

36 years’ worth

    $ ls
    sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
    sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
    sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
    sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
    sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
    ...                      ...                      ...
    $ du -h
    15G     .

720 x 1440 x 12341 x 4 = 51 GB uncompressed!

Can’t just load this all into NumPy… what now?

Solution: blocked algorithms!

Blocked Algorithms

Blocked mean

    import h5py
    import numpy as np

    x = h5py.File('myfile.hdf5')['x']          # Trillion element array on disk

    sums = []
    counts = []
    for i in range(1000000):                   # One million times
        chunk = x[1000000*i: 1000000*(i+1)]    # Pull out chunk
        sums.append(np.sum(chunk))             # Sum chunk
        counts.append(len(chunk))              # Count chunk

    result = sum(sums) / sum(counts)           # Aggregate results
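The blocked mean can be checked against the direct one on a small in-memory array. This is a minimal sketch: the array length, the chunk size, and the helper name `blocked_mean` are illustrative assumptions, not from the talk, and a real out-of-core run would slice an on-disk dataset instead.

```python
import numpy as np

def blocked_mean(x, chunk_size):
    """Mean of a 1-D array computed one chunk at a time (hypothetical helper)."""
    sums = []
    counts = []
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]  # only this slice is needed at once
        sums.append(chunk.sum())             # partial sum for the chunk
        counts.append(len(chunk))            # number of elements in the chunk
    return sum(sums) / sum(counts)           # aggregate the partial results

# Length is deliberately not a multiple of the chunk size
x = np.arange(1_000_003, dtype="f8")
result = blocked_mean(x, chunk_size=10_000)
```

Each (sum, count) pair depends only on its own chunk, which is why the per-chunk work can run in parallel and why memory use is bounded by the chunk size.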

Blocked Algorithms

Blocked algorithms allow for
• parallelism
• lower RAM usage

The trick is figuring out how to break the computation into blocks. This is where dask comes in.

Dask is:
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python

dask.array
• Out-of-core, parallel, n-dimensional array library
• Copies the numpy interface
• Arithmetic: +, *, …
• Reductions: mean, max, …
• Slicing: x[10:, 100:50:-2]
• Fancy indexing: x[:, [3, 1, 2]]
• Some linear algebra: tensordot, qr, svd, …
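A few of the operations listed above, sketched on a tiny array so the results can be compared with NumPy. The array contents and chunk shape are arbitrary choices for illustration.

```python
import numpy as np
import dask.array as da

x_np = np.arange(24, dtype="f8").reshape(4, 6)
x = da.from_array(x_np, chunks=(2, 3))   # 2 x 2 grid of (2, 3) blocks

# Each expression only builds a lazy task graph; nothing runs until .compute()
total = (x + 1).sum()          # arithmetic followed by a reduction
col_max = x.max(axis=0)        # reduction along one axis
sliced = x[1:, 4:1:-2]         # slicing, including a negative step
fancy = x[:, [3, 1, 2]]        # fancy indexing with a list of columns

total_val = total.compute()
col_max_val = col_max.compute()
sliced_val = sliced.compute()
fancy_val = fancy.compute()
```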

Task Schedulers

But what about the GIL???

Most Python Programs…

    import numpy as np
    import dask.array as da

    x = np.arange(9000000).reshape((3000, 3000))
    a = da.from_array(x, chunks=(750, 750))
    a.dot(a.T).sum(axis=0).compute()
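The same expression can be handed to different schedulers. The sketch below uses a smaller array and the `scheduler=` keyword of `compute()`, which is how current Dask releases select a scheduler; the talk itself may have used an older spelling, so treat the keyword as an assumption about the installed version.

```python
import numpy as np
import dask.array as da

x = np.arange(90_000, dtype="f8").reshape((300, 300))
a = da.from_array(x, chunks=(75, 75))            # 4 x 4 grid of blocks
expr = a.dot(a.T).sum(axis=0)                    # lazy: only a task graph so far

threaded = expr.compute(scheduler="threads")     # run tasks on a thread pool
serial = expr.compute(scheduler="synchronous")   # run tasks one at a time, useful for debugging
expected = x.dot(x.T).sum(axis=0)                # plain NumPy reference
```

The graph is identical in both runs; only the component that executes it changes.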

Out-of-core arrays

    import dask.array as da
    from netCDF4 import Dataset
    from glob import glob
    from numpy import flipud
    import matplotlib.pyplot as plt

    files = sorted(glob('*.nc'))
    data = [Dataset(f).variables['sst'] for f in files]
    arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
    x = da.concatenate(arrs, axis=0)
    full_mean = x.mean(axis=0)

    plt.imshow(flipud(full_mean), cmap='RdBu_r')
    plt.title('Average Global Ocean Temperature, 1981-2015')
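The concatenate-then-reduce pattern can be tried without the netCDF files. In this sketch, small random arrays stand in for the per-year `sst` variables; the shapes, chunk sizes, and seed are made up for illustration.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
# Three fake "years" of (day, lat, lon) data standing in for the netCDF variables
years = [rng.random((10, 8, 16)) for _ in range(3)]

arrs = [da.from_array(y, chunks=(5, 4, 4)) for y in years]
x = da.concatenate(arrs, axis=0)        # shape (30, 8, 16), still lazy
full_mean = x.mean(axis=0).compute()    # NumPy array of shape (8, 16)
```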

Out-of-core arrays

Parallel, Out-of-core SVD

Parallel, Out-of-core SVD Randomized

dask.dataframe
• Out-of-core, blocked parallel DataFrame
• Mirrors pandas interface

dask.dataframe
• Elementwise operations: df.x + df.y
• Row-wise selections: df[df.x > 0]
• Aggregations: df.x.max()
• groupby-aggregate: df.groupby(df.x).y.max()
• Value counts: df.x.value_counts()
• Resampling: df.x.resample('d', how='mean')
• Expanding window: df.x.cumsum()
• Joins: dd.merge(df1, df2, left_index=True, right_index=True)

dask.dataframe
• From csvs: dd.read_csv('*.csv')
• From pandas: dd.from_pandas(some_pandas_object)
• From castra: dd.from_castra('path_to_castra')

Out-of-core dataframes
• Yearly csvs of all American flights 1987-2008
• Contains information on times, airlines, locations, etc…
• Roughly 121 million rows, 11GB of data
• http://www.transtats.bts.gov/Fields.asp?Table_ID=236

df.depdelay.groupby(df.origin).mean().nlargest(10).compute()

dask.bag
• Out-of-core, unordered list
• toolz + multiprocessing
• map, filter, reduce, groupby, take, …
• Good for log files, json blobs, etc…

Example - Reddit Data
http://blaze.pydata.org/blog/2015/09/08/reddit-comments/

    import dask.bag as db

    # Create a bag of tuples of (subreddit, body)
    b = db.from_castra('reddit.castra', columns=['subreddit', 'body'],
                       npartitions=8)

    # Filter out comments not in r/MachineLearning
    matches_subreddit = b.filter(lambda x: x[0] == 'MachineLearning')

    # Convert each comment into a list of words, and concatenate
    words = matches_subreddit.pluck(1).map(to_words).concat()

    # Count the frequencies for each word, and take the top 100
    top_words = words.frequencies().topk(100, key=1).compute()
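The pipeline can be tried on a handful of in-memory tuples. This is a sketch under several assumptions: the comments are made up, `to_words` is a hypothetical tokenizer (the talk does not show its definition), `db.from_sequence` replaces the castra source, and `flatten()` is used as the current Dask name for concatenating the per-comment word lists.

```python
import dask.bag as db

# Made-up (subreddit, body) tuples standing in for the Reddit data
comments = [
    ("MachineLearning", "deep learning learning"),
    ("python", "dask is neat"),
    ("MachineLearning", "learning deep models"),
]

def to_words(body):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return body.lower().split()

b = db.from_sequence(comments, npartitions=2)
words = (b.filter(lambda x: x[0] == "MachineLearning")  # keep one subreddit
          .pluck(1)                                     # take the comment body
          .map(to_words)                                # body -> list of words
          .flatten())                                   # one flat bag of words

# (word, count) pairs, keeping the two most frequent words
top_words = (words.frequencies()
                  .topk(2, key=lambda kv: kv[1])
                  .compute(scheduler="synchronous"))
```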

Example - Reddit Data

    from wordcloud import WordCloud

    # Make a word cloud from the results
    wc = WordCloud()
    wc.generate_from_frequencies(top_words)
    wc.to_image()

• Collections build task graphs
• Schedulers execute task graphs
• Graph specification = uniting interface

Dask Specification
• Dictionary of {name: task}
• Tasks are tuples of (func, args...) (lispy syntax)
• Args can be names, values, or tasks

Python Code

    a = 1
    b = 2
    x = inc(a)
    y = inc(b)
    z = mul(x, y)

Dask Graph

    dsk = {"a": 1,
           "b": 2,
           "x": (inc, "a"),
           "y": (inc, "b"),
           "z": (mul, "x", "y")}
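To make the specification concrete, here is a toy recursive evaluator for such a dictionary. It is an illustration only: `toy_get` is a hypothetical helper, not Dask's scheduler, and `inc` is assumed to add one.

```python
from operator import mul

def inc(x):
    return x + 1

dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}

def toy_get(dsk, arg):
    """Recursively evaluate a name, a task, or a literal value (illustrative only)."""
    if isinstance(arg, tuple) and arg and callable(arg[0]):  # a task: (func, args...)
        func, *args = arg
        return func(*(toy_get(dsk, a) for a in args))
    if isinstance(arg, str) and arg in dsk:                  # a name: look it up
        return toy_get(dsk, dsk[arg])
    return arg                                               # a plain value

result = toy_get(dsk, "z")   # mul(inc(1), inc(2))
```

A real scheduler does more than this sketch: it computes each task once, tracks dependencies, and runs independent tasks in parallel.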

Dask collections fit many problems… but not everything.

Can create graphs directly

    def load(filename):
        ...

    def clean(data):
        ...

    def analyze(sequence_of_data):
        ...

    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    dsk = {'load-1': (load, 'myfile.a.data'),
           'load-2': (load, 'myfile.b.data'),
           'load-3': (load, 'myfile.c.data'),
           'clean-1': (clean, 'load-1'),
           'clean-2': (clean, 'load-2'),
           'clean-3': (clean, 'load-3'),
           'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
           'store': (store, 'analyze')}
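A hand-written graph like this can be passed straight to a scheduler. The sketch below uses `dask.threaded.get` with stand-in function bodies; the bodies are invented so the example runs without files, and the `store` step is left out.

```python
from dask.threaded import get

def load(filename):
    return [len(filename), 1, 2]         # stand-in for reading a file

def clean(data):
    return [v for v in data if v > 1]    # stand-in for a cleaning step

def analyze(sequence_of_data):
    return sum(sum(d) for d in sequence_of_data)

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2]])}

# Ask the threaded scheduler for one key; it runs whatever that key depends on
result = get(dsk, 'analyze')
```

The two load/clean chains share no dependencies, so the scheduler is free to run them concurrently.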

Or use dask.imperative

    from dask.imperative import do

    @do
    def load(filename):
        ...

    @do
    def clean(data):
        ...

    @do
    def analyze(sequence_of_data):
        ...

    @do
    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
    loaded = [load(f) for f in files]
    cleaned = [clean(i) for i in loaded]
    analyzed = analyze(cleaned)
    stored = store(analyzed)
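In later Dask releases the `do` decorator from `dask.imperative` is spelled `dask.delayed`. A minimal runnable sketch of the same workflow in that newer spelling follows; the function bodies are stand-ins invented for illustration, and the `store` step is omitted.

```python
from dask import delayed

@delayed
def load(filename):
    return filename.upper()              # stand-in for reading a file

@delayed
def clean(data):
    return data.replace('.DATA', '')     # stand-in for a cleaning step

@delayed
def analyze(sequence_of_data):
    return sorted(sequence_of_data)      # stand-in for an analysis step

files = ['myfile.b.data', 'myfile.a.data']
loaded = [load(f) for f in files]        # nothing has run yet
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)              # a lazy object wrapping the whole graph

result = analyzed.compute()              # execute the graph
```

Calling the decorated functions only records tasks; the work happens at `compute()`.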

Custom Workflows! Can use dask.imperative to compose custom workflows easily!

Distributed???

“For workloads that are processing multi-gigabytes rather than terabyte+ scale, a big-memory server may well provide better performance per dollar than a cluster.”

“Nobody ever got fired for using Hadoop on a cluster”
http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf

Not yet, but we’re working on it… http://distributed.readthedocs.org/en/latest/

Takeaways
• Python can still handle large data using blocked algorithms
• Dask collections form task graphs expressing these algorithms
• Dask schedulers execute these graphs in parallel
• Dask graphs can be directly created for custom pipelines

Further Information
• Docs: http://dask.pydata.org/
• Examples: https://github.com/blaze/dask-examples
  Try them online with Binder!
• Blaze blog: http://blaze.pydata.org/
• Chat: https://gitter.im/blaze/dask
• Github: https://github.com/blaze/dask

Questions? http://dask.pydata.org
