Slide 1

Dask - Parallelizing NumPy/Pandas Through Task Scheduling
Jim Crist, Continuum Analytics // github.com/jcrist // @jiminy_crist
PyData NYC 2015

Slide 2

A Motivating Example

Slide 3

Ocean Temperature Data
• Daily mean ocean temperature every 1/4 degree
• 720 x 1440 array every day
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

Slide 4

One year's worth

from netCDF4 import Dataset
import matplotlib.pyplot as plt
from numpy import flipud

data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
year_mean = data[:].mean(axis=0)

plt.imshow(flipud(year_mean), cmap="RdBu_r")
plt.title("Average Global Ocean Temperature, 2015")

Slide 5

36 years' worth

$ ls
sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
...                      ...                      ...

$ du -h
15G    .

720 x 1440 x 12341 x 4 bytes = 51 GB uncompressed!

Slide 6

Can’t just load this all into NumPy… what now?

Slide 7

Solution: blocked algorithms!

Slide 8

Blocked Algorithms: blocked mean

import h5py
import numpy as np

x = h5py.File('myfile.hdf5')['x']        # Trillion element array on disk

sums = []
counts = []
for i in range(1000000):                 # One million chunks
    chunk = x[1000000*i: 1000000*(i+1)]  # Pull out chunk
    sums.append(np.sum(chunk))           # Sum chunk
    counts.append(len(chunk))            # Count chunk

result = sum(sums) / sum(counts)         # Aggregate results

Slide 9

Blocked Algorithms

Slide 10

Blocked algorithms allow for
• parallelism
• lower RAM usage

The trick is figuring out how to break the computation into blocks. This is where Dask comes in.

Slide 11

Dask is:
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python

Slide 12

Dask

Slide 13

dask.array
• Out-of-core, parallel, n-dimensional array library
• Copies the numpy interface
• Arithmetic: +, *, …
• Reductions: mean, max, …
• Slicing: x[10:, 100:50:-2]
• Fancy indexing: x[:, [3, 1, 2]]
• Some linear algebra: tensordot, qr, svd, …
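
To make these bullets concrete, here is a minimal sketch (not from the slides) that redoes the earlier hand-written blocked mean with dask.array. It assumes the same 'myfile.hdf5' file with a one-dimensional dataset 'x'.

import h5py
import dask.array as da

# Sketch only: wrap the on-disk array in million-element blocks and take a blocked mean.
x = h5py.File('myfile.hdf5')['x']        # trillion element array on disk
d = da.from_array(x, chunks=(1000000,))  # break it into 1,000,000-element blocks
result = d.mean().compute()              # build the task graph, then execute it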

Slide 14

Demo

Slide 15

Task Schedulers

Slide 16

Task Schedulers

Slide 17

But what about the GIL???

Slide 18

Most Python Programs…

Slide 19

import numpy as np
import dask.array as da

x = np.arange(9000000).reshape((3000, 3000))
a = da.from_array(x, chunks=(750, 750))
a.dot(a.T).sum(axis=0).compute()
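
As a hedged aside (not on the slide): NumPy's compiled routines release the GIL, so dask's threaded scheduler gets real parallelism out of this computation even within one process. The explicit get= keyword below reflects the dask API of this era and is an assumption.

import numpy as np
import dask.array as da
import dask.threaded

# Sketch: the same computation, explicitly routed through the threaded scheduler.
# NumPy releases the GIL inside its C routines, so the blocks run in parallel threads.
x = np.arange(9000000).reshape((3000, 3000))
a = da.from_array(x, chunks=(750, 750))
result = a.dot(a.T).sum(axis=0).compute(get=dask.threaded.get)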

Slide 20

Out-of-core arrays

import dask.array as da
from netCDF4 import Dataset
from glob import glob
from numpy import flipud
import matplotlib.pyplot as plt

files = sorted(glob('*.nc'))
data = [Dataset(f).variables['sst'] for f in files]
arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
x = da.concatenate(arrs, axis=0)

full_mean = x.mean(axis=0)

plt.imshow(flipud(full_mean), cmap='RdBu_r')
plt.title('Average Global Ocean Temperature, 1981-2015')

Slide 21

Out-of-core arrays

Slide 22

Parallel, Out-of-core SVD

Slide 23

Parallel, Out-of-core SVD (Randomized)

Slide 24

dask.dataframe
• Out-of-core, blocked parallel DataFrame
• Mirrors pandas interface

Slide 25

dask.dataframe
• Elementwise operations: df.x + df.y
• Row-wise selections: df[df.x > 0]
• Aggregations: df.x.max()
• groupby-aggregate: df.groupby(df.x).y.max()
• Value counts: df.x.value_counts()
• Resampling: df.x.resample('d', how='mean')
• Expanding window: df.x.cumsum()
• Joins: dd.merge(df1, df2, left_index=True, right_index=True)

Slide 26

dask.dataframe
• From csvs: dd.read_csv('*.csv')
• From pandas: dd.from_pandas(some_pandas_object)
• From castra: dd.from_castra('path_to_castra')
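
A toy sketch (names and data are illustrative, not from the talk) tying the previous two slides together: build a dask.dataframe from an in-memory pandas DataFrame and run a few of the listed operations. Note that from_pandas also takes an npartitions argument.

import pandas as pd
import dask.dataframe as dd

# Illustrative sketch only: a tiny dask.dataframe built from pandas,
# exercising elementwise ops, row selection, and a groupby-aggregate.
pdf = pd.DataFrame({'x': [1, -2, 1, -2, 1, -2],
                    'y': [10, 20, 30, 40, 50, 60]})
df = dd.from_pandas(pdf, npartitions=2)    # split into 2 partitions

print((df.x + df.y).compute())             # elementwise operation
print(df[df.x > 0].compute())              # row-wise selection
print(df.groupby(df.x).y.max().compute())  # groupby-aggregate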

Slide 27

Out-of-core dataframes
• Yearly csvs of all American flights 1987-2008
• Contains information on times, airlines, locations, etc…
• Roughly 121 million rows, 11 GB of data
• http://www.transtats.bts.gov/Fields.asp?Table_ID=236

Slide 28

Demo

Slide 29

df.depdelay.groupby(df.origin).mean().nlargest(10).compute()
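
For context, a hedged sketch of the setup behind this one-liner. It assumes the yearly flight CSVs sit in the working directory and already use the lowercase column names (depdelay, origin) seen above.

import dask.dataframe as dd

# Hedged sketch of the demo's setup; file paths and column names are assumptions.
df = dd.read_csv('*.csv')   # lazily partition the yearly csvs

# mean departure delay per origin airport, ten worst offenders
result = df.depdelay.groupby(df.origin).mean().nlargest(10).compute()
print(result)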

Slide 30

dask.bag
• Out-of-core, unordered list
• toolz + multiprocessing
• map, filter, reduce, groupby, take, …
• Good for log files, json blobs, etc…

Slide 31

Example - Reddit Data
http://blaze.pydata.org/blog/2015/09/08/reddit-comments/

import dask.bag as db

# Create a bag of tuples of (subreddit, body)
b = db.from_castra('reddit.castra',
                   columns=['subreddit', 'body'],
                   npartitions=8)

# Filter out comments not in r/MachineLearning
matches_subreddit = b.filter(lambda x: x[0] == 'MachineLearning')

# Convert each comment into a list of words, and concatenate
words = matches_subreddit.pluck(1).map(to_words).concat()

# Count the frequencies for each word, and take the top 100
top_words = words.frequencies().topk(100, key=1).compute()

Slide 32

Example - Reddit Data

from wordcloud import WordCloud

# Make a word cloud from the results
wc = WordCloud()
wc.generate_from_frequencies(top_words)
wc.to_image()

Slide 33

• Collections build task graphs
• Schedulers execute task graphs
• Graph specification = uniting interface

Slide 34

Dask Specification
• Dictionary of {name: task}
• Tasks are tuples of (func, args...) (lispy syntax)
• Args can be names, values, or tasks

Python Code:

a = 1
b = 2
x = inc(a)
y = inc(b)
z = mul(x, y)

Dask Graph:

dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}
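
A small runnable sketch (the definitions of inc and mul are filled in here for completeness) showing a scheduler executing the graph above; dask.threaded.get is the shared-memory scheduler used by default for dask.array and dask.dataframe.

from dask.threaded import get

# Runnable version of the graph above, executed with dask's threaded scheduler.
def inc(i):
    return i + 1

def mul(x, y):
    return x * y

dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}

print(get(dsk, "z"))  # -> 6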

Slide 35

Dask collections fit many problems… … but not everything.

Slide 36

Can create graphs directly

def load(filename):
    ...

def clean(data):
    ...

def analyze(sequence_of_data):
    ...

def store(result):
    with open(..., 'w') as f:
        f.write(result)

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

Slide 37

Or use dask.imperative

from dask.imperative import do

@do
def load(filename):
    ...

@do
def clean(data):
    ...

@do
def analyze(sequence_of_data):
    ...

@do
def store(result):
    with open(..., 'w') as f:
        f.write(result)

files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
loaded = [load(f) for f in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
stored = store(analyzed)
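
A self-contained toy version of the same lazy pattern (inc and add are illustrative names, not from the talk), assuming the dask.imperative API of this release; calling .compute() on the final lazy value builds the graph and hands it to a scheduler.

from dask.imperative import do

# Illustrative sketch only: @do makes functions lazy; .compute() triggers execution.
@do
def inc(i):
    return i + 1

@do
def add(a, b):
    return a + b

total = add(inc(1), inc(2))   # no work done yet, just a task graph
print(total.compute())        # -> 5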

Slide 38

Custom Workflows! Can use dask.imperative to compose custom workflows easily!

Slide 39

Distributed???

Slide 40

“For workloads that are processing multi-gigabytes rather than terabyte+ scale, a big-memory server may well provide better performance per dollar than a cluster.”

“Nobody ever got fired for using Hadoop on a cluster”
http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf

Slide 41

Not yet, but we’re working on it… http://distributed.readthedocs.org/en/latest/

Slide 42

Takeaways
• Python can still handle large data using blocked algorithms
• Dask collections form task graphs expressing these algorithms
• Dask schedulers execute these graphs in parallel
• Dask graphs can be directly created for custom pipelines

Slide 43

Further Information
• Docs: http://dask.pydata.org/
• Examples: https://github.com/blaze/dask-examples
  • Try them online with Binder!
• Blaze blog: http://blaze.pydata.org/
• Chat: https://gitter.im/blaze/dask
• Github: https://github.com/blaze/dask

Slide 44

Questions?
http://dask.pydata.org