Slide 1


Parallel and Out-of-Core Python with Dask
Jim Crist, Continuum Analytics
github.com/jcrist // @jiminy_crist
PyData at Strata NYC 2015

Slide 2


A Motivating Example

Slide 3


Ocean Temperature Data
• Daily mean ocean temperature at 1/4-degree resolution
• One 720 x 1440 array per day
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

Slide 4


One year's worth

    from netCDF4 import Dataset
    import matplotlib.pyplot as plt
    from numpy import flipud

    data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
    year_mean = data[:].mean(axis=0)

    plt.imshow(flipud(year_mean), cmap="RdBu_r")
    plt.title("Average Global Ocean Temperature, 2015")

Slide 5


36 years' worth

    $ ls
    sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
    sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
    sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
    sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
    sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
    ...                      ...                      ...
    $ du -h
    15G    .

720 x 1440 x 12341 x 4 bytes = 51 GB uncompressed!

Slide 6


Can't just load this all into NumPy… what now?

Slide 7


Solution: blocked algorithms!

Slide 8


Blocked Algorithms

Blocked mean:

    import h5py
    import numpy as np

    x = h5py.File('myfile.hdf5')['x']          # Trillion element array on disk

    sums = []
    counts = []
    for i in range(1000000):                   # One million times
        chunk = x[1000000*i: 1000000*(i+1)]    # Pull out chunk
        sums.append(np.sum(chunk))             # Sum chunk
        counts.append(len(chunk))              # Count chunk

    result = sum(sums) / sum(counts)           # Aggregate results

Slide 9


Blocked Algorithms

Slide 10


Blocked algorithms allow for:
• parallelism
• lower RAM usage

The trick is figuring out how to break the computation into blocks. This is where Dask comes in.
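As a concrete sketch of the blocked-mean pattern above, here it is on a plain in-memory NumPy array (the array and chunk sizes are invented here, standing in for the on-disk data):

```python
import numpy as np

x = np.arange(10_000, dtype="float64")  # stand-in for a large on-disk array
chunk_size = 1_000

sums = []
counts = []
for i in range(0, len(x), chunk_size):
    chunk = x[i:i + chunk_size]   # pull out one block
    sums.append(chunk.sum())      # sum the block
    counts.append(len(chunk))     # count the block

result = sum(sums) / sum(counts)  # aggregate the partial results
print(result)                     # → 4999.5, matching x.mean()
```

Each block fits comfortably in memory, and the per-block sums and counts are independent of one another, which is exactly what makes the loop parallelizable.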

Slide 11


Dask is:
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python

Slide 12


Dask

Slide 13


dask.array
• Out-of-core, parallel, n-dimensional array library
• Mirrors the NumPy interface
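A minimal sketch of that mirrored interface (the shape and chunk sizes are invented for illustration):

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(250, 250))  # 16 blocks, each 250 x 250
y = (x + x.T).mean(axis=0)                    # same expressions as in NumPy
result = y.compute()                          # triggers the blocked computation
print(result[:3])                             # → [2. 2. 2.]
```

Until `compute()` is called, `y` is only a task graph describing the blocked computation; no block is ever materialized until the scheduler runs it.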

Slide 14


Demo

Slide 15


Task Schedulers

Slide 16


Out-of-core arrays

    import dask.array as da
    from netCDF4 import Dataset
    from glob import glob
    from numpy import flipud
    import matplotlib.pyplot as plt

    files = sorted(glob('*.nc'))
    data = [Dataset(f).variables['sst'] for f in files]
    arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
    x = da.concatenate(arrs, axis=0)

    full_mean = x.mean(axis=0)

    plt.imshow(flipud(full_mean.compute()), cmap='RdBu_r')
    plt.title('Average Global Ocean Temperature, 1981-2015')

Slide 17


Out-of-core arrays

Slide 18


dask.dataframe
• Out-of-core, blocked, parallel DataFrame
• Mirrors the pandas interface

Slide 19


Out-of-core dataframes
• Yearly CSVs of all American flights, 1987-2008
• Contains information on times, airlines, locations, etc.
• Roughly 121 million rows, 11 GB of data
• http://www.transtats.bts.gov/Fields.asp?Table_ID=236

Slide 20


Demo

Slide 21


• Collections build task graphs
• Schedulers execute task graphs
• Graph specification = uniting interface

Slide 22


Dask Specification
• Dictionary of {name: task}
• Tasks are tuples of (func, args...) (lispy syntax)
• Args can be names, values, or tasks

Python Code:

    a = 1
    b = 2
    x = inc(a)
    y = inc(b)
    z = mul(x, y)

Dask Graph:

    dsk = {"a": 1,
           "b": 2,
           "x": (inc, "a"),
           "y": (inc, "b"),
           "z": (mul, "x", "y")}
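Such a dictionary can be handed straight to one of dask's schedulers. A runnable sketch, with `inc` and `mul` defined by hand:

```python
from dask.threaded import get  # the threaded scheduler's entry point

def inc(i):
    return i + 1

def mul(a, b):
    return a * b

dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}

print(get(dsk, "z"))  # → 6: inc(1) * inc(2)
```

Because the graph is just a dict, any scheduler that understands this spec can run it; swapping the threaded scheduler for another changes how, not what, gets computed.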

Slide 23


Dask collections fit many problems… … but not everything.

Slide 24


Can create graphs directly

    from dask.imperative import do   # `do` wraps a function call as a task

    @do
    def load(filename):
        ...

    @do
    def clean(data):
        ...

    @do
    def analyze(sequence_of_data):
        ...

    @do
    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
    loaded = [load(f) for f in files]
    cleaned = [clean(i) for i in loaded]
    analyzed = analyze(cleaned)
    stored = store(analyzed)

Slide 25


Takeaways
• Python can still handle large data using blocked algorithms
• Dask collections form task graphs expressing these algorithms
• Dask schedulers execute these graphs in parallel
• Dask graphs can be directly created for custom pipelines

Slide 26


Questions?

http://dask.pydata.org