Parallel and Out-of-Core Python with Dask

Jim Crist
Continuum Analytics
github.com/jcrist // @jiminy_crist
PyData at Strata NYC 2015

A Motivating Example

Ocean Temperature Data
• Daily mean ocean temperature every 1/4 degree
• 720 x 1440 array every day
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

One year’s worth

    from netCDF4 import Dataset
    import matplotlib.pyplot as plt
    from numpy import flipud

    data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
    year_mean = data[:].mean(axis=0)
    plt.imshow(flipud(year_mean), cmap="RdBu_r")
    plt.title("Average Global Ocean Temperature, 2015")

36 years’ worth

    $ ls
    sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
    sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
    sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
    sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
    sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
    ...                      ...                      ...
    $ du -h
    15G     .

720 x 1440 x 12341 x 4 = 51 GB uncompressed!

Can’t just load this all into NumPy… what now?

Solution: blocked algorithms!

Blocked Algorithms

Blocked mean

    import h5py
    import numpy as np

    x = h5py.File('myfile.hdf5', 'r')['x']     # Trillion element array on disk

    sums = []
    counts = []
    for i in range(1000000):                   # One million times
        chunk = x[1000000*i: 1000000*(i+1)]    # Pull out chunk
        sums.append(np.sum(chunk))             # Sum chunk
        counts.append(len(chunk))              # Count chunk

    result = sum(sums) / sum(counts)           # Aggregate results

Blocked Algorithms

Blocked algorithms allow for
• parallelism
• lower RAM usage

The trick is figuring out how to break the computation into blocks.
This is where dask comes in.
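
To make the blocked-mean idea concrete at a scale that runs anywhere, here is a minimal sketch in plain Python with no third-party dependencies. A short list stands in for the on-disk array, and the names `blocked_mean` and `block_size` are illustrative choices, not part of dask or the original slides.

```python
def blocked_mean(values, block_size):
    """Mean of `values`, touching only `block_size` elements at a time."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    sums = []
    counts = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]   # pull out one block
        sums.append(sum(chunk))                    # per-block partial sum
        counts.append(len(chunk))                  # per-block element count
    if not counts:
        raise ValueError("cannot take the mean of an empty sequence")
    return sum(sums) / sum(counts)                 # aggregate the partials


data = list(range(10))                      # stand-in for a too-large array
result = blocked_mean(data, block_size=4)   # blocks of 4, 4, and 2 elements
```

Each block's sum and count depend only on that block, so the loop body could run in parallel, and only one block needs to be in memory at a time. The final aggregation is the only step that sees all the partial results.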

Dask is:
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python

dask.array
• Out-of-core, parallel, n-dimensional array library
• Mirrors NumPy interface

Task Schedulers

Out-of-core arrays

    import dask.array as da
    from netCDF4 import Dataset
    from glob import glob
    from numpy import flipud
    import matplotlib.pyplot as plt

    files = sorted(glob('*.nc'))
    data = [Dataset(f).variables['sst'] for f in files]
    arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
    x = da.concatenate(arrs, axis=0)    # one lazy array spanning all files
    full_mean = x.mean(axis=0)          # blocked mean over the time axis
    plt.imshow(flipud(full_mean), cmap='RdBu_r')
    plt.title('Average Global Ocean Temperature, 1981-2015')

Out-of-core arrays

dask.dataframe
• Out-of-core, blocked parallel DataFrame
• Mirrors pandas interface

Out-of-core dataframes
• Yearly CSVs of all American flights 1987-2008
• Contains information on times, airlines, locations, etc.
• Roughly 121 million rows, 11 GB of data
• http://www.transtats.bts.gov/Fields.asp?Table_ID=236

• Collections build task graphs
• Schedulers execute task graphs
• Graph specification = uniting interface

Dask Specification
• Dictionary of {name: task}
• Tasks are tuples of (func, args...) (lispy syntax)
• Args can be names, values, or tasks

Python Code             Dask Graph

a = 1                   dsk = {"a": 1,
b = 2                          "b": 2,
x = inc(a)                     "x": (inc, "a"),
y = inc(b)                     "y": (inc, "b"),
z = mul(x, y)                  "z": (mul, "x", "y")}
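
To show how little machinery the specification needs, here is a toy recursive evaluator for graphs in the {name: task} form. This is a sketch for illustration only: `evaluate` is a hypothetical helper, not dask's scheduler, and it ignores everything a real scheduler handles, such as execution order, parallelism, and releasing intermediate results.

```python
from operator import mul


def inc(x):
    return x + 1


def evaluate(dsk, arg, cache=None):
    """Toy recursive evaluator for a graph in the {name: task} form."""
    if cache is None:
        cache = {}
    # A task is a tuple whose first element is callable: (func, args...).
    if isinstance(arg, tuple) and arg and callable(arg[0]):
        func, *args = arg
        return func(*(evaluate(dsk, a, cache) for a in args))
    # A name is a key of the graph; compute it once and remember the result.
    try:
        is_name = arg in dsk
    except TypeError:          # unhashable, so it cannot be a key
        is_name = False
    if is_name:
        if arg not in cache:
            cache[arg] = evaluate(dsk, dsk[arg], cache)
        return cache[arg]
    # Anything else is a literal value.
    return arg


dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}

z = evaluate(dsk, "z")         # computes mul(inc(1), inc(2))
```

Because the graph is an ordinary dictionary of ordinary tuples, any collection can produce one and any scheduler can consume one, which is what makes the specification a uniting interface.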

Dask collections fit many problems… but not everything.

Can create graphs directly

    from dask.imperative import do   # dask's imperative interface as of this 2015 talk

    @do
    def load(filename):
        ...

    @do
    def clean(data):
        ...

    @do
    def analyze(sequence_of_data):
        ...

    @do
    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
    loaded = [load(f) for f in files]
    cleaned = [clean(i) for i in loaded]
    analyzed = analyze(cleaned)
    stored = store(analyzed)
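
To illustrate how decorated calls can build a graph instead of running immediately, here is a toy stand-in written in plain Python. The names `lazy`, `Lazy`, and `compute` are hypothetical and are not dask's implementation; the function bodies are placeholders chosen so the example runs without any files.

```python
import itertools

_counter = itertools.count()


class Lazy:
    """Placeholder for a result, carrying the graph needed to compute it."""

    def __init__(self, key, graph):
        self.key = key
        self.graph = graph


def _to_task_arg(arg, graph):
    """Replace Lazy objects with their keys and merge their graphs."""
    if isinstance(arg, Lazy):
        graph.update(arg.graph)
        return arg.key
    if isinstance(arg, list):
        return [_to_task_arg(a, graph) for a in arg]
    return arg


def lazy(func):
    """Record each call as a (func, args...) task instead of running it."""
    def wrapper(*args):
        graph = {}
        task_args = tuple(_to_task_arg(a, graph) for a in args)
        key = "%s-%d" % (func.__name__, next(_counter))
        graph[key] = (func,) + task_args
        return Lazy(key, graph)
    return wrapper


def compute(lazy_value):
    """Run the recorded graph, evaluating each key at most once."""
    graph, cache = lazy_value.graph, {}

    def run(arg):
        if isinstance(arg, list):
            return [run(a) for a in arg]
        if isinstance(arg, str) and arg in graph:
            if arg not in cache:
                func, *args = graph[arg]
                cache[arg] = func(*(run(a) for a in args))
            return cache[arg]
        return arg                      # literal value, e.g. a filename

    return run(lazy_value.key)


@lazy
def load(filename):
    return len(filename)                # placeholder for reading a file


@lazy
def clean(data):
    return data * 2                     # placeholder cleaning step


@lazy
def analyze(sequence_of_data):
    return sum(sequence_of_data)        # placeholder analysis step


files = ["a.data", "bb.data", "ccc.data"]
loaded = [load(f) for f in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)             # only a graph so far; nothing has run
result = compute(analyzed)              # now the seven tasks execute
```

The three `load` and three `clean` tasks have no dependencies on each other across files, so a real scheduler is free to run them in parallel before the single `analyze` task.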

Takeaways
• Python can still handle large data using blocked algorithms
• Dask collections form task graphs expressing these algorithms
• Dask schedulers execute these graphs in parallel
• Dask graphs can be directly created for custom pipelines

Questions? http://dask.pydata.org
