Dask - Out-of-core NumPy/Pandas through Task Scheduling

Talk given at SciPy 2015.
Video: https://youtu.be/1kkFZ4P-XHg

Dask Array implements the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. In this talk we describe dask, dask.array, dask.dataframe, as well as task scheduling generally.

Docs: http://dask.pydata.org/en/latest/
Github: https://github.com/ContinuumIO/dask

Jim Crist

July 08, 2015

Transcript

  1. Ocean Temperature Data

     • Daily mean ocean temperature every 1/4 degree
     • 720 x 1440 array every day
     • http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
  2. One year’s worth

     from netCDF4 import Dataset
     import matplotlib.pyplot as plt
     from numpy import flipud

     data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
     year_mean = data[:].mean(axis=0)

     plt.imshow(flipud(year_mean), cmap="RdBu_r")
     plt.title("Average Global Ocean Temperature, 2015")
  3. 36 years’ worth

     $ ls
     sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
     sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
     sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
     sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
     sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
     ...                      ...                      ...

     $ du -h
     15G    .
  4. 36 years’ worth

     $ ls
     sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
     sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
     sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
     sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
     sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
     ...                      ...                      ...

     $ du -h
     15G    .

     720 x 1440 x 12341 x 4 bytes = 51 GB uncompressed!
  5. Blocked Algorithms

     Blocked mean:

     import h5py
     import numpy as np

     x = h5py.File('myfile.hdf5')['x']          # Trillion element array on disk

     sums = []
     counts = []
     for i in range(1000000):                   # One million times
         chunk = x[1000000*i: 1000000*(i+1)]    # Pull out chunk
         sums.append(np.sum(chunk))             # Sum chunk
         counts.append(len(chunk))              # Count chunk

     result = sum(sums) / sum(counts)           # Aggregate results
  6. Blocked algorithms allow for

     • parallelism
     • lower RAM usage

     The trick is figuring out how to break the computation into blocks.
  7. Blocked algorithms allow for

     • parallelism
     • lower RAM usage

     The trick is figuring out how to break the computation into blocks.
     This is where dask comes in.
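     The blocked mean above, for example, collapses to a few lines with dask.array. This is a
     sketch rather than slide content, assuming the same hypothetical myfile.hdf5 dataset:

         import h5py
         import dask.array as da

         # dask handles the chunking, partial sums/counts, and final aggregation for us
         x = da.from_array(h5py.File('myfile.hdf5', 'r')['x'], chunks=(1000000,))
         result = x.mean().compute()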
  8. Dask is:

     • A parallel computing framework
     • That leverages the excellent Python ecosystem
     • Using blocked algorithms and task scheduling
  9. Dask is:

     • A parallel computing framework
     • That leverages the excellent Python ecosystem
     • Using blocked algorithms and task scheduling
     • Written in pure Python
  10. dask.array

      Out-of-core, parallel, n-dimensional array library

      • Copies the numpy interface
      • Arithmetic: +, *, …
      • Reductions: mean, max, …
      • Slicing: x[10:, 100:50:-2]
      • Fancy indexing: x[:, [3, 1, 2]]
      • Some linear algebra: tensordot, qr, svd, …
  11. dask.array

      Out-of-core, parallel, n-dimensional array library

      • Copies the numpy interface
      • Arithmetic: +, *, …
      • Reductions: mean, max, …
      • Slicing: x[10:, 100:50:-2]
      • Fancy indexing: x[:, [3, 1, 2]]
      • Some linear algebra: tensordot, qr, svd, …
      • New operations
      • Parallel algorithms (approximate quantiles, topk, …)
      • Slightly overlapping arrays
      • Integration with HDF5
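      As a small sketch of that numpy-style interface (not from the slides; the array shape,
      chunk size, and variable names are illustrative):

          import numpy as np
          import dask.array as da

          x = da.from_array(np.random.random((10000, 10000)), chunks=(1000, 1000))

          # Looks like numpy, but only builds a graph of blocked operations
          y = (x + x.T).mean(axis=0)

          result = y.compute()   # a (10000,) numpy array, computed in parallel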
  12. Out-of-core arrays

      import dask.array as da
      from netCDF4 import Dataset
      from glob import glob
      from numpy import flipud
      import matplotlib.pyplot as plt

      files = sorted(glob('*.nc'))
      data = [Dataset(f).variables['sst'] for f in files]
      arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
      x = da.concatenate(arrs, axis=0)

      full_mean = x.mean(axis=0)

      plt.imshow(flipud(full_mean), cmap='RdBu_r')
      plt.title('Average Global Ocean Temperature, 1981-2015')
  13. dask.dataframe

      • Out-of-core, blocked parallel DataFrame
      • Mirrors the pandas interface
      • Only implements a subset of pandas operations (currently)
  14. dask.dataframe: Efficient operations

      • Elementwise operations: df.x + df.y
      • Row-wise selections: df[df.x > 0]
      • Aggregations: df.x.max()
      • groupby-aggregate: df.groupby(df.x).y.max()
      • Value counts: df.x.value_counts()
      • Drop duplicates: df.x.drop_duplicates()
      • Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
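      A short sketch of these operations in use (the file pattern and column names are
      illustrative, not from the slides):

          import dask.dataframe as dd

          df = dd.read_csv('data-*.csv')                 # hypothetical CSVs with 'name', 'amount'

          positive = df[df.amount > 0]                   # row-wise selection
          per_name = df.groupby(df.name).amount.max()    # groupby-aggregate
          counts = df.name.value_counts()                # value counts

          print(per_name.compute().head())               # nothing runs until .compute()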
  15. dask.dataframe: Less efficient operations (require a shuffle unless on the index)

      • Set index: df.set_index(df.x)
      • groupby-apply
      • Join not on the index: dd.merge(df1, df2, on='name')
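      A sketch of the trade-off (file patterns and column names illustrative): setting the
      index pays the shuffle cost once, after which the join uses the efficient on-index path
      from the previous slide:

          import dask.dataframe as dd

          left = dd.read_csv('left-*.csv')       # hypothetical inputs sharing a 'name' column
          right = dd.read_csv('right-*.csv')

          left = left.set_index('name')          # expensive: requires a shuffle
          right = right.set_index('name')        # expensive: requires a shuffle

          joined = dd.merge(left, right, left_index=True, right_index=True)   # now efficient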
  16. Out-of-core dataframes

      • Yearly CSVs of all American flights since 1990
      • Contains information on times, airlines, locations, etc.
      • http://www.transtats.bts.gov/Fields.asp?Table_ID=236
  17. Out-of-core dataframes

      >>> import dask.dataframe as dd

      # Create a dataframe from csv files
      >>> df = dd.read_csv('*.csv', usecols=['Origin', 'DepTime', 'CRSDepTime', 'Cancelled'])

      # Get time series of non-cancelled and delayed flights
      >>> not_cancelled = df[df.Cancelled != 1]
      >>> delayed = not_cancelled[not_cancelled.DepTime > not_cancelled.CRSDepTime]

      # Count total and delayed flights per airport
      >>> total_per_airport = not_cancelled.Origin.value_counts()
      >>> delayed_per_airport = delayed.Origin.value_counts()

      # Calculate percent delayed per airport
      >>> percent_delayed = delayed_per_airport/total_per_airport

      # Remove airports that had less than 500 flights a year on average
      >>> out = percent_delayed[total_per_airport > 10000]
  18. Out-of-core dataframes

      # Convert to pandas, sort, and output top 10
      >>> result = out.compute()
      >>> result.sort(ascending=False)
      >>> result.head(10)
      ATL    0.538589
      PIT    0.515708
      ORD    0.513163
      PHL    0.508329
      DFW    0.506470
      CLT    0.501259
      DEN    0.474589
      JFK    0.453212
      SFO    0.452156
      CVG    0.452117
      dtype: float64
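      (Note: Series.sort was later removed from pandas; result.sort_values(ascending=False)
      is the equivalent call in current versions.)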
  19. Out-of-core dataframes

      • 10 GB on disk
      • Need to read ~4 GB subset to perform computation
      • Max memory during computation is only 0.75 GB
  20. • Collections build task graphs
      • Schedulers execute task graphs
      • Graph specification = uniting interface
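      A minimal sketch of that split (not from the slides): a collection only records a task
      graph, and nothing runs until a scheduler is asked for the result:

          import dask.array as da

          x = da.ones((6,), chunks=(3,))    # tiny collection: two blocks of three elements
          total = x.sum()

          print(list(total.dask))           # keys of the task graph the collection built
          print(total.compute())            # a scheduler executes that graph -> 6.0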
  21. Dask Specification

      • Dictionary of {name: task}
      • Tasks are tuples of (func, args...) (lispy syntax)
      • Args can be names, values, or tasks

      Python Code:

          a = 1
          b = 2
          x = inc(a)
          y = inc(b)
          z = mul(x, y)

      Dask Graph:

          dsk = {"a": 1,
                 "b": 2,
                 "x": (inc, "a"),
                 "y": (inc, "b"),
                 "z": (mul, "x", "y")}
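      To make the scheduler side concrete, here is a sketch (the slide does not define inc and
      mul, so trivial stand-ins are assumed) of handing that dictionary to dask's threaded
      scheduler:

          from operator import mul
          from dask.threaded import get      # threaded scheduler entry point

          def inc(i):
              return i + 1

          dsk = {"a": 1,
                 "b": 2,
                 "x": (inc, "a"),
                 "y": (inc, "b"),
                 "z": (mul, "x", "y")}

          print(get(dsk, "z"))               # (1 + 1) * (2 + 1) = 6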
  22. Can create graphs directly

      def load(filename):
          ...

      def clean(data):
          ...

      def analyze(sequence_of_data):
          ...

      def store(result):
          with open(..., 'w') as f:
              f.write(result)

      dsk = {'load-1': (load, 'myfile.a.data'),
             'load-2': (load, 'myfile.b.data'),
             'load-3': (load, 'myfile.c.data'),
             'clean-1': (clean, 'load-1'),
             'clean-2': (clean, 'load-2'),
             'clean-3': (clean, 'load-3'),
             'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
             'store': (store, 'analyze')}
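      Any dask scheduler can execute such a hand-written graph; for instance (scheduler choice
      illustrative), asking for the 'store' key runs the three independent load/clean chains in
      parallel before analyze and store:

          from dask.multiprocessing import get

          get(dsk, 'store')   # computes 'store' and everything it depends on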
  23. Takeaways

      • Python can still handle large data using blocked algorithms
      • Dask collections form task graphs expressing these algorithms
  24. Takeaways

      • Python can still handle large data using blocked algorithms
      • Dask collections form task graphs expressing these algorithms
      • Dask schedulers execute these graphs in parallel
  25. Takeaways

      • Python can still handle large data using blocked algorithms
      • Dask collections form task graphs expressing these algorithms
      • Dask schedulers execute these graphs in parallel
      • Dask graphs can be directly created for custom pipelines