Slide 1

Dask: Out-of-core NumPy/Pandas through Task Scheduling

Jim Crist
[email protected]

Slide 2

A Motivating Example

Slide 3

Ocean Temperature Data
• Daily mean ocean temperature every 1/4 degree
• 720 x 1440 array every day
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

Slide 4

One year’s worth

from netCDF4 import Dataset
import matplotlib.pyplot as plt
from numpy import flipud

data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
year_mean = data[:].mean(axis=0)

plt.imshow(flipud(year_mean), cmap="RdBu_r")
plt.title("Average Global Ocean Temperature, 2015")

Slide 5

36 years’ worth

$ ls
sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
...                      ...                      ...

$ du -h
15G    .

Slide 6

36 years’ worth

$ ls
sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
...                      ...                      ...

$ du -h
15G    .

720 x 1440 x 12341 x 4 bytes = 51 GB uncompressed!

Slide 7

Can’t just load this all into NumPy… what now?

Slide 8

Solution: blocked algorithms!

Slide 9

Blocked Algorithms

Blocked mean:

import h5py
import numpy as np

x = h5py.File('myfile.hdf5')['x']        # Trillion element array on disk

sums = []
counts = []
for i in range(1000000):                 # One million times
    chunk = x[1000000*i: 1000000*(i+1)]  # Pull out chunk
    sums.append(np.sum(chunk))           # Sum chunk
    counts.append(len(chunk))            # Count chunk

result = sum(sums) / sum(counts)         # Aggregate results

Slide 10

Blocked Algorithms

Slide 11

Blocked algorithms allow for:
• parallelism
• lower RAM usage

Slide 12

Blocked algorithms allow for:
• parallelism
• lower RAM usage

The trick is figuring out how to break the computation into blocks.

Slide 13

Blocked algorithms allow for:
• parallelism
• lower RAM usage

The trick is figuring out how to break the computation into blocks.

This is where Dask comes in.

Slide 14

Dask is:

Slide 15

Dask is:
• A parallel computing framework

Slide 16

Dask is:
• A parallel computing framework
• That leverages the excellent Python ecosystem

Slide 17

Dask is:
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling

Slide 18

Dask is:
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python

Slide 19

Dask

Slide 20

dask.array

Out-of-core, parallel, n-dimensional array library

Slide 21

dask.array

Out-of-core, parallel, n-dimensional array library

Copies the numpy interface:
• Arithmetic: +, *, …
• Reductions: mean, max, …
• Slicing: x[10:, 100:50:-2]
• Fancy indexing: x[:, [3, 1, 2]]
• Some linear algebra: tensordot, qr, svd, …

Slide 22

dask.array

Out-of-core, parallel, n-dimensional array library

Copies the numpy interface:
• Arithmetic: +, *, …
• Reductions: mean, max, …
• Slicing: x[10:, 100:50:-2]
• Fancy indexing: x[:, [3, 1, 2]]
• Some linear algebra: tensordot, qr, svd, …

New operations:
• Parallel algorithms (approximate quantiles, topk, …)
• Slightly overlapping arrays
• Integration with HDF5
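
A minimal runnable sketch of how this interface reads (the random in-memory array here is just a stand-in for an on-disk dataset; everything else is the public dask.array API):

import numpy as np
import dask.array as da

# Stand-in data: a 4000 x 4000 numpy array split into 1000 x 1000 chunks
x = da.from_array(np.random.random((4000, 4000)), chunks=(1000, 1000))

# The same expressions you would write with numpy; each one builds
# a task graph instead of computing immediately
y = (x + x.T).mean(axis=0)   # arithmetic and a reduction
z = y[::2]                   # slicing

# Nothing has run yet; compute() executes the graph in parallel
print(z.compute().shape)     # (2000,)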

Slide 23

Demo

Slide 24

Task Schedulers

Slide 25

Task Schedulers

Slide 26

Task Schedulers

Slide 27

Task Schedulers

Slide 28

Out-of-core arrays

import dask.array as da
from netCDF4 import Dataset
from glob import glob
from numpy import flipud
import matplotlib.pyplot as plt

files = sorted(glob('*.nc'))
data = [Dataset(f).variables['sst'] for f in files]
arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
x = da.concatenate(arrs, axis=0)

full_mean = x.mean(axis=0)

plt.imshow(flipud(full_mean), cmap='RdBu_r')
plt.title('Average Global Ocean Temperature, 1981-2015')

Slide 29

Out-of-core arrays

Slide 30

dask.dataframe
• Out-of-core, blocked parallel DataFrame
• Mirrors pandas interface
• Only implements a subset of pandas operations (currently)

Slide 31

dask.dataframe

Efficient operations:
• Elementwise operations: df.x + df.y
• Row-wise selections: df[df.x > 0]
• Aggregations: df.x.max()
• groupby-aggregate: df.groupby(df.x).y.max()
• Value counts: df.x.value_counts()
• Drop duplicates: df.x.drop_duplicates()
• Join on index: dd.merge(df1, df2, left_index=True, right_index=True)

Slide 32

dask.dataframe

Less efficient operations (require shuffle unless on index):
• Set index: df.set_index(df.x)
• groupby-apply
• Join not on the index: dd.merge(df1, df2, on='name')
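
A minimal runnable sketch on a toy frame (the data is made up; dd.from_pandas and the operations shown are the real API):

import pandas as pd
import dask.dataframe as dd

# Toy data standing in for frames that would normally come from dd.read_csv
pdf = pd.DataFrame({'x': [1, -2, 3, -4, 5],
                    'y': [10, 20, 30, 40, 50]})
df = dd.from_pandas(pdf, npartitions=2)

# An efficient pipeline: each partition is a plain pandas DataFrame
# processed independently
result = df[df.x > 0].y.max()

# Lazy like dask.array; compute() runs the task graph
print(result.compute())  # 50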

Slide 33

Out-of-core dataframes
• Yearly csvs of all American flights since 1990
• Contains information on times, airlines, locations, etc.
• http://www.transtats.bts.gov/Fields.asp?Table_ID=236

Slide 34

Out-of-core dataframes

>>> import dask.dataframe as dd

# Create a dataframe from csv files
>>> df = dd.read_csv('*.csv',
...                  usecols=['Origin', 'DepTime', 'CRSDepTime', 'Cancelled'])

# Select the non-cancelled and the delayed flights
>>> not_cancelled = df[df.Cancelled != 1]
>>> delayed = not_cancelled[not_cancelled.DepTime > not_cancelled.CRSDepTime]

# Count total and delayed flights per airport
>>> total_per_airport = not_cancelled.Origin.value_counts()
>>> delayed_per_airport = delayed.Origin.value_counts()

# Calculate percent delayed per airport
>>> percent_delayed = delayed_per_airport / total_per_airport

# Remove airports that had less than 500 flights a year on average
>>> out = percent_delayed[total_per_airport > 10000]

Slide 35

Out-of-core dataframes

# Convert to an in-memory pandas series, sort, and output the top 10
>>> result = out.compute()
>>> result.sort(ascending=False)
>>> result.head(10)
ATL    0.538589
PIT    0.515708
ORD    0.513163
PHL    0.508329
DFW    0.506470
CLT    0.501259
DEN    0.474589
JFK    0.453212
SFO    0.452156
CVG    0.452117
dtype: float64

Slide 36

Out-of-core dataframes
• 10 GB on disk
• Need to read ~4 GB subset to perform computation
• Max memory during computation is only 0.75 GB

Slide 37

• Collections build task graphs
• Schedulers execute task graphs
• Graph specification = unifying interface

Slide 38

Dask Specification
• Dictionary of {name: task}
• Tasks are tuples of (func, args...) (lispy syntax)
• Args can be names, values, or tasks

Python Code:

a = 1
b = 2
x = inc(a)
y = inc(b)
z = mul(x, y)

Dask Graph:

dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}
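
This specification is runnable as-is. A sketch with inc and mul filled in, executed via the threaded scheduler's low-level get:

from dask.threaded import get

def inc(i):
    return i + 1

def mul(a, b):
    return a * b

# The graph: a plain dict mapping names to values or task tuples
dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}

# The scheduler walks the graph, running tasks whose inputs are ready
print(get(dsk, "z"))  # (1 + 1) * (2 + 1) == 6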

Slide 39

Dask collections fit many problems…

… but not everything.

Slide 40

Can create graphs directly

def load(filename):
    ...

def clean(data):
    ...

def analyze(sequence_of_data):
    ...

def store(result):
    with open(..., 'w') as f:
        f.write(result)

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}
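
Executing a hand-built graph is the same single call. A sketch, assuming the function bodies and filenames above are filled in:

from dask.multiprocessing import get  # or dask.threaded.get

# Runs the three loads in parallel, then the cleans, then analyze, then store
get(dsk, 'store')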

Slide 41

Takeaways

Slide 42

Takeaways
• Python can still handle large data using blocked algorithms

Slide 43

Takeaways
• Python can still handle large data using blocked algorithms
• Dask collections form task graphs expressing these algorithms

Slide 44

Takeaways
• Python can still handle large data using blocked algorithms
• Dask collections form task graphs expressing these algorithms
• Dask schedulers execute these graphs in parallel

Slide 45

Takeaways
• Python can still handle large data using blocked algorithms
• Dask collections form task graphs expressing these algorithms
• Dask schedulers execute these graphs in parallel
• Dask graphs can be directly created for custom pipelines

Slide 46

Questions?

http://dask.pydata.org