Dask - Out-of-core NumPy/Pandas through Task Scheduling

Talk given at SciPy 2015.
Video: https://youtu.be/1kkFZ4P-XHg

Dask Array implements the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. In this talk we describe dask, dask.array, dask.dataframe, as well as task scheduling generally.

Docs: http://dask.pydata.org/en/latest/
Github: https://github.com/ContinuumIO/dask


Jim Crist

July 08, 2015

Transcript

  1. Dask: Out-of-core Numpy/Pandas through Task Scheduling
     Jim Crist, jcrist@continuum.io

  2. A Motivating Example

  3. Ocean Temperature Data

     • Daily mean ocean temperature every 1/4 degree
     • 720 x 1440 array every day
     • http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
  4. One year’s worth

     from netCDF4 import Dataset
     import matplotlib.pyplot as plt
     from numpy import flipud

     data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
     year_mean = data[:].mean(axis=0)
     plt.imshow(flipud(year_mean), cmap="RdBu_r")
     plt.title("Average Global Ocean Temperature, 2015")
  5. 36 years’ worth

     $ ls
     sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
     sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
     sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
     sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
     sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
     ...                      ...                      ...

     $ du -h
     15G    .
  6. 36 years’ worth

     (same directory listing as on the previous slide)

     720 x 1440 x 12341 x 4 bytes = 51 GB uncompressed!
  7. Can’t just load this all into Numpy… what now?

  8. Solution: blocked algorithms!

  9. Blocked Algorithms

     Blocked mean:

     x = h5py.File('myfile.hdf5')['x']        # Trillion element array on disk
     sums = []
     counts = []
     for i in range(1000000):                 # One million times
         chunk = x[1000000*i: 1000000*(i+1)]  # Pull out chunk
         sums.append(np.sum(chunk))           # Sum chunk
         counts.append(len(chunk))            # Count chunk
     result = sum(sums) / sum(counts)         # Aggregate results
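The same blocked-mean pattern can be run end to end with plain NumPy. Below is a minimal sketch in which a small in-memory array stands in for the trillion-element on-disk dataset (the array contents and chunk size here are made up for illustration):

```python
import numpy as np

# Stand-in for the on-disk HDF5 dataset: small enough to check the
# blocked answer against the all-at-once answer.
x = np.arange(10_000, dtype="float64")
chunksize = 1_000

sums = []
counts = []
for i in range(len(x) // chunksize):                # One chunk at a time
    chunk = x[chunksize * i : chunksize * (i + 1)]  # Pull out chunk
    sums.append(np.sum(chunk))                      # Sum chunk
    counts.append(len(chunk))                       # Count chunk

blocked_mean = sum(sums) / sum(counts)              # Aggregate results
assert blocked_mean == x.mean()                     # Same answer as the naive mean
```

Only one chunk is resident at a time, so peak memory is bounded by the chunk size rather than by the full array.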
  10. Blocked Algorithms

  11. Blocked algorithms allow for
      • parallelism
      • lower RAM usage

  12. Blocked algorithms allow for
      • parallelism
      • lower RAM usage

      The trick is figuring out how to break the computation into blocks.
  13. Blocked algorithms allow for
      • parallelism
      • lower RAM usage

      The trick is figuring out how to break the computation into blocks.
      This is where dask comes in.
  14. Dask is:

  15. Dask is:
      • A parallel computing framework

  16. Dask is:
      • A parallel computing framework
      • That leverages the excellent Python ecosystem
  17. Dask is:
      • A parallel computing framework
      • That leverages the excellent Python ecosystem
      • Using blocked algorithms and task scheduling
  18. Dask is:
      • A parallel computing framework
      • That leverages the excellent Python ecosystem
      • Using blocked algorithms and task scheduling
      • Written in pure Python
  19. Dask

  20. dask.array Out-of-core, parallel, n-dimensional array library

  21. dask.array
      Out-of-core, parallel, n-dimensional array library

      • Copies the numpy interface
      • Arithmetic: +, *, …
      • Reductions: mean, max, …
      • Slicing: x[10:, 100:50:-2]
      • Fancy indexing: x[:, [3, 1, 2]]
      • Some linear algebra: tensordot, qr, svd, …
  22. dask.array
      Out-of-core, parallel, n-dimensional array library

      • Copies the numpy interface
      • Arithmetic: +, *, …
      • Reductions: mean, max, …
      • Slicing: x[10:, 100:50:-2]
      • Fancy indexing: x[:, [3, 1, 2]]
      • Some linear algebra: tensordot, qr, svd, …

      New operations:
      • Parallel algorithms (approximate quantiles, topk, …)
      • Slightly overlapping arrays
      • Integration with HDF5
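A minimal runnable sketch of that numpy-mirroring interface (assuming dask is installed; the array contents and chunk shape below are made up for illustration):

```python
import numpy as np
import dask.array as da

# A small array chunked into 500x500 blocks; each block becomes one task
x = da.from_array(np.arange(1_000_000, dtype="float64").reshape(1000, 1000),
                  chunks=(500, 500))

y = (x + 1).mean(axis=0)   # Arithmetic and reductions, same as numpy
z = x[10:, 100:50:-2]      # Slicing builds more tasks, still lazy

result = y.compute()       # Nothing executes until .compute()
assert result.shape == (1000,)
```

Until `.compute()` is called, `x`, `y`, and `z` are just task graphs plus shape/dtype metadata, so the expressions above cost almost nothing to build.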
  23. Demo

  24. Task Schedulers

  25. Task Schedulers

  26. Task Schedulers

  27. Task Schedulers

  28. Out-of-core arrays

      import dask.array as da
      import numpy as np
      import matplotlib.pyplot as plt
      from glob import glob
      from netCDF4 import Dataset

      files = sorted(glob('*.nc'))
      data = [Dataset(f).variables['sst'] for f in files]
      arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
      x = da.concatenate(arrs, axis=0)
      full_mean = x.mean(axis=0)
      plt.imshow(np.flipud(full_mean), cmap='RdBu_r')
      plt.title('Average Global Ocean Temperature, 1981-2015')
  29. Out-of-core arrays

  30. dask.dataframe
      • Out-of-core, blocked parallel DataFrame
      • Mirrors pandas interface
      • Only implements a subset of pandas operations (currently)
  31. dask.dataframe
      Efficient operations:
      • Elementwise operations: df.x + df.y
      • Row-wise selections: df[df.x > 0]
      • Aggregations: df.x.max()
      • groupby-aggregate: df.groupby(df.x).y.max()
      • Value counts: df.x.value_counts()
      • Drop duplicates: df.x.drop_duplicates()
      • Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
  32. dask.dataframe
      Less efficient operations (require shuffle unless on index):
      • Set index: df.set_index(df.x)
      • groupby-apply
      • Join not on the index: dd.merge(df1, df2, on='name')
  33. Out-of-core dataframes
      • Yearly csvs of all American flights since 1990
      • Contains information on times, airlines, locations, etc…
      • http://www.transtats.bts.gov/Fields.asp?Table_ID=236
  34. Out-of-core dataframes

      >>> import dask.dataframe as dd

      # Create a dataframe from csv files
      >>> df = dd.read_csv('*.csv', usecols=['Origin', 'DepTime', 'CRSDepTime', 'Cancelled'])

      # Get time series of non-cancelled and delayed flights
      >>> not_cancelled = df[df.Cancelled != 1]
      >>> delayed = not_cancelled[not_cancelled.DepTime > not_cancelled.CRSDepTime]

      # Count total and delayed flights per airport
      >>> total_per_airport = not_cancelled.Origin.value_counts()
      >>> delayed_per_airport = delayed.Origin.value_counts()

      # Calculate percent delayed per airport
      >>> percent_delayed = delayed_per_airport / total_per_airport

      # Remove airports that had less than 500 flights a year on average
      >>> out = percent_delayed[total_per_airport > 10000]
  35. Out-of-core dataframes

      # Convert to pandas dataframe, sort, and output top 10
      >>> result = out.compute()
      >>> result.sort(ascending=False)
      >>> result.head(10)
      ATL    0.538589
      PIT    0.515708
      ORD    0.513163
      PHL    0.508329
      DFW    0.506470
      CLT    0.501259
      DEN    0.474589
      JFK    0.453212
      SFO    0.452156
      CVG    0.452117
      dtype: float64
  36. Out-of-core dataframes
      • 10 GB on disk
      • Need to read ~4 GB subset to perform computation
      • Max memory during computation is only 0.75 GB
  37. • Collections build task graphs
      • Schedulers execute task graphs
      • Graph specification = uniting interface
  38. Dask Specification

      • Dictionary of {name: task}
      • Tasks are tuples of (func, args...) (lispy syntax)
      • Args can be names, values, or tasks

      Python Code:

          a = 1
          b = 2
          x = inc(a)
          y = inc(b)
          z = mul(x, y)

      Dask Graph:

          dsk = {"a": 1,
                 "b": 2,
                 "x": (inc, "a"),
                 "y": (inc, "b"),
                 "z": (mul, "x", "y")}
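To make the specification concrete, here is a toy single-threaded executor for that graph format. It is only a sketch standing in for dask's real schedulers, and it assumes task arguments are always keys (real dask graphs also allow literal values and nested tasks):

```python
def inc(i):
    return i + 1

def mul(a, b):
    return a * b

# The example graph from the slide
dsk = {"a": 1, "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}

def get(graph, key):
    """Recursively evaluate one key of a dask-style graph."""
    value = graph[key]
    if isinstance(value, tuple):       # A task: (func, *arg_keys)
        func, *args = value
        return func(*(get(graph, k) for k in args))
    return value                       # A plain value

assert get(dsk, "z") == 6              # mul(inc(1), inc(2)) == 2 * 3
```

Because the graph is just a dict, a scheduler is free to evaluate independent keys (here "x" and "y") in parallel; this toy version simply recurses sequentially.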
  39. Dask collections fit many problems… … but not everything.

  40. Can create graphs directly

      def load(filename):
          ...

      def clean(data):
          ...

      def analyze(sequence_of_data):
          ...

      def store(result):
          with open(..., 'w') as f:
              f.write(result)

      dsk = {'load-1': (load, 'myfile.a.data'),
             'load-2': (load, 'myfile.b.data'),
             'load-3': (load, 'myfile.c.data'),
             'clean-1': (clean, 'load-1'),
             'clean-2': (clean, 'load-2'),
             'clean-3': (clean, 'load-3'),
             'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
             'store': (store, 'analyze')}
  41. Takeaways

  42. Takeaways
      • Python can still handle large data using blocked algorithms
  43. Takeaways
      • Python can still handle large data using blocked algorithms
      • Dask collections form task graphs expressing these algorithms
  44. Takeaways
      • Python can still handle large data using blocked algorithms
      • Dask collections form task graphs expressing these algorithms
      • Dask schedulers execute these graphs in parallel
  45. Takeaways
      • Python can still handle large data using blocked algorithms
      • Dask collections form task graphs expressing these algorithms
      • Dask schedulers execute these graphs in parallel
      • Dask graphs can be directly created for custom pipelines
  46. Questions? http://dask.pydata.org