Parallel and Out-of-Core Python with Dask

Jim Crist
October 02, 2015

Talk given at PyData at Strata NYC 2015 on 9/29/2015.

Given as part of a general tutorial on the Blaze ecosystem. Provides an overview of Dask (http://dask.pydata.org/en/latest/), as well as some examples.

Example notebooks can be found here: https://github.com/cpcloud/strata-nyc-2015

Transcript

  1. Parallel and Out-of-Core Python with Dask
    Jim Crist, Continuum Analytics // github.com/jcrist // @jiminy_crist
    PyData at Strata NYC 2015
  2. A Motivating Example

  3. Ocean Temperature Data
    • Daily mean ocean temperature every 1/4 degree
    • 720 x 1440 array every day
    • http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
  4. One year’s worth
    from netCDF4 import Dataset
    import matplotlib.pyplot as plt
    from numpy import flipud

    # One file holds one year of daily sea surface temperature grids
    data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
    # Load the whole year into memory and average over the time axis
    year_mean = data[:].mean(axis=0)
    plt.imshow(flipud(year_mean), cmap="RdBu_r")
    plt.title("Average Global Ocean Temperature, 2015")
  5. 36 years’ worth
    $ ls
    sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
    sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
    sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
    sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
    sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
    ...                      ...                      ...
    $ du -h
    15G     .
    720 x 1440 x 12341 x 4 bytes = 51 GB uncompressed!
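The uncompressed size quoted on the slide can be checked with a few lines of arithmetic. This is a small sketch that assumes each value is stored as a 4-byte float (the "x 4" on the slide) and that GB means decimal gigabytes; the variable names are illustrative only.

```python
# Dimensions taken from the slide: latitude x longitude x number of days.
n_lat, n_lon, n_days = 720, 1440, 12341
bytes_per_value = 4  # assumption: 32-bit floats, matching the "x 4" on the slide

total_bytes = n_lat * n_lon * n_days * bytes_per_value
total_gb = total_bytes / 1e9  # assumption: decimal gigabytes

print(f"{total_bytes:,} bytes = {total_gb:.1f} GB")
# prints: 51,180,595,200 bytes = 51.2 GB
```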
  6. Can’t just load this all into Numpy… what now?

  7. Solution: blocked algorithms!

  8. Blocked Algorithms
    Blocked mean
    import h5py
    import numpy as np

    x = h5py.File('myfile.hdf5')['x']          # Trillion element array on disk
    sums = []
    counts = []
    for i in range(1000000):                   # One million times
        chunk = x[1000000*i: 1000000*(i+1)]    # Pull out chunk
        sums.append(np.sum(chunk))             # Sum chunk
        counts.append(len(chunk))              # Count chunk
    result = sum(sums) / sum(counts)           # Aggregate results
  9. Blocked Algorithms

  10. Blocked algorithms allow for
    • parallelism
    • lower RAM usage
    The trick is figuring out how to break the computation into blocks. This is where dask comes in.
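To make the two benefits concrete, here is a minimal pure-Python sketch of a blocked mean whose blocks are handed to a thread pool. It is not how dask implements anything; `chunk_stats` and `blocked_mean` are hypothetical helper names, and a plain list stands in for a large on-disk array. Only one block per worker is touched at a time, and the blocks are independent, so they can be processed concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(data, start, stop):
    """Return (sum, count) for one block; only this slice is read."""
    block = data[start:stop]
    return sum(block), len(block)


def blocked_mean(data, chunk_size, max_workers=4):
    """Mean of `data`, computed block by block and combined at the end."""
    bounds = [(i, min(i + chunk_size, len(data)))
              for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Blocks do not depend on each other, so they can run concurrently.
        partials = list(pool.map(lambda b: chunk_stats(data, *b), bounds))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


values = list(range(1, 101))  # stand-in for a large on-disk array
result = blocked_mean(values, chunk_size=7)
print(result)
```

With pure-Python work the threads mostly illustrate the structure rather than give a speedup; the gain comes when each block's work releases the GIL, as NumPy reductions and file reads typically do.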
  11. Dask is:
    • A parallel computing framework
    • That leverages the excellent Python ecosystem
    • Using blocked algorithms and task scheduling
    • Written in pure Python
  12. Dask

  13. dask.array
    • Out-of-core, parallel, n-dimensional array library
    • Mirrors the NumPy interface
  14. Demo

  15. Task Schedulers

  16. Out-of-core arrays
    import dask.array as da
    from netCDF4 import Dataset
    from glob import glob
    from numpy import flipud
    import matplotlib.pyplot as plt

    files = sorted(glob('*.nc'))
    data = [Dataset(f).variables['sst'] for f in files]
    # Wrap each on-disk variable as a lazy, chunked dask array
    arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
    # Stack all years along the time axis
    x = da.concatenate(arrs, axis=0)
    full_mean = x.mean(axis=0)
    plt.imshow(flipud(full_mean), cmap='RdBu_r')
    plt.title('Average Global Ocean Temperature, 1981-2015')
  17. Out-of-core arrays

  18. dask.dataframe
    • Out-of-core, blocked parallel DataFrame
    • Mirrors the pandas interface

  19. Out-of-core dataframes
    • Yearly CSVs of all American flights 1987-2008
    • Contains information on times, airlines, locations, etc.
    • Roughly 121 million rows, 11 GB of data
    • http://www.transtats.bts.gov/Fields.asp?Table_ID=236
  20. Demo

  21. • Collections build task graphs
    • Schedulers execute task graphs
    • Graph specification = uniting interface
  22. Dask Specification
    • Dictionary of {name: task}
    • Tasks are tuples of (func, args...) (lispy syntax)
    • Args can be names, values, or tasks
    Python Code
    a = 1
    b = 2
    x = inc(a)
    y = inc(b)
    z = mul(x, y)
    Dask Graph
    dsk = {"a": 1,
           "b": 2,
           "x": (inc, "a"),
           "y": (inc, "b"),
           "z": (mul, "x", "y")}
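To show how little is needed to execute a graph written in this specification, here is a toy, single-threaded evaluator. It is only a sketch under stated assumptions: `get` and `_evaluate` below are hypothetical helpers written for this example, not dask's schedulers, and they skip things a real scheduler handles (parallelism, lists of keys, cycle detection). `inc` is defined here because the slide does not show its definition.

```python
from operator import mul


def inc(x):
    return x + 1


def _evaluate(dsk, expr, cache):
    """Evaluate one expression: a task tuple, a key name, or a plain value."""
    if isinstance(expr, tuple) and expr and callable(expr[0]):
        func, *args = expr  # a task is (func, args...)
        return func(*(_evaluate(dsk, a, cache) for a in args))
    try:
        if expr in dsk:  # a name refers to another entry in the graph
            return get(dsk, expr, cache)
    except TypeError:
        pass  # unhashable values cannot be keys, so treat them as literals
    return expr  # anything else is a literal value


def get(dsk, key, cache=None):
    """Compute `key`, reusing results already computed in `cache`."""
    if cache is None:
        cache = {}
    if key not in cache:
        cache[key] = _evaluate(dsk, dsk[key], cache)
    return cache[key]


dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}

print(get(dsk, "z"))  # inc(1) * inc(2)
```

Because the graph is an ordinary dictionary, any collection can produce it and any scheduler can consume it, which is the sense in which the specification acts as the uniting interface.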
  23. Dask collections fit many problems… but not everything.

  24. Can create graphs directly
    from dask.imperative import do  # 2015-era API; import path assumed

    @do
    def load(filename):
        ...

    @do
    def clean(data):
        ...

    @do
    def analyze(sequence_of_data):
        ...

    @do
    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
    loaded = [load(f) for f in files]       # Nothing runs yet; each call is recorded
    cleaned = [clean(i) for i in loaded]
    analyzed = analyze(cleaned)
    stored = store(analyzed)
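The idea behind decorating functions this way can be sketched with a toy stand-in. The `lazy` decorator and `Lazy` class below are hypothetical and written only for illustration; they are not dask's `do`, they run sequentially, and the function bodies are invented placeholders since the slide elides the real ones. The point is that calling a decorated function records the call instead of running it, and the work happens only when the final result is requested.

```python
class Lazy:
    """A recorded call: the function plus its (possibly lazy) arguments."""

    def __init__(self, func, args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve dependencies first, then run this call.
        return self.func(*[_resolve(a) for a in self.args])


def _resolve(arg):
    if isinstance(arg, Lazy):
        return arg.compute()
    if isinstance(arg, list):  # allow lists of lazy values as arguments
        return [_resolve(a) for a in arg]
    return arg


def lazy(func):
    """Hypothetical stand-in for `do`: record the call instead of running it."""
    def wrapper(*args):
        return Lazy(func, args)
    return wrapper


calls = []  # records which functions actually ran, in order


@lazy
def load(filename):
    calls.append("load")
    return filename.split(".")  # placeholder body


@lazy
def clean(parts):
    calls.append("clean")
    return parts[1]  # placeholder body


@lazy
def analyze(sequence_of_data):
    calls.append("analyze")
    return "".join(sequence_of_data)  # placeholder body


files = ["myfile.a.data", "myfile.b.data", "myfile.c.data"]
loaded = [load(f) for f in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)

ran_before_compute = list(calls)  # still empty: only the call structure was built
result = analyzed.compute()
print(result)
```

A real scheduler would walk the same dependency structure but could run the independent `load` and `clean` calls for different files in parallel.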
  25. Takeaways
    • Python can still handle large data using blocked algorithms
    • Dask collections form task graphs expressing these algorithms
    • Dask schedulers execute these graphs in parallel
    • Dask graphs can be directly created for custom pipelines
  26. Questions? http://dask.pydata.org