Parallel and Out-of-Core Python with Dask

Jim Crist
October 02, 2015

Talk given at PyData at Strata NYC 2015 on 9/29/2015.

Given as part of a general tutorial on the Blaze ecosystem. Provides an overview of Dask (http://dask.pydata.org/en/latest/), as well as some examples.

Example notebooks can be found here: https://github.com/cpcloud/strata-nyc-2015

Transcript

  1. Parallel and Out-of-Core Python with Dask
    Jim Crist
    Continuum Analytics // github.com/jcrist // @jiminy_crist
    PyData at Strata NYC 2015

  2. A Motivating Example

  3. Ocean Temperature Data
    • Daily mean ocean temperature every 1/4 degree
    • 720 x 1440 array every day
    • http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

  4. One year’s worth
    from netCDF4 import Dataset
    import matplotlib.pyplot as plt
    from numpy import flipud
    data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
    year_mean = data[:].mean(axis=0)
    plt.imshow(flipud(year_mean), cmap="RdBu_r")
    plt.title("Average Global Ocean Temperature, 2015")

  5. 36 years’ worth
    $ ls
    sst.day.mean.1981.v2.nc sst.day.mean.1993.v2.nc sst.day.mean.2005.v2.nc
    sst.day.mean.1982.v2.nc sst.day.mean.1994.v2.nc sst.day.mean.2006.v2.nc
    sst.day.mean.1983.v2.nc sst.day.mean.1995.v2.nc sst.day.mean.2007.v2.nc
    sst.day.mean.1984.v2.nc sst.day.mean.1996.v2.nc sst.day.mean.2008.v2.nc
    sst.day.mean.1985.v2.nc sst.day.mean.1997.v2.nc sst.day.mean.2009.v2.nc
    ... ... ...
    $ du -h
    15G .
    720 x 1440 x 12341 x 4 = 51 GB uncompressed!
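The size estimate on this slide can be checked with a few lines of arithmetic. This is a small sketch; reading the factor of 4 as four bytes per value (for example float32) is an assumption, since the slide does not name the dtype.

```python
# Grid shape and day count as given on the slide.
n_lat, n_lon, n_days = 720, 1440, 12341
bytes_per_value = 4  # assumption: 4-byte values such as float32

total_bytes = n_lat * n_lon * n_days * bytes_per_value
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{total_bytes} bytes, about {total_gb:.1f} GB")
```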

  6. Can’t just load this all into NumPy… what now?

  7. Solution: blocked algorithms!

  8. Blocked Algorithms
    Blocked mean
    import h5py
    import numpy as np
    x = h5py.File('myfile.hdf5', 'r')['x'] # Trillion element array on disk
    sums = []
    counts = []
    for i in range(1000000): # One million times
        chunk = x[1000000*i: 1000000*(i+1)] # Pull out chunk
        sums.append(np.sum(chunk)) # Sum chunk
        counts.append(len(chunk)) # Count chunk
    result = sum(sums) / sum(counts) # Aggregate results
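The same pattern can be tried without a trillion-element file. The sketch below is a minimal, hypothetical `blocked_mean` helper (not part of Dask or h5py) that runs on a small in-memory NumPy array; with an on-disk array, each slice would be the only part read into memory at a time.

```python
import numpy as np

def blocked_mean(x, chunk_size):
    """Mean of a 1-D array-like, touching one chunk at a time."""
    total = 0.0
    count = 0
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]  # pull out one chunk
        total += np.sum(chunk)               # partial sum
        count += len(chunk)                  # partial count
    return total / count                     # aggregate the partial results

x = np.arange(10, dtype="float64")
result = blocked_mean(x, chunk_size=3)  # the last chunk is shorter; that is fine
```

The result should agree with `x.mean()`, because the sum and the count are both accumulated exactly once per element.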

  9. Blocked Algorithms

  10. Blocked algorithms allow for
    • parallelism
    • lower RAM usage
    The trick is figuring out how to break the computation into blocks.
    This is where Dask comes in.
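To illustrate the parallelism point: once a computation is split into blocks, the per-block work is independent and can be handed to a pool of workers. The following is a rough sketch using the standard library's thread pool rather than Dask; `chunk_stats` and `parallel_blocked_mean` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def chunk_stats(chunk):
    """Per-block work: return (sum, count) for one chunk."""
    return chunk.sum(), chunk.size

def parallel_blocked_mean(x, chunk_size, max_workers=4):
    # Split into blocks. For an in-memory NumPy array these are views;
    # for an on-disk array each block would be read separately.
    chunks = [x[i:i + chunk_size] for i in range(0, len(x), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        stats = list(pool.map(chunk_stats, chunks))  # blocks run concurrently
    total = sum(s for s, _ in stats)
    count = sum(n for _, n in stats)
    return total / count

x = np.arange(1000, dtype="float64")
result = parallel_blocked_mean(x, chunk_size=128)
```

Only the small `(sum, count)` pairs are combined at the end, which is why this decomposition works for a mean.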

  11. Dask is:
    • A parallel computing framework
    • That leverages the excellent Python ecosystem
    • Using blocked algorithms and task scheduling
    • Written in pure Python

  12. Dask

  13. dask.array
    • Out-of-core, parallel, n-dimensional array library
    • Mirrors the NumPy interface
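As a small illustration of these two points, the sketch below wraps an in-memory NumPy array in a chunked `dask.array` and calls a NumPy-style reduction on it. The array contents and chunk shape are arbitrary choices for the example; in the out-of-core case the wrapped object would be an on-disk array instead.

```python
import numpy as np
import dask.array as da

x_np = np.arange(12, dtype="float64").reshape(3, 4)

# Wrap the array in blocks of at most 2 x 2 elements.
x = da.from_array(x_np, chunks=(2, 2))

col_mean = x.mean(axis=0)    # same call as in NumPy, but lazy: builds a task graph
result = col_mean.compute()  # runs the graph and returns a NumPy array
```

Nothing is computed until `compute()` is called; before that, `col_mean` only describes the work to be done on each block.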

  14. Demo

  15. Task Schedulers

  16. Out-of-core arrays
    import dask.array as da
    from netCDF4 import Dataset
    from glob import glob
    from numpy import flipud
    import matplotlib.pyplot as plt
    files = sorted(glob('*.nc'))
    data = [Dataset(f).variables['sst'] for f in files]
    arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
    x = da.concatenate(arrs, axis=0)
    full_mean = x.mean(axis=0)
    plt.imshow(flipud(full_mean), cmap='RdBu_r')
    plt.title('Average Global Ocean Temperature, 1981-2015')

  17. Out-of-core arrays

  18. dask.dataframe
    • Out-of-core, blocked parallel DataFrame
    • Mirrors pandas interface

  19. Out-of-core dataframes
    • Yearly CSVs of all American flights 1987-2008
    • Contains information on times, airlines, locations, etc…
    • Roughly 121 million rows, 11GB of data
    • http://www.transtats.bts.gov/Fields.asp?Table_ID=236

  20. Demo

  21. • Collections build task graphs
    • Schedulers execute task graphs
    • Graph specification = uniting interface

  22. Dask Specification
    • Dictionary of {name: task}
    • Tasks are tuples of (func, args...) (lispy syntax)
    • Args can be names, values, or tasks
    Python code:
        a = 1
        b = 2
        x = inc(a)
        y = inc(b)
        z = mul(x, y)
    Dask graph:
        dsk = {"a": 1,
               "b": 2,
               "x": (inc, "a"),
               "y": (inc, "b"),
               "z": (mul, "x", "y")}
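To make the specification concrete, here is a toy recursive evaluator for a graph of this shape. It is only a sketch of the idea and not Dask's scheduler: `toy_get` is a hypothetical name, it runs serially, and it recomputes shared keys instead of caching them.

```python
from operator import mul

def inc(x):
    return x + 1

dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}

def toy_get(dsk, key):
    """Evaluate one key of a {name: task} graph by recursion."""
    def evaluate(arg):
        # A task is a tuple whose first element is callable: (func, args...)
        if isinstance(arg, tuple) and arg and callable(arg[0]):
            func, *args = arg
            return func(*(evaluate(a) for a in args))
        # A name refers to another entry in the graph.
        if isinstance(arg, str) and arg in dsk:
            return evaluate(dsk[arg])
        # Anything else is a plain value.
        return arg
    return evaluate(key)

result = toy_get(dsk, "z")  # (1 + 1) * (2 + 1)
```

A real scheduler works from the same dictionary but decides which tasks can run at the same time and when intermediate results can be released.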

  23. Dask collections fit many problems…
    … but not everything.

  24. Can create graphs directly
    @do
    def load(filename):
        ...
    @do
    def clean(data):
        ...
    @do
    def analyze(sequence_of_data):
        ...
    @do
    def store(result):
        with open(..., 'w') as f:
            f.write(result)
    files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
    loaded = [load(f) for f in files]
    cleaned = [clean(i) for i in loaded]
    analyzed = analyze(cleaned)
    stored = store(analyzed)
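In later Dask releases the `do` decorator was renamed `delayed`. The sketch below shows the same pipeline shape with `dask.delayed`. The function bodies are placeholders invented for this example (the slide elides them), and the `store` step is left out so that nothing is written to disk.

```python
from dask import delayed

@delayed
def load(filename):
    # Placeholder: pretend each file holds three numbers.
    return [len(filename), 1, 2]

@delayed
def clean(data):
    # Placeholder cleaning rule: drop values of 1 or less.
    return [value for value in data if value > 1]

@delayed
def analyze(sequence_of_data):
    # Placeholder analysis: total of all cleaned values.
    return sum(sum(data) for data in sequence_of_data)

files = ["myfile.a.data", "myfile.b.data", "myfile.c.data"]
loaded = [load(f) for f in files]     # three independent load tasks
cleaned = [clean(i) for i in loaded]  # three independent clean tasks
analyzed = analyze(cleaned)           # one task depending on all of them

result = analyzed.compute()           # nothing runs until this call
```

Calling a decorated function only records a task; the load and clean steps for different files have no dependencies on each other, so a scheduler is free to run them in parallel.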

  25. Takeaways
    • Python can still handle large data using blocked algorithms
    • Dask collections form task graphs expressing these algorithms
    • Dask schedulers execute these graphs in parallel
    • Dask graphs can be directly created for custom pipelines

  26. Questions?
    http://dask.pydata.org
