Dask - Out-of-core NumPy/Pandas through Task Scheduling

Talk given at SciPy 2015.
Video: https://youtu.be/1kkFZ4P-XHg

Dask Array implements the NumPy ndarray interface using blocked algorithms, cutting the large array up into many small arrays. This lets us compute on arrays larger than memory using all of our cores. In this talk we describe dask, dask.array, and dask.dataframe, as well as task scheduling generally.

Docs: http://dask.pydata.org/en/latest/
Github: https://github.com/ContinuumIO/dask

Jim Crist

July 08, 2015

Transcript

  1. Dask:
    Out-of-core NumPy/Pandas
    through Task Scheduling
    Jim Crist
    [email protected]


  2. A Motivating Example


  3. Ocean Temperature Data
    • Daily mean ocean temperature on a 1/4-degree grid
    • One 720 x 1440 array per day
    • http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html


  4. One year’s worth
    from netCDF4 import Dataset
    import matplotlib.pyplot as plt
    from numpy import flipud
    data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
    year_mean = data[:].mean(axis=0)
    plt.imshow(flipud(year_mean), cmap="RdBu_r")
    plt.title("Average Global Ocean Temperature, 2015")


  5. 36 years’ worth
    $ ls
    sst.day.mean.1981.v2.nc sst.day.mean.1993.v2.nc sst.day.mean.2005.v2.nc
    sst.day.mean.1982.v2.nc sst.day.mean.1994.v2.nc sst.day.mean.2006.v2.nc
    sst.day.mean.1983.v2.nc sst.day.mean.1995.v2.nc sst.day.mean.2007.v2.nc
    sst.day.mean.1984.v2.nc sst.day.mean.1996.v2.nc sst.day.mean.2008.v2.nc
    sst.day.mean.1985.v2.nc sst.day.mean.1997.v2.nc sst.day.mean.2009.v2.nc
    ... ... ...
    $ du -h
    15G .


  6. 36 years’ worth
    $ ls
    sst.day.mean.1981.v2.nc sst.day.mean.1993.v2.nc sst.day.mean.2005.v2.nc
    sst.day.mean.1982.v2.nc sst.day.mean.1994.v2.nc sst.day.mean.2006.v2.nc
    sst.day.mean.1983.v2.nc sst.day.mean.1995.v2.nc sst.day.mean.2007.v2.nc
    sst.day.mean.1984.v2.nc sst.day.mean.1996.v2.nc sst.day.mean.2008.v2.nc
    sst.day.mean.1985.v2.nc sst.day.mean.1997.v2.nc sst.day.mean.2009.v2.nc
    ... ... ...
    $ du -h
    15G .
    720 x 1440 x 12341 values x 4 bytes = 51 GB uncompressed!


  7. Can’t just load this all into NumPy… what now?


  8. Solution: blocked algorithms!


  9. Blocked Algorithms
    Blocked mean
    import h5py
    import numpy as np

    x = h5py.File('myfile.hdf5')['x']        # Trillion element array on disk
    sums = []
    counts = []
    for i in range(1000000):                 # One million times
        chunk = x[1000000*i: 1000000*(i+1)]  # Pull out chunk
        sums.append(np.sum(chunk))           # Sum chunk
        counts.append(len(chunk))            # Count chunk

    result = sum(sums) / sum(counts)         # Aggregate results


  10. Blocked Algorithms


  11. Blocked algorithms allow for
    • parallelism
    • lower RAM usage


  12. Blocked algorithms allow for
    • parallelism
    • lower RAM usage
    The trick is figuring out how to
    break the computation into blocks.


  13. Blocked algorithms allow for
    • parallelism
    • lower RAM usage
    The trick is figuring out how to
    break the computation into blocks.
    This is where dask
    comes in.


  14. Dask is:


  15. Dask is:
    • A parallel computing framework


  16. Dask is:
    • A parallel computing framework
    • That leverages the excellent Python ecosystem


  17. Dask is:
    • A parallel computing framework
    • That leverages the excellent Python ecosystem
    • Using blocked algorithms and task scheduling


  18. Dask is:
    • A parallel computing framework
    • That leverages the excellent Python ecosystem
    • Using blocked algorithms and task scheduling
    • Written in pure Python


  19. Dask


  20. dask.array
    Out-of-core, parallel, n-dimensional array library
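
    For example, the hand-written blocked mean from slide 9 collapses to a
    few lines of dask.array (a minimal sketch; 'myfile.hdf5' and its dataset
    'x' are the same hypothetical file as before):

    import dask.array as da
    import h5py

    f = h5py.File('myfile.hdf5')
    x = da.from_array(f['x'], chunks=(1000000,))  # million-element blocks
    result = x.mean().compute()                   # dask handles the chunking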


  21. dask.array
    Out-of-core, parallel, n-dimensional array library
    • Copies the numpy interface
    • Arithmetic: +, *, …
    • Reductions: mean, max, …
    • Slicing: x[10:, 100:50:-2]
    • Fancy indexing: x[:, [3, 1, 2]]
    • Some linear algebra: tensordot, qr, svd, …


  22. dask.array
    Out-of-core, parallel, n-dimensional array library
    • Copies the numpy interface
    • Arithmetic: +, *, …
    • Reductions: mean, max, …
    • Slicing: x[10:, 100:50:-2]
    • Fancy indexing: x[:, [3, 1, 2]]
    • Some linear algebra: tensordot, qr, svd, …
    New operations:
    • Parallel algorithms (approximate quantiles, topk, …)
    • Slightly overlapping arrays
    • Integration with HDF5
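
    A minimal sketch of this interface in action (the array here is random
    data, purely for illustration):

    import dask.array as da

    x = da.random.random((10000, 10000), chunks=(1000, 1000))
    y = x + x.T                # arithmetic
    m = y.mean(axis=0)         # reductions
    s = y[10:, 100:50:-2]      # slicing
    f = y[:, [3, 1, 2]]        # fancy indexing
    print(m.compute())         # nothing runs until .compute()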


  23. Demo


  24. Task Schedulers


  25. Task Schedulers


  26. Task Schedulers


  27. Task Schedulers


  28. Out-of-core arrays
    import dask.array as da
    from netCDF4 import Dataset
    from glob import glob
    from numpy import flipud
    import matplotlib.pyplot as plt

    files = sorted(glob('*.nc'))
    data = [Dataset(f).variables['sst'] for f in files]
    arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
    x = da.concatenate(arrs, axis=0)
    full_mean = x.mean(axis=0)

    plt.imshow(flipud(full_mean), cmap='RdBu_r')
    plt.title('Average Global Ocean Temperature, 1981-2015')


  29. Out-of-core arrays


  30. dask.dataframe
    • Out-of-core, blocked parallel DataFrame
    • Mirrors pandas interface
    • Only implements a subset of pandas operations (currently)


  31. dask.dataframe
    Efficient operations
    • Elementwise operations: df.x + df.y
    • Row-wise selections: df[df.x > 0]
    • Aggregations: df.x.max()
    • groupby-aggregate: df.groupby(df.x).y.max()
    • Value counts: df.x.value_counts()
    • Drop duplicates: df.x.drop_duplicates()
    • Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
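
    A minimal sketch of a few of these operations (the dataframe here is
    made up for illustration):

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({'x': [1, -2, 3, -4, 5], 'y': [10, 20, 30, 40, 50]})
    df = dd.from_pandas(pdf, npartitions=2)

    total = (df.x + df.y).sum()         # elementwise + aggregation
    positive = df[df.x > 0]             # row-wise selection
    max_y = df.groupby(df.x).y.max()    # groupby-aggregate
    print(total.compute())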


  32. dask.dataframe
    Less efficient operations (require shuffle unless on index)
    • Set index: df.set_index(df.x)
    • groupby-apply
    • Join not on the index: dd.merge(df1, df2, on='name')


  33. Out-of-core dataframes
    • Yearly CSVs of all US domestic flights since 1990
    • Contains information on times, airlines, locations, etc…
    • http://www.transtats.bts.gov/Fields.asp?Table_ID=236


  34. Out-of-core dataframes
    >>> import dask.dataframe as dd
    # Create a dataframe from csv files
    >>> df = dd.read_csv('*.csv', usecols=['Origin', 'DepTime', 'CRSDepTime', 'Cancelled'])
    # Get time series of non-cancelled and delayed flights
    >>> not_cancelled = df[df.Cancelled != 1]
    >>> delayed = not_cancelled[not_cancelled.DepTime > not_cancelled.CRSDepTime]
    # Count total and delayed flights per airport
    >>> total_per_airport = not_cancelled.Origin.value_counts()
    >>> delayed_per_airport = delayed.Origin.value_counts()
    # Calculate percent delayed per airport
    >>> percent_delayed = delayed_per_airport/total_per_airport
    # Remove airports with fewer than 10,000 total flights (~500 a year)
    >>> out = percent_delayed[total_per_airport > 10000]


  35. Out-of-core dataframes
    # Convert to pandas dataframe, sort, and output top 10
    >>> result = out.compute()
    >>> result.sort(ascending=False)
    >>> result.head(10)
    ATL 0.538589
    PIT 0.515708
    ORD 0.513163
    PHL 0.508329
    DFW 0.506470
    CLT 0.501259
    DEN 0.474589
    JFK 0.453212
    SFO 0.452156
    CVG 0.452117
    dtype: float64


  36. Out-of-core dataframes
    • 10 GB on disk
    • Need to read only a ~4 GB subset to perform the computation
    • Max memory during computation is only 0.75 GB


  37. • Collections build task graphs
    • Schedulers execute task graphs
    • Graph specification = unifying interface


  38. Dask Specification
    • Dictionary of {name: task}
    • Tasks are tuples of (func, args...) (lispy syntax)
    • Args can be names, values, or tasks
    Python Code:
    a = 1
    b = 2
    x = inc(a)
    y = inc(b)
    z = mul(x, y)
    Dask Graph:
    dsk = {"a": 1,
           "b": 2,
           "x": (inc, "a"),
           "y": (inc, "b"),
           "z": (mul, "x", "y")}


  39. Dask collections fit many problems…
    … but not everything.


  40. Can create graphs directly
    def load(filename):
        ...

    def clean(data):
        ...

    def analyze(sequence_of_data):
        ...

    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    dsk = {'load-1': (load, 'myfile.a.data'),
           'load-2': (load, 'myfile.b.data'),
           'load-3': (load, 'myfile.c.data'),
           'clean-1': (clean, 'load-1'),
           'clean-2': (clean, 'load-2'),
           'clean-3': (clean, 'load-3'),
           'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
           'store': (store, 'analyze')}
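
    Such a hand-built graph runs through the same schedulers as the
    collections (a sketch; 'store' is the key of the final result):

    from dask.threaded import get

    get(dsk, 'store')  # walks the graph, running independent tasks in parallel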


  41. Takeaways


  42. Takeaways
    • Python can still handle large data using blocked
    algorithms


  43. Takeaways
    • Python can still handle large data using blocked
    algorithms
    • Dask collections form task graphs expressing these
    algorithms


  44. Takeaways
    • Python can still handle large data using blocked
    algorithms
    • Dask collections form task graphs expressing these
    algorithms
    • Dask schedulers execute these graphs in parallel


  45. Takeaways
    • Python can still handle large data using blocked
    algorithms
    • Dask collections form task graphs expressing these
    algorithms
    • Dask schedulers execute these graphs in parallel
    • Dask graphs can be directly created for custom pipelines


  46. Questions?
    http://dask.pydata.org
