Dask - Parallelizing NumPy/Pandas Through Task Scheduling

Jim Crist
November 11, 2015

Talk given at PyData NYC 2015 on Dask.

Video: https://www.youtube.com/watch?v=mHd8AI8GQhQ
Materials can be found here: https://github.com/jcrist/Dask_PyData_NYC

Transcript

  1. Dask - Parallelizing NumPy/Pandas
    Through Task Scheduling
    Jim Crist
    Continuum Analytics // github.com/jcrist // @jiminy_crist
    PyData NYC 2015

  2. A Motivating Example

  3. Ocean Temperature Data
    • Daily mean ocean temperature on a 1/4 degree grid
    • One 720 x 1440 array per day
    • http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

  4. One year’s worth
    from netCDF4 import Dataset
    import matplotlib.pyplot as plt
    from numpy import flipud

    # One year fits in memory: load it, average over the time axis, and plot
    data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
    year_mean = data[:].mean(axis=0)
    plt.imshow(flipud(year_mean), cmap="RdBu_r")
    plt.title("Average Global Ocean Temperature, 2015")

  5. 36 years’ worth
    $ ls
    sst.day.mean.1981.v2.nc sst.day.mean.1993.v2.nc sst.day.mean.2005.v2.nc
    sst.day.mean.1982.v2.nc sst.day.mean.1994.v2.nc sst.day.mean.2006.v2.nc
    sst.day.mean.1983.v2.nc sst.day.mean.1995.v2.nc sst.day.mean.2007.v2.nc
    sst.day.mean.1984.v2.nc sst.day.mean.1996.v2.nc sst.day.mean.2008.v2.nc
    sst.day.mean.1985.v2.nc sst.day.mean.1997.v2.nc sst.day.mean.2009.v2.nc
    ... ... ...
    $ du -h
    15G .
    720 x 1440 x 12341 x 4 bytes ≈ 51 GB uncompressed!

  6. Can’t just load this all into NumPy… what now?

  7. Solution: blocked algorithms!

  8. Blocked Algorithms
    Blocked mean
    import h5py
    import numpy as np

    x = h5py.File('myfile.hdf5')['x']         # Trillion element array on disk
    sums = []
    counts = []
    for i in range(1000000):                  # One million times
        chunk = x[1000000*i: 1000000*(i+1)]   # Pull out chunk
        sums.append(np.sum(chunk))            # Sum chunk
        counts.append(len(chunk))             # Count chunk
    result = sum(sums) / sum(counts)          # Aggregate results

  9. Blocked Algorithms

  10. Blocked algorithms allow for
    • parallelism
    • lower RAM usage
    The trick is figuring out how to break the computation into blocks.
    This is where dask comes in.

  11. Dask is:
    • A parallel computing framework
    • That leverages the excellent Python ecosystem
    • Using blocked algorithms and task scheduling
    • Written in pure Python

  12. Dask

  13. dask.array
    • Out-of-core, parallel, n-dimensional array library
    • Copies the numpy interface
    • Arithmetic: +, *, …
    • Reductions: mean, max, …
    • Slicing: x[10:, 100:50:-2]
    • Fancy indexing: x[:, [3, 1, 2]]
    • Some linear algebra: tensordot, qr, svd, …
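
    A minimal sketch of this interface (hypothetical array and chunk sizes;
    dask.array works the same way on on-disk sources such as HDF5 or netCDF
    variables):

    import numpy as np
    import dask.array as da

    # Wrap an in-memory NumPy array in 1000 x 1000 blocks
    x = np.random.random((4000, 4000))
    a = da.from_array(x, chunks=(1000, 1000))

    # Expressions are lazy: this only builds a task graph
    expr = (a + a.T).mean(axis=0)

    # compute() runs the blocked algorithm, in parallel, via a scheduler
    result = expr.compute()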

  14. Demo

  15. Task Schedulers

  16. Task Schedulers

  17. But what about the GIL???

  18. Most Python Programs…

  19. x = np.arange(9000000).reshape((3000, 3000))
    a = da.from_array(x, chunks=(750, 750))
    # NumPy releases the GIL inside its compiled routines, so dask's threaded
    # scheduler can run these blocked operations in parallel in one process
    a.dot(a.T).sum(axis=0).compute()

  20. Out-of-core arrays
    import dask.array as da
    from netCDF4 import Dataset
    from glob import glob
    from numpy import flipud
    import matplotlib.pyplot as plt
    # One dask array per yearly netCDF file, concatenated along the time axis
    files = sorted(glob('*.nc'))
    data = [Dataset(f).variables['sst'] for f in files]
    arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
    x = da.concatenate(arrs, axis=0)

    # Mean over all ~36 years of daily data
    full_mean = x.mean(axis=0)
    plt.imshow(flipud(full_mean), cmap='RdBu_r')
    plt.title('Average Global Ocean Temperature, 1981-2015')

  21. Out-of-core arrays

  22. Parallel, Out-of-core SVD

  23. Parallel, Out-of-core SVD
    Randomized

  24. dask.dataframe
    • Out-of-core, blocked parallel DataFrame
    • Mirrors pandas interface

  25. dask.dataframe
    • Elementwise operations: df.x + df.y
    • Row-wise selections: df[df.x > 0]
    • Aggregations: df.x.max()
    • groupby-aggregate: df.groupby(df.x).y.max()
    • Value counts: df.x.value_counts()
    • Resampling: df.x.resample('d', how='mean')
    • Expanding window: df.x.cumsum()
    • Joins: dd.merge(df1, df2, left_index=True, right_index=True)
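
    A small, self-contained sketch of these operations (hypothetical column
    names, partitioned from an in-memory pandas DataFrame):

    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical data, split into 2 partitions
    pdf = pd.DataFrame({'x': range(10), 'y': range(10, 20)})
    df = dd.from_pandas(pdf, npartitions=2)

    total = (df.x + df.y).sum()           # elementwise + aggregation
    positive = df[df.x > 3]               # row-wise selection
    by_x = df.groupby(df.x).y.max()       # groupby-aggregate

    # Nothing runs until compute()
    print(total.compute())
    print(by_x.compute())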

  26. dask.dataframe
    • From csvs: dd.read_csv('*.csv')
    • From pandas: dd.from_pandas(some_pandas_object)
    • From castra: dd.from_castra('path_to_castra')

  27. Out-of-core dataframes
    • Yearly CSVs of all US flights, 1987-2008
    • Contains information on times, airlines, locations, etc…
    • Roughly 121 million rows, 11 GB of data
    • http://www.transtats.bts.gov/Fields.asp?Table_ID=236

  28. Demo

  29. df.depdelay.groupby(df.origin).mean().nlargest(10).compute()
    # => the ten origin airports with the largest mean departure delay

  30. dask.bag
    • Out-of-core, unordered list
    • toolz + multiprocessing
    • map, filter, reduce, groupby, take, …
    • Good for log files, json blobs, etc…
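
    A tiny sketch of the interface (toy in-memory data; real bags are usually
    built from files or a castra, as on the next slide):

    import dask.bag as db

    # Hypothetical data split across 2 partitions
    b = db.from_sequence(range(10), npartitions=2)

    result = (b.filter(lambda x: x % 2 == 0)   # keep even numbers
               .map(lambda x: x ** 2)          # square them
               .sum()                          # reduce
               .compute())                     # execute in parallel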

  31. Example - Reddit Data
    http://blaze.pydata.org/blog/2015/09/08/reddit-comments/

    import dask.bag as db

    # Create a bag of tuples of (subreddit, body)
    b = db.from_castra('reddit.castra', columns=['subreddit', 'body'],
                       npartitions=8)

    # Filter out comments not in r/MachineLearning
    matches_subreddit = b.filter(lambda x: x[0] == 'MachineLearning')

    # Convert each comment into a list of words, and concatenate
    # (to_words is a small helper defined in the linked blog post)
    words = matches_subreddit.pluck(1).map(to_words).concat()

    # Count the frequencies for each word, and take the top 100
    top_words = words.frequencies().topk(100, key=1).compute()

  32. Example - Reddit Data
    from wordcloud import WordCloud

    # Make a word cloud from the results
    wc = WordCloud()
    wc.generate_from_frequencies(top_words)
    wc.to_image()

  33. • Collections build task graphs
    • Schedulers execute task graphs
    • Graph specification = uniting interface

  34. Dask Specification
    • Dictionary of {name: task}
    • Tasks are tuples of (func, args...) (lispy syntax)
    • Args can be names, values, or tasks

    Python Code:
    a = 1
    b = 2
    x = inc(a)
    y = inc(b)
    z = mul(x, y)

    Dask Graph:
    dsk = {"a": 1,
           "b": 2,
           "x": (inc, "a"),
           "y": (inc, "b"),
           "z": (mul, "x", "y")}

  35. Dask collections fit many problems…
    … but not everything.

  36. Can create graphs directly
    def load(filename):
        ...

    def clean(data):
        ...

    def analyze(sequence_of_data):
        ...

    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    dsk = {'load-1': (load, 'myfile.a.data'),
           'load-2': (load, 'myfile.b.data'),
           'load-3': (load, 'myfile.c.data'),
           'clean-1': (clean, 'load-1'),
           'clean-2': (clean, 'load-2'),
           'clean-3': (clean, 'load-3'),
           'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
           'store': (store, 'analyze')}

  37. Or use dask.imperative
    from dask.imperative import do

    @do
    def load(filename):
        ...

    @do
    def clean(data):
        ...

    @do
    def analyze(sequence_of_data):
        ...

    @do
    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
    loaded = [load(f) for f in files]
    cleaned = [clean(i) for i in loaded]
    analyzed = analyze(cleaned)
    stored = store(analyzed)
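
    These calls only build a task graph; nothing has executed yet. A minimal,
    self-contained sketch of the same pattern (toy functions, assuming the
    dask.imperative API of this release, since renamed dask.delayed):

    from dask.imperative import do

    @do
    def inc(x):
        return x + 1

    @do
    def total(xs):
        return sum(xs)

    # Lazily build the graph, then run it with a dask scheduler
    result = total([inc(i) for i in range(5)])
    print(result.compute())   # 15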

  38. Custom Workflows!
    Can use dask.imperative to compose custom workflows easily!

  39. Distributed???

  40. “For workloads that are processing multi-gigabytes
    rather than terabyte+ scale, a big-memory server
    may well provide better performance per dollar
    than a cluster.”
    “Nobody ever got fired for using Hadoop on a cluster”
    http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf

  41. Not yet, but we’re working on it…
    http://distributed.readthedocs.org/en/latest/

  42. Takeaways
    • Python can still handle large data using blocked algorithms
    • Dask collections form task graphs expressing these algorithms
    • Dask schedulers execute these graphs in parallel
    • Dask graphs can be directly created for custom pipelines

  43. Further Information
    • Docs: http://dask.pydata.org/
    • Examples: https://github.com/blaze/dask-examples
    • Try them online with Binder!
    • Blaze blog: http://blaze.pydata.org/
    • Chat: https://gitter.im/blaze/dask
    • Github: https://github.com/blaze/dask

  44. Questions?
    http://dask.pydata.org
