Dask - Parallelizing NumPy/Pandas Through Task Scheduling

Jim Crist
November 11, 2015

Talk given at PyData NYC 2015 on Dask.

Video: https://www.youtube.com/watch?v=mHd8AI8GQhQ
Materials can be found here: https://github.com/jcrist/Dask_PyData_NYC


Transcript

  1. Dask - Parallelizing NumPy/Pandas Through Task Scheduling
    Jim Crist, Continuum Analytics // github.com/jcrist // @jiminy_crist
    PyData NYC 2015
  2. A Motivating Example

  3. Ocean Temperature Data
    • Daily mean ocean temperature every 1/4 degree
    • 720 x 1440 array every day
    • http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
  4. One year’s worth

    from netCDF4 import Dataset
    import matplotlib.pyplot as plt
    from numpy import flipud

    data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
    year_mean = data[:].mean(axis=0)

    plt.imshow(flipud(year_mean), cmap="RdBu_r")
    plt.title("Average Global Ocean Temperature, 2015")
  5. 36 years’ worth

    $ ls
    sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
    sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
    sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
    sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
    sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
    ...                      ...                      ...
    $ du -h
    15G .

    720 x 1440 x 12341 x 4 = 51 GB uncompressed!
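The uncompressed size quoted on the slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the `x 4` factor means 4-byte (32-bit) values and that "GB" means decimal gigabytes; the variable names are illustrative and do not come from the talk.

```python
# Grid dimensions and day count are taken from the slide.
n_lat, n_lon, n_days = 720, 1440, 12341
bytes_per_value = 4  # assumption: 32-bit values, matching the "x 4" factor

total_bytes = n_lat * n_lon * n_days * bytes_per_value
total_gb = total_bytes / 1e9  # assumption: decimal gigabytes

print(f"{total_bytes:,} bytes = {total_gb:.1f} GB")
```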
  6. Can’t just load this all into Numpy… what now?

  7. Solution: blocked algorithms!

  8. Blocked Algorithms: Blocked mean

    import h5py
    import numpy as np

    x = h5py.File('myfile.hdf5')['x']          # Trillion element array on disk

    sums = []
    counts = []
    for i in range(1000000):                   # One million times
        chunk = x[1000000*i: 1000000*(i+1)]    # Pull out chunk
        sums.append(np.sum(chunk))             # Sum chunk
        counts.append(len(chunk))              # Count chunk

    result = sum(sums) / sum(counts)           # Aggregate results
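The blocked mean can be tried without an HDF5 file by running the same pattern on a small in-memory array. This is only a sketch: the array, the block size of 1,000, and the names are stand-ins chosen for illustration.

```python
import numpy as np

x = np.arange(10_000, dtype="f8")  # stand-in for the large on-disk array
chunk_size = 1_000                 # illustrative block size

sums, counts = [], []
for start in range(0, len(x), chunk_size):
    chunk = x[start:start + chunk_size]  # only this block is touched at a time
    sums.append(chunk.sum())             # partial sum for the block
    counts.append(len(chunk))            # number of elements in the block

# Combining the partial results gives the mean of the whole array.
blocked_mean = sum(sums) / sum(counts)
```

The result agrees with `x.mean()`, because a mean is fully determined by the per-block sums and counts.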
  9. Blocked Algorithms

  10. Blocked algorithms allow for
    • parallelism
    • lower RAM usage

    The trick is figuring out how to break the computation into blocks. This is where dask comes in.
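To illustrate why blocks enable parallelism: each block's partial result is independent of the others, so the blocks can be handed to separate workers. The sketch below uses only the standard library; `chunk_stats` and `blocked_mean` are hypothetical helper names, and this is not how Dask itself is implemented. With pure-Python work, threads mainly show the structure, since real speedups depend on the work releasing the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(chunk):
    """Return (sum, count) for one block."""
    return sum(chunk), len(chunk)


def blocked_mean(data, chunk_size, max_workers=4):
    # Split the input into independent blocks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Each block can be processed by a separate worker.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        stats = list(pool.map(chunk_stats, chunks))
    # Aggregate the per-block partial results.
    total = sum(s for s, _ in stats)
    count = sum(c for _, c in stats)
    return total / count


result = blocked_mean(list(range(1, 101)), chunk_size=10)
```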
  11. Dask is:
    • A parallel computing framework
    • That leverages the excellent Python ecosystem
    • Using blocked algorithms and task scheduling
    • Written in pure Python
  12. Dask

  13. dask.array
    • Out-of-core, parallel, n-dimensional array library
    • Copies the NumPy interface
    • Arithmetic: +, *, …
    • Reductions: mean, max, …
    • Slicing: x[10:, 100:50:-2]
    • Fancy indexing: x[:, [3, 1, 2]]
    • Some linear algebra: tensordot, qr, svd, …
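A small sketch of the operations listed for dask.array, assuming `dask` and `numpy` are installed. The array shape and chunk sizes are arbitrary choices for illustration; results stay lazy until `.compute()` is called.

```python
import numpy as np
import dask.array as da

x = np.arange(24).reshape((4, 6))
a = da.from_array(x, chunks=(2, 3))      # split into a 2 x 2 grid of blocks

total = (a + 1).sum().compute()          # arithmetic followed by a reduction
col_max = a.max(axis=0).compute()        # reduction along one axis
picked = a[:, [3, 1, 2]].compute()       # fancy indexing along one axis
stepped = a[1:, 5:1:-2].compute()        # slicing with a negative step
```

Each result should match what the same expression gives on the NumPy array `x`.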
  14. Demo

  15. Task Schedulers

  16. Task Schedulers

  17. But what about the GIL???

  18. Most Python Programs…

  19. Example

    import numpy as np
    import dask.array as da

    x = np.arange(9000000).reshape((3000, 3000))
    a = da.from_array(x, chunks=(750, 750))
    a.dot(a.T).sum(axis=0).compute()

  20. Out-of-core arrays

    import dask.array as da
    from netCDF4 import Dataset
    from glob import glob
    from numpy import flipud
    import matplotlib.pyplot as plt

    files = sorted(glob('*.nc'))
    data = [Dataset(f).variables['sst'] for f in files]
    arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
    x = da.concatenate(arrs, axis=0)

    full_mean = x.mean(axis=0)

    plt.imshow(flipud(full_mean), cmap='RdBu_r')
    plt.title('Average Global Ocean Temperature, 1981-2015')
  21. Out-of-core arrays

  22. Parallel, Out-of-core SVD

  23. Parallel, Out-of-core SVD Randomized

  24. dask.dataframe • Out-of-core, blocked parallel DataFrame • Mirrors pandas interface

  25. dask.dataframe
    • Elementwise operations: df.x + df.y
    • Row-wise selections: df[df.x > 0]
    • Aggregations: df.x.max()
    • groupby-aggregate: df.groupby(df.x).y.max()
    • Value counts: df.x.value_counts()
    • Resampling: df.x.resample('d', how='mean')
    • Expanding window: df.x.cumsum()
    • Joins: dd.merge(df1, df2, left_index=True, right_index=True)
  26. dask.dataframe
    • From csvs: dd.read_csv('*.csv')
    • From pandas: dd.from_pandas(some_pandas_object)
    • From castra: dd.from_castra('path_to_castra')
  27. Out-of-core dataframes
    • Yearly CSVs of all American flights 1987-2008
    • Contains information on times, airlines, locations, etc…
    • Roughly 121 million rows, 11GB of data
    • http://www.transtats.bts.gov/Fields.asp?Table_ID=236
  28. Demo

  29. df.depdelay.groupby(df.origin).mean().nlargest(10).compute()

  30. dask.bag
    • Out-of-core, unordered list
    • toolz + multiprocessing
    • map, filter, reduce, groupby, take, …
    • Good for log files, json blobs, etc…
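A minimal dask.bag sketch on a handful of made-up log lines, assuming a reasonably recent `dask` is installed. The data is invented for illustration, and the synchronous scheduler is requested explicitly so the example runs in a single process; bags otherwise default to multiprocessing.

```python
import dask.bag as db

lines = ["error disk full", "info started", "error timeout", "info stopped"]
b = db.from_sequence(lines, npartitions=2)

# Keep only the error lines, then split each line into words.
errors = b.filter(lambda line: line.startswith("error"))
words = errors.map(str.split).flatten()   # one flat bag of words

# Count each word and take the most frequent one.
freqs = dict(words.frequencies().compute(scheduler="synchronous"))
top = (words.frequencies()
            .topk(1, key=lambda kv: kv[1])
            .compute(scheduler="synchronous"))
```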
  31. Example - Reddit Data
    http://blaze.pydata.org/blog/2015/09/08/reddit-comments/

    import dask.bag as db

    # Create a bag of tuples of (subreddit, body)
    b = db.from_castra('reddit.castra', columns=['subreddit', 'body'],
                       npartitions=8)

    # Filter out comments not in r/MachineLearning
    matches_subreddit = b.filter(lambda x: x[0] == 'MachineLearning')

    # Convert each comment into a list of words, and concatenate
    words = matches_subreddit.pluck(1).map(to_words).concat()

    # Count the frequencies for each word, and take the top 100
    top_words = words.frequencies().topk(100, key=1).compute()
  32. Example - Reddit Data

    from wordcloud import WordCloud

    # Make a word cloud from the results
    wc = WordCloud()
    wc.generate_from_frequencies(top_words)
    wc.to_image()
  33. • Collections build task graphs
    • Schedulers execute task graphs
    • Graph specification = uniting interface
  34. Dask Specification
    • Dictionary of {name: task}
    • Tasks are tuples of (func, args...) (lispy syntax)
    • Args can be names, values, or tasks

    Python Code        Dask Graph
    a = 1              dsk = {"a": 1,
    b = 2                     "b": 2,
    x = inc(a)                "x": (inc, "a"),
    y = inc(b)                "y": (inc, "b"),
    z = mul(x, y)             "z": (mul, "x", "y")}
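To make the specification concrete, here is a toy, single-threaded evaluator for graphs in this format. It is a hypothetical sketch, not Dask's scheduler: the `get` helper below is written for illustration, handles only string keys, and recomputes shared dependencies instead of caching them.

```python
from operator import mul


def inc(x):
    return x + 1


dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}


def get(dsk, key):
    """Toy evaluator: resolve names, run tasks, pass plain values through."""
    def evaluate(arg):
        # A task is a tuple whose first element is callable.
        if isinstance(arg, tuple) and arg and callable(arg[0]):
            func, *args = arg
            return func(*(evaluate(a) for a in args))
        # Lists of arguments are evaluated element by element.
        if isinstance(arg, list):
            return [evaluate(a) for a in arg]
        # A name refers to another entry in the graph.
        if isinstance(arg, str) and arg in dsk:
            return evaluate(dsk[arg])
        # Anything else is a literal value.
        return arg
    return evaluate(key)


result = get(dsk, "z")  # (1 + 1) * (2 + 1)
```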
  35. Dask collections fit many problems… … but not everything.

  36. Can create graphs directly

    def load(filename):
        ...

    def clean(data):
        ...

    def analyze(sequence_of_data):
        ...

    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    dsk = {'load-1': (load, 'myfile.a.data'),
           'load-2': (load, 'myfile.b.data'),
           'load-3': (load, 'myfile.c.data'),
           'clean-1': (clean, 'load-1'),
           'clean-2': (clean, 'load-2'),
           'clean-3': (clean, 'load-3'),
           'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
           'store': (store, 'analyze')}
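A hand-written graph like this one can be passed to a scheduler's `get` function. The sketch below assumes `dask` is installed and uses the threaded scheduler; the function bodies are placeholders invented for illustration (the slide elides them), and the `store` step is left out so nothing is written to disk.

```python
from dask.threaded import get


def load(filename):
    return "raw:" + filename        # placeholder: pretend to read a file


def clean(data):
    return data.upper()             # placeholder cleaning step


def analyze(sequence_of_data):
    return sorted(sequence_of_data)  # placeholder analysis step


dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]])}

# Ask the scheduler for one key; it runs only the tasks that key depends on.
result = get(dsk, 'analyze')
```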
  37. Or use dask.imperative

    from dask.imperative import do

    @do
    def load(filename):
        ...

    @do
    def clean(data):
        ...

    @do
    def analyze(sequence_of_data):
        ...

    @do
    def store(result):
        with open(..., 'w') as f:
            f.write(result)

    files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
    loaded = [load(f) for f in files]
    cleaned = [clean(i) for i in loaded]
    analyzed = analyze(cleaned)
    stored = store(analyzed)
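In later Dask releases the same idea is exposed as `dask.delayed`. The following is a sketch of the equivalent workflow under that newer name, assuming a recent `dask` is installed; the function bodies are placeholders invented for illustration and the `store` step is omitted.

```python
from dask import delayed


@delayed
def load(filename):
    return "raw:" + filename   # placeholder: pretend to read a file


@delayed
def clean(data):
    return data.upper()        # placeholder cleaning step


@delayed
def analyze(sequence_of_data):
    return len(sequence_of_data)  # placeholder analysis step


files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
loaded = [load(f) for f in files]       # nothing runs yet
cleaned = [clean(d) for d in loaded]    # still only building the graph
analyzed = analyze(cleaned)

result = analyzed.compute()             # executes the whole graph
```

Calling the decorated functions only records tasks; the work happens when `.compute()` is called on the final object.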
  38. Custom Workflows! Can use dask.imperative to compose custom workflows easily!

  39. Distributed???

  40. “For workloads that are processing multi-gigabytes rather than terabyte+ scale, a big-memory server may well provide better performance per dollar than a cluster.”
    From “Nobody ever got fired for using Hadoop on a cluster”
    http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
  41. Not yet, but we’re working on it… http://distributed.readthedocs.org/en/latest/

  42. Takeaways
    • Python can still handle large data using blocked algorithms
    • Dask collections form task graphs expressing these algorithms
    • Dask schedulers execute these graphs in parallel
    • Dask graphs can be directly created for custom pipelines
  43. Further Information
    • Docs: http://dask.pydata.org/
    • Examples: https://github.com/blaze/dask-examples
    • Try them online with Binder!
    • Blaze blog: http://blaze.pydata.org/
    • Chat: https://gitter.im/blaze/dask
    • Github: https://github.com/blaze/dask
  44. Questions? http://dask.pydata.org