Jim Crist
July 08, 2015

Talk given at SciPy 2015.
Video: https://youtu.be/1kkFZ4P-XHg

Dask Array implements the NumPy ndarray interface using blocked algorithms, cutting the large array up into many small arrays. This lets us compute on arrays larger than memory using all of our cores. In this talk we describe dask, dask.array, and dask.dataframe, as well as task scheduling generally.


## Transcript

Out-of-core Numpy/Pandas
Jim Crist
[email protected]

2. A Motivating Example

3. Ocean Temperature Data
• Daily mean ocean temperature every 1/4 degree
• 720 x 1440 array every day
• http://www.esrl.noaa.gov/psd/data/gridded/
data.noaa.oisst.v2.highres.html

4. One year’s worth
from netCDF4 import Dataset
import matplotlib.pyplot as plt
from numpy import flipud
data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
year_mean = data[:].mean(axis=0)
plt.imshow(flipud(year_mean), cmap="RdBu_r")
plt.title("Average Global Ocean Temperature, 2015")

5. 36 years’ worth
\$ ls
sst.day.mean.1981.v2.nc sst.day.mean.1993.v2.nc sst.day.mean.2005.v2.nc
sst.day.mean.1982.v2.nc sst.day.mean.1994.v2.nc sst.day.mean.2006.v2.nc
sst.day.mean.1983.v2.nc sst.day.mean.1995.v2.nc sst.day.mean.2007.v2.nc
sst.day.mean.1984.v2.nc sst.day.mean.1996.v2.nc sst.day.mean.2008.v2.nc
sst.day.mean.1985.v2.nc sst.day.mean.1997.v2.nc sst.day.mean.2009.v2.nc
... ... ...
\$ du -h
15G .

6. 36 years’ worth
720 x 1440 x 12341 x 4 = 51 GB uncompressed!

7. Can’t just load this all into Numpy… what now?

8. Solution: blocked algorithms!

9. Blocked Algorithms
Blocked mean
import h5py
import numpy as np

x = h5py.File('myfile.hdf5')['x']        # Trillion element array on disk
sums = []
counts = []
for i in range(1000000):                 # One million times
    chunk = x[1000000*i: 1000000*(i+1)]  # Pull out chunk
    sums.append(np.sum(chunk))           # Sum chunk
    counts.append(len(chunk))            # Count chunk
result = sum(sums) / sum(counts)         # Aggregate results
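As a sanity check, the same blocked pattern can be run on a small in-memory array — a minimal sketch in plain NumPy with toy sizes (the trillion-element file and million-element chunks above are the slide’s, not these):

```python
import numpy as np

x = np.arange(10_000, dtype="float64")  # stands in for the on-disk array
chunk_size = 1_000

sums, counts = [], []
for i in range(len(x) // chunk_size):
    chunk = x[chunk_size * i: chunk_size * (i + 1)]  # pull out one block
    sums.append(np.sum(chunk))                       # reduce the block
    counts.append(len(chunk))                        # count the block

blocked_mean = sum(sums) / sum(counts)               # aggregate results
```

The result matches `x.mean()` exactly, because a mean decomposes cleanly into per-block sums and counts.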

10. Blocked Algorithms


13. Blocked algorithms allow for
• parallelism
• lower ram usage
The trick is figuring out how to
break the computation into blocks.
This is where dask comes in.

• A parallel computing framework
• That leverages the excellent python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure python

Out-of-core, parallel, n-dimensional array library
• Copies the numpy interface
• Arithmetic: +, *, …
• Reductions: mean, max, …
• Slicing: x[10:, 100:50:-2]
• Fancy indexing: x[:, [3, 1, 2]]
• Some linear algebra: tensordot, qr, svd, …
• New operations
• Parallel algorithms (approximate quantiles, topk, …)
• Slightly overlapping arrays
• Integration with HDF5

23. Demo

28. Out-of-core arrays
from netCDF4 import Dataset
from glob import glob
from numpy import flipud
import dask.array as da
import matplotlib.pyplot as plt

files = sorted(glob('*.nc'))
data = [Dataset(f).variables['sst'] for f in files]
arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
x = da.concatenate(arrs, axis=0)
full_mean = x.mean(axis=0)
plt.imshow(flipud(full_mean), cmap='RdBu_r')
plt.title('Average Global Ocean Temperature, 1981-2015')

29. Out-of-core arrays

• Out-of-core, blocked parallel DataFrame
• Mirrors pandas interface
• Only implements a subset of pandas operations (currently)

Efficient operations
• Elementwise operations: df.x + df.y
• Row-wise selections: df[df.x > 0]
• Aggregations: df.x.max()
• groupby-aggregate: df.groupby(df.x).y.max()
• Value counts: df.x.value_counts()
• Drop duplicates: df.x.drop_duplicates()
• Join on index: dd.merge(df1, df2, left_index=True, right_index=True)

Less efficient operations (require shuffle unless on index)
• Set index: df.set_index(df.x)
• groupby-apply
• Join not on the index: pd.merge(df1, df2, on='name')

33. Out-of-core dataframes
• Yearly CSVs of all US flights since 1990
• Contains information on times, airlines, locations, etc…
• http://www.transtats.bts.gov/Fields.asp?Table_ID=236

34. Out-of-core dataframes
# Create a dataframe from csv files
>>> df = dd.read_csv('*.csv', usecols=['Origin', 'DepTime', 'CRSDepTime', 'Cancelled'])
# Get time series of non-cancelled and delayed flights
>>> not_cancelled = df[df.Cancelled != 1]
>>> delayed = not_cancelled[not_cancelled.DepTime > not_cancelled.CRSDepTime]
# Count total and delayed flights per airport
>>> total_per_airport = not_cancelled.Origin.value_counts()
>>> delayed_per_airport = delayed.Origin.value_counts()
# Calculate percent delayed per airport
>>> percent_delayed = delayed_per_airport/total_per_airport
# Remove airports with fewer than 10,000 total flights (less than ~500 a year on average)
>>> out = percent_delayed[total_per_airport > 10000]

35. Out-of-core dataframes
# Convert to pandas dataframe, sort, and output top 10
>>> result = out.compute()
>>> result.sort(ascending=False)
ATL 0.538589
PIT 0.515708
ORD 0.513163
PHL 0.508329
DFW 0.506470
CLT 0.501259
DEN 0.474589
JFK 0.453212
SFO 0.452156
CVG 0.452117
dtype: float64

36. Out-of-core dataframes
• 10 GB on disk
• Need to read ~4 GB subset to perform computation
• Max memory during computation is only 0.75 GB

37. • Collections build task graphs
• Graph specification = uniting interface

• Tasks are tuples of (func, args...) (lispy syntax)
• Args can be names, values, or tasks
a = 1
b = 2
x = inc(a)
y = inc(b)
z = mul(x, y)
dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}
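A graph in this format can be executed with a few lines of pure Python — a toy sequential scheduler, not dask’s actual implementation (which adds parallelism, caching, and memory management):

```python
def inc(i):
    return i + 1

def mul(a, b):
    return a * b

dsk = {"a": 1,
       "b": 2,
       "x": (inc, "a"),
       "y": (inc, "b"),
       "z": (mul, "x", "y")}

def get(dsk, key):
    """Recursively evaluate one key of a task graph."""
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, args = task[0], task[1:]
        # An argument that names another key is evaluated first
        return func(*(get(dsk, a) if isinstance(a, str) and a in dsk else a
                      for a in args))
    return task  # a plain value, not a task
```

Here `get(dsk, "z")` evaluates to `mul(inc(1), inc(2)) == 6`. Real schedulers topologically sort the graph instead of recursing, so independent tasks like `"x"` and `"y"` can run in parallel.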

39. Dask collections fit many problems…
… but not everything.

40. Can create graphs directly
...

def clean(data):
    ...

def analyze(sequence_of_data):
    ...

def store(result):
    with open(..., 'w') as f:
        f.write(result)

dsk = {...,  # earlier tasks elided in the transcript
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

44. Takeaways
• Python can still handle large data using blocked algorithms