Slide 1

Slide 1 text

Labeled arrays and datasets in Python
Stephan Hoyer (@shoyer)
Data Structures for Data Science Workshop
UC Berkeley BIDS, 18 September 2015

Slide 2

Slide 2 text

pandas DataFrames organize messy data
[Figure label: the index]
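As a minimal sketch of what the index buys you, the example below builds a small DataFrame and selects rows by label. The station names and values are made up for illustration; they are not from the talk.

```python
import pandas as pd

# Hypothetical "messy" table: mixed column types and one missing value.
df = pd.DataFrame(
    {
        "city": ["Berkeley", "Oakland", "Richmond"],
        "temperature": [18.5, None, 16.2],
    },
    index=pd.Index(["st01", "st02", "st03"], name="station"),
)

# The index lets us pick rows by label rather than by integer position.
one_station = df.loc["st03"]
two_stations = df.loc[["st01", "st03"]]
```

Label-based lookups like `df.loc["st03"]` keep working if rows are reordered or inserted, which positional lookups such as `df.iloc[2]` do not.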

Slide 3

Slide 3 text

xray’s Dataset is a multi-dimensional DataFrame
[Figure labels: time, longitude, latitude, land_cover, elevation]
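A Dataset with the variables named on this slide could be assembled roughly as follows. This is a sketch under assumptions: it imports `xarray` (the name the xray package was later published under), and the grid sizes and random values are invented for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr  # assumption: xarray, the later name of the xray package

rng = np.random.default_rng(0)
times = pd.date_range("2015-01-01", periods=10)
lats = np.linspace(50.0, 32.5, 8)
lons = np.linspace(-105.0, -87.5, 8)

ds = xr.Dataset(
    # Data variables vary along all three dimensions.
    data_vars={
        "temperature": (
            ("time", "longitude", "latitude"),
            rng.normal(15.0, 5.0, size=(10, 8, 8)),
        ),
    },
    # Coordinates label the dimensions; some are themselves two-dimensional.
    coords={
        "time": times,
        "latitude": lats,
        "longitude": lons,
        "elevation": (("longitude", "latitude"), rng.integers(0, 2000, size=(8, 8))),
        "land_cover": (
            ("longitude", "latitude"),
            rng.choice(["forest", "urban", "farmland"], size=(8, 8)),
        ),
    },
)

# Select the grid point nearest to a latitude/longitude pair, by label.
point = ds.sel(latitude=45.0, longitude=-100.0, method="nearest")
```

After the selection, `point["temperature"]` is a one-dimensional time series, and the `elevation` and `land_cover` values for that grid cell travel along with it.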

Slide 4

Slide 4 text

Labeled doesn’t (necessarily) mean slow
hash table + array
with lots of Cython
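The "hash table + array" idea can be sketched in a few lines of plain Python. The `LabeledVector` class below is a hypothetical toy written for this illustration; it is not how pandas implements its index, which does this work in Cython.

```python
import numpy as np


class LabeledVector:
    """Toy 1-D labeled array: a dict (hash table) plus a NumPy array."""

    def __init__(self, labels, values):
        self.values = np.asarray(values)
        # Hash table mapping each label to its integer position.
        self._positions = {label: i for i, label in enumerate(labels)}

    def loc(self, label):
        # One hash lookup, then a plain array access.
        return self.values[self._positions[label]]

    def loc_many(self, labels):
        # Translate labels to positions, then index the array in one step.
        positions = [self._positions[label] for label in labels]
        return self.values[positions]


vec = LabeledVector(["a", "b", "c"], [10.0, 20.0, 30.0])
single = vec.loc("b")
several = vec.loc_many(["c", "a"])
```

The label lookup costs one hash-table access per label, so the data itself can stay in a contiguous array and keep NumPy's speed for everything else.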

Slide 5

Slide 5 text

Label-based selection and aggregation make for more reliable and maintainable code
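To make the claim concrete, the sketch below compares a positional NumPy aggregation with the equivalent label-based one. It assumes the `xarray` package (the later name of xray); the array contents and labels are invented.

```python
import numpy as np
import pandas as pd
import xarray as xr  # assumption: xarray, the later name of the xray package

data = np.arange(12.0).reshape(3, 4)
temperature = xr.DataArray(
    data,
    dims=("time", "space"),
    coords={
        "time": pd.date_range("2015-01-01", periods=3),
        "space": ["a", "b", "c", "d"],
    },
)

# Positional: the reader has to know that axis 0 means time.
positional_mean = data.mean(axis=0)

# Label-based: the intent is spelled out in the code.
labeled_mean = temperature.mean(dim="time")

# The label-based version gives the same answer after the axes are reordered.
transposed_mean = temperature.transpose("space", "time").mean(dim="time")

# Selection by label instead of by integer position.
site_b = temperature.sel(space="b")
```

If someone later transposes the underlying array, `data.mean(axis=0)` silently averages over the wrong axis, while `mean(dim="time")` is unaffected.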

Slide 6

Slide 6 text

Xray vectorizes by dimension names, not order
[Figure: additions of arrays whose time and space dimensions appear in different orders]
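A small sketch of broadcasting by dimension name, again assuming the `xarray` package (the later name of xray) and made-up shapes:

```python
import numpy as np
import xarray as xr  # assumption: xarray, the later name of the xray package

by_time_space = xr.DataArray(np.ones((3, 2)), dims=("time", "space"))
by_space_time = xr.DataArray(np.ones((2, 3)), dims=("space", "time"))
by_space = xr.DataArray(np.arange(2.0), dims="space")

# Dimensions are matched by name, so differing axis order is not a problem.
total = by_time_space + by_space_time

# A 1-D array over "space" is broadcast along "time" automatically.
shifted = by_time_space + by_space

# The same shapes in plain NumPy do not broadcast against each other.
try:
    np.ones((3, 2)) + np.ones((2, 3))
    numpy_broadcast_ok = True
except ValueError:
    numpy_broadcast_ok = False
```

With plain NumPy the caller would have to transpose one operand by hand before adding.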

Slide 7

Slide 7 text

Indexes enable label-based alignment
[Figure: (d e f g h i) + (a b c d e f g) = (d e f g)]

Slide 8

Slide 8 text

People need incentives to preserve metadata
>>> sst.plot()
Metadata enables easy & reliable computation

Slide 9

Slide 9 text

The ecosystem needs duck-typed ndarrays
[Figure labels: Dask; Xray’s current backends; Future(?) backends]
What’s the generic version of np.concatenate?
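One way to picture the question is a small dispatch layer that picks a concatenate implementation from the type of the inputs. Everything below (`register_concatenate`, `generic_concatenate`, `WrappedArray`) is hypothetical and written only to illustrate the problem; it is not an API of NumPy, Dask, or xray.

```python
import numpy as np

# Hypothetical registry: array type -> function that concatenates that type.
_CONCATENATE_IMPLS = {np.ndarray: np.concatenate}


def register_concatenate(array_type, func):
    """Register a concatenate implementation for a duck-typed array class."""
    _CONCATENATE_IMPLS[array_type] = func


def generic_concatenate(arrays, axis=0):
    """Dispatch on the type of the first array in the sequence."""
    arrays = list(arrays)
    array_type = type(arrays[0])
    for known_type, func in _CONCATENATE_IMPLS.items():
        if issubclass(array_type, known_type):
            return func(arrays, axis=axis)
    raise TypeError(f"no concatenate registered for {array_type.__name__}")


class WrappedArray:
    """Toy duck-typed array that stores its data in a NumPy array."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.shape = self.data.shape


def _concatenate_wrapped(arrays, axis=0):
    return WrappedArray(np.concatenate([a.data for a in arrays], axis=axis))


register_concatenate(WrappedArray, _concatenate_wrapped)

plain = generic_concatenate([np.zeros(2), np.ones(3)])
wrapped = generic_concatenate([WrappedArray([1, 2]), WrappedArray([3])])
```

The sketch also shows why the question is hard: `np.concatenate` takes a sequence of arrays, so there is no single argument whose type obviously decides which implementation should run.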

Slide 10

Slide 10 text

We also need better data types
  Categorical
  Dates & times
  Missing data
  Physical Units
We should be able to write new dtypes in Python

Slide 11

Slide 11 text

Backup

Slide 12

Slide 12 text

Method chaining pipelines explicitly show data flow
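A short pandas pipeline in this style might look like the sketch below; the column names and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "station": ["a", "a", "b", "b"],
        "temperature": [10.0, None, 14.0, 16.0],  # degrees Celsius
    }
)

# Each step takes the previous result and returns a new object,
# so the data flow reads top to bottom.
result = (
    df
    .dropna(subset=["temperature"])
    .assign(temperature_f=lambda d: d["temperature"] * 9 / 5 + 32)
    .groupby("station")["temperature_f"]
    .mean()
)
```

The same steps written with intermediate variables would work too; the chained form avoids naming throwaway intermediates and makes the order of operations explicit.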

Slide 13

Slide 13 text

Indexes enable label-based alignment
[Figure: (d e f g h i) + (a b c d e f g)]
xray style (intersection): d e f g
pandas style (union): a b c d e f g h i
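The two alignment styles can be reproduced with the labels from the figure. This sketch assumes the `xarray` package (the later name of xray); the dimension name `x` and the all-ones values are made up.

```python
import pandas as pd
import xarray as xr  # assumption: xarray, the later name of the xray package

left_labels = list("defghi")
right_labels = list("abcdefg")

# pandas style: the result covers the union of labels, with NaN
# wherever only one operand has a value.
left_series = pd.Series(1.0, index=left_labels)
right_series = pd.Series(1.0, index=right_labels)
pandas_sum = left_series + right_series

# xray style: the result keeps only the labels present in both operands.
left_array = xr.DataArray([1.0] * 6, dims="x", coords={"x": left_labels})
right_array = xr.DataArray([1.0] * 7, dims="x", coords={"x": right_labels})
xray_sum = left_array + right_array
```

Both results contain the same four valid sums; the pandas result additionally carries five NaN entries for the labels that appear in only one operand.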

Slide 14

Slide 14 text

Lots of other features for easy data analysis

Feature              Pandas                           Xray
IO                   read_csv(), .to_csv()            open_dataset(), .to_netcdf()
Split-apply-combine  .groupby(...).mean()             .groupby(...).mean()
Plotting             .plot()                          .plot()
Missing values       .isnull(), .fillna(), .dropna()  .isnull(), .fillna(), .dropna()
Convert to NumPy     .values                          .values
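A few of the table's entries can be exercised in one short sketch. It uses an in-memory CSV string so that no files are needed, and it assumes the `xarray` package (the later name of xray) for the groupby example; all data values are invented.

```python
import io

import pandas as pd
import xarray as xr  # assumption: xarray, the later name of the xray package

# IO (pandas): read CSV text, then write it back out.
csv_text = "station,temperature\na,10.0\na,\nb,14.0\nb,16.0\n"
df = pd.read_csv(io.StringIO(csv_text))
round_trip = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# Convert to NumPy: .values exposes the underlying array.
as_numpy = df["temperature"].values

# Split-apply-combine (xray): group daily values by calendar month.
daily = xr.DataArray(
    [1.0, 2.0, 3.0, 4.0],
    dims="time",
    coords={"time": pd.date_range("2015-01-30", periods=4)},
)
monthly = daily.groupby("time.month").mean()
```

The four dates span the end of January and the start of February, so the grouped result has one mean per month.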

Slide 15

Slide 15 text

Printing a Dataset shows its contents
>>> ds
Dimensions:      (time: 10, latitude: 8, longitude: 8)
Coordinates:
  * time         (time) datetime64 2015-01-01 2015-01-02 2015-01-03 2015-01-04 ...
  * latitude     (latitude) float64 50.0 47.5 45.0 42.5 40.0 37.5 35.0 32.5
  * longitude    (longitude) float64 -105.0 -102.5 -100.0 -97.5 -95.0 -92.5 ...
    elevation    (longitude, latitude) int64 201 231 582 239 1848 1004 1004 ...
    land_cover   (longitude, latitude) object 'forest' 'urban' 'farmland'...
Data variables:
    temperature  (time, longitude, latitude) float64 13.7 8.031 18.36 24.95 ...
    pressure     (time, longitude, latitude) float64 1.374 1.142 1.388 0.9992 ...

Slide 16

Slide 16 text

Labeled data knows how to plot itself
ds.air_temperature.plot()
df.plot()

Slide 17

Slide 17 text

Xray and pandas make missing data (mostly) easy
  Select it: .isnull()
  Skip it: .mean()
  Drop it: .dropna()
  Fill it: .fillna()
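The four operations can be tried on a small pandas Series, as in the sketch below (the labels and values are invented). Xray's arrays offer methods with the same names, though this example sticks to pandas.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

missing_mask = s.isnull()   # select it: boolean mask of missing entries
mean_value = s.mean()       # skip it: NaN is ignored by default
without_gaps = s.dropna()   # drop it: remove missing entries
filled = s.fillna(0.0)      # fill it: replace missing entries with a value
```

None of these calls modify `s` in place; each returns a new object.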