Slide 1

Slide 1 text

Xarray explicit/flexible indexes
Benoît Bovy
Dask Distributed Summit (May 2021)

Slide 2

Slide 2 text

Xarray data model

Slide 3

Slide 3 text

Some (many) of Xarray’s core features…

Slide 4

Slide 4 text

Label-based data selection
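A minimal sketch of label-based selection, assuming a small made-up `temperature` array (the variable name, dates and values are illustrative and not taken from the slides). It uses `DataArray.sel` with exact labels, a nearest-label lookup and a label slice.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative data: 3 daily time steps x 4 longitudes.
temperature = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("time", "lon"),
    coords={
        "time": pd.date_range("2021-05-01", periods=3),
        "lon": [0.0, 10.0, 20.0, 30.0],
    },
    name="temperature",
)

# Select by coordinate labels instead of integer positions.
exact = temperature.sel(time="2021-05-02", lon=20.0)

# Inexact lookup: pick the closest existing longitude label.
nearest = temperature.sel(lon=12.0, method="nearest")

# Label slices include both end points.
window = temperature.sel(lon=slice(10.0, 20.0))
```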

Slide 5

Slide 5 text

Automatic alignment
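A small illustration of automatic alignment, assuming two made-up arrays `a` and `b` whose `x` labels only partly overlap. Arithmetic aligns on labels (an inner join by default), and `xr.align` makes the join explicit.

```python
import xarray as xr

# Two arrays with partially overlapping "x" labels (illustrative values).
a = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray([10.0, 20.0, 30.0], dims="x", coords={"x": [1, 2, 3]})

# Arithmetic aligns on labels first; only the shared labels 1 and 2 remain.
total = a + b

# An explicit outer join keeps every label and fills the gaps with NaN.
a_outer, b_outer = xr.align(a, b, join="outer")
```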

Slide 6

Slide 6 text

Group-by, interpolate, resample, differentiate, integrate…
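The operations listed above can be sketched on two tiny made-up arrays (`series` and `profile` are illustrative names, not from the talk). Each call uses the coordinate labels rather than integer positions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative daily time series.
series = xr.DataArray(
    np.arange(6.0),
    dims="time",
    coords={"time": pd.date_range("2021-01-01", periods=6, freq="D")},
)
by_month = series.groupby("time.month").mean()  # group-by on a datetime component
two_day = series.resample(time="2D").sum()      # resample into 2-day bins

# Illustrative profile along a numeric coordinate.
profile = xr.DataArray(
    [0.0, 1.0, 4.0, 9.0], dims="x", coords={"x": [0.0, 1.0, 2.0, 3.0]}
)
midpoint = profile.interp(x=1.5)   # linear interpolation at a new label
slope = profile.differentiate("x") # finite differences along x
area = profile.integrate("x")      # trapezoidal rule along x
```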

Slide 7

Slide 7 text

… (internally) rely on pandas.Index
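To make the reliance on pandas visible, the sketch below inspects the `indexes` property of a small made-up dataset and repeats a nearest-label lookup directly on the underlying `pandas.Index`. This is only an approximation of what happens internally, not a description of Xarray's actual code path.

```python
import pandas as pd
import xarray as xr

# Illustrative dataset with two dimension coordinates.
ds = xr.Dataset(
    {"t": (("time", "lat"), [[1.0, 2.0], [3.0, 4.0]])},
    coords={
        "time": pd.date_range("2021-05-01", periods=2),
        "lat": [10.0, 20.0],
    },
)

# Each dimension coordinate is backed by a pandas index object.
time_index = ds.indexes["time"]
lat_index = ds.indexes["lat"]

# Roughly what a nearest-label selection on "lat" boils down to.
position = lat_index.get_indexer([12.0], method="nearest")
```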

Slide 8

Slide 8 text

Works well but has some limitations…

Slide 9

Slide 9 text

Irregular data X

Slide 10

Slide 10 text

Irregular data X
Only one index per dimension

Slide 11

Slide 11 text

Irregular data X
pd.MultiIndex does not support method='nearest'

Slide 12

Slide 12 text

Large size dimension(s)

Slide 13

Slide 13 text

Large size dimension(s)
Unlike Dask DataFrame, no index partitioning in Xarray

Slide 14

Slide 14 text

Other (domain-specific) cases
• Geospatial data: Coordinate Reference System (CRS)
  pd.Index does not make any assumption about CRS
• Staggered grids (cell centers vs. cell edges)
• Any other case? Let’s hear from you!
(src: http://thevisualroom.com) (src: QGIS documentation)

Slide 15

Slide 15 text

Workarounds

Slide 16

Slide 16 text

“Hacking” Xarray
Use extra coordinates, attributes and/or Dataset (DataArray) accessors to store some objects…
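A minimal sketch of the accessor workaround, assuming a hypothetical accessor named `kdtree_demo` that stores a SciPy k-d tree next to a dataset. The accessor name, its methods and the sample points are invented for illustration; only `xr.register_dataset_accessor` and `scipy.spatial.cKDTree` are real APIs.

```python
import numpy as np
import xarray as xr
from scipy.spatial import cKDTree


@xr.register_dataset_accessor("kdtree_demo")  # hypothetical accessor name
class KDTreeDemoAccessor:
    """Keeps a k-d tree alongside a Dataset (illustration only)."""

    def __init__(self, ds):
        self._ds = ds
        self._tree = None

    def build(self, x="x", y="y"):
        # Stack the two coordinates into an (n_points, 2) array.
        points = np.column_stack(
            [self._ds[x].values.ravel(), self._ds[y].values.ravel()]
        )
        self._tree = cKDTree(points)
        return self

    def nearest(self, qx, qy):
        if self._tree is None:
            raise RuntimeError("call build() first")
        # Return the flat positions of the nearest stored points.
        _, pos = self._tree.query(np.column_stack([qx, qy]))
        return pos


ds = xr.Dataset(
    coords={"x": ("points", [0.0, 1.0, 5.0]), "y": ("points", [0.0, 1.0, 5.0])}
)
ds.kdtree_demo.build()
pos = ds.kdtree_demo.nearest([0.9], [1.2])
```

The tree lives on the accessor instance attached to this particular `ds` object, so a dataset derived from it (for example after a selection) starts without it. That fragility is part of why this counts as a workaround.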

Slide 17

Slide 17 text

Explicit, flexible indexes! (roadmap, ongoing, CZI grant)

Slide 18

Slide 18 text

“Explicit” indexes
Make indexes 1st-class citizens of the Xarray data model
Indexes:
lat: Float64Index
lon: Float64Index
time: DatetimeIndex

Slide 19

Slide 19 text

“Flexible” indexes
Provide xarray.Index API (data selection, alignment… more?) + extension mechanism (e.g., entrypoints)
Indexes:
x, y: KDTreeIndex
An index may be built from several coordinates (possibly also from multi-dimension coordinates and/or coordinates with different dimensions)

Slide 20

Slide 20 text

More info
pydata/xarray: design_notes/flexible_indexes_notes.md
Projects -> Explicit Indexes

Slide 21

Slide 21 text

Use case: Point-wise selection of irregular data
xarray-contrib/xoak

Slide 22

Slide 22 text

An example: 2D irregular mesh with lat/lon coordinates

Slide 23

Slide 23 text

Select nearest-neighbors (1D query point dataset)
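The idea behind this kind of point-wise selection can be sketched without xoak, using a SciPy k-d tree plus Xarray's vectorized `isel`. This is not xoak's implementation. The mesh values, the `field` variable and the `points` dimension name are made up, and Euclidean distance in degrees is a simplification that ignores the sphere.

```python
import numpy as np
import xarray as xr
from scipy.spatial import cKDTree

# Illustrative 2D irregular mesh: lat/lon are 2D coordinates on (y, x).
lat = np.array([[0.0, 0.1], [1.0, 1.2], [2.1, 2.0]])
lon = np.array([[10.0, 11.0], [10.2, 11.1], [10.1, 11.3]])
ds = xr.Dataset(
    {"field": (("y", "x"), np.arange(6.0).reshape(3, 2))},
    coords={"lat": (("y", "x"), lat), "lon": (("y", "x"), lon)},
)

# Build the tree from the flattened mesh points.
tree = cKDTree(np.column_stack([ds.lat.values.ravel(), ds.lon.values.ravel()]))

# 1D set of query points.
query_lat = np.array([1.1, 2.0])
query_lon = np.array([11.0, 10.0])
_, flat_pos = tree.query(np.column_stack([query_lat, query_lon]))

# Convert flat positions back to (y, x) positions on the mesh.
iy, ix = np.unravel_index(flat_pos, ds.lat.shape)

# Indexers sharing the new "points" dimension give point-wise selection.
selected = ds.isel(
    y=xr.DataArray(iy, dims="points"),
    x=xr.DataArray(ix, dims="points"),
)
```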

Slide 24

Slide 24 text

Select nearest-neighbors (1D query point dataset)
(this won’t be needed anymore with Xarray flexible indexes)

Slide 25

Slide 25 text

Select nearest-neighbors (2D query point dataset)
Xarray advanced indexing is really powerful!
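A sketch of why advanced indexing helps here: when the indexers passed to `isel` are 2D arrays, the result takes their shape and dimension names. As before, this is a SciPy-based illustration rather than xoak's code, and the mesh, the `field` variable and the `track`/`sample` dimension names are invented.

```python
import numpy as np
import xarray as xr
from scipy.spatial import cKDTree

# Same kind of illustrative irregular mesh, with field values 0..5.
lat = np.array([[0.0, 0.1], [1.0, 1.2], [2.1, 2.0]])
lon = np.array([[10.0, 11.0], [10.2, 11.1], [10.1, 11.3]])
ds = xr.Dataset(
    {"field": (("y", "x"), np.arange(6.0).reshape(3, 2))},
    coords={"lat": (("y", "x"), lat), "lon": (("y", "x"), lon)},
)
tree = cKDTree(np.column_stack([lat.ravel(), lon.ravel()]))

# 2D set of query points, e.g. 2 tracks x 2 samples.
query_lat = np.array([[0.0, 0.1], [2.0, 1.1]])
query_lon = np.array([[10.1, 10.9], [11.2, 10.3]])

# The tree accepts any leading shape; the result has shape (2, 2).
_, flat_pos = tree.query(np.stack([query_lat, query_lon], axis=-1))
iy, ix = np.unravel_index(flat_pos, lat.shape)

# 2D indexers produce a 2D, point-wise selection.
selected = ds.isel(
    y=xr.DataArray(iy, dims=("track", "sample")),
    x=xr.DataArray(ix, dims=("track", "sample")),
)
```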

Slide 26

Slide 26 text

Experimental Dask support (chunked point coordinates)
1st stage: “map” (index lookup)
[Diagram: query point chunks × index point chunks; each index chunk runs index<1>.query()]

Slide 27

Slide 27 text

Experimental Dask support (chunked point coordinates)
2nd stage: “reduce” (brute-force lookup)
[Diagram: query point chunks × index point chunks, reduced with dask.array.argmin]
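The two stages can be sketched in plain NumPy/SciPy, assuming a hypothetical helper `chunked_nearest` that is not part of xoak or Dask. Here only the index points are chunked and the chunks are processed in a loop. In the Dask version described on the slides, the query points are chunked as well, each per-chunk lookup would be a separate task, and the reduction would use `dask.array.argmin`.

```python
import numpy as np
from scipy.spatial import cKDTree


def chunked_nearest(index_points, query_points, n_chunks):
    """Nearest-neighbor positions via one small tree per index chunk."""
    chunks = np.array_split(index_points, n_chunks)
    # Start position of each chunk within the full index array.
    offsets = np.cumsum([0] + [len(c) for c in chunks[:-1]])

    # "Map": query every per-chunk tree with all query points.
    dists, local_pos = [], []
    for chunk in chunks:
        d, p = cKDTree(chunk).query(query_points)
        dists.append(d)
        local_pos.append(p)
    dists = np.stack(dists)          # shape (n_chunks, n_queries)
    local_pos = np.stack(local_pos)  # shape (n_chunks, n_queries)

    # "Reduce": brute-force argmin over the per-chunk candidates.
    best_chunk = np.argmin(dists, axis=0)
    cols = np.arange(len(query_points))
    return local_pos[best_chunk, cols] + offsets[best_chunk]


rng = np.random.default_rng(0)
index_points = rng.random((50, 2))
query_points = rng.random((8, 2))

chunked = chunked_nearest(index_points, query_points, n_chunks=4)
_, reference = cKDTree(index_points).query(query_points)
```

Comparing `chunked` with `reference` (a single tree over all points) is a quick sanity check that the reduction picks the same neighbors.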

Slide 28

Slide 28 text

Experimental Dask support (chunked point coordinates)
It works well, sometimes… but often fails (miserably)
It is challenging!
- Chunk size matters
- Dask inter-worker communication
- Indexes are often complex objects
- C/C++ native & dynamic data structures
- Memory footprint?
- Serialization?