Slide 1

Slide 1 text

Xarray-DataTree: Hierarchical Data Structures for Multi-Model Science
Or What Trees Can do for You
Thomas Nicholas*, Julius Busecke
*[email protected]
*github.com/TomNicholas

Slide 2

Slide 2 text

Who am I?
● Oceanographer (ex-plasma physicist)
● Xarray core dev & Pangeo user
● I like tools that all fields of science can use…
● My first AMS! 😁

Slide 3

Slide 3 text

Talk overview
1. Problem of Data Complexity
2. Xarray-DataTree can help!
3. Examples of people using DataTree already

Slide 4

Slide 4 text

Complexity of Big Data
● Big data has many challenges:
  ○ Compute / scalability … (use dask etc.)
  ○ Storage / IO / archiving … (use Zarr etc.)
  ○ But also complexity!
E.g. searching for some CMIP6 data… But so much going on!

Slide 5

Slide 5 text

Science has lots of complicated data
● Data at multiple resolutions
● Many parameters
  ○ e.g. CMIP models, scenarios, experiments, baselines...
● Heterogeneous data
  ○ e.g. observations + simulations
● Anytime your netCDF file has multiple groups

Slide 6

Slide 6 text

What is xarray?
● Python package providing labelled data structures
  ○ Wraps numpy-like arrays
● axis=0 vs dim='time'
● In-memory representation of a netCDF group
● Scales via dask to terabytes of data!
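The "axis=0 vs dim='time'" point can be illustrated with a toy sketch. This is not xarray itself: `LabelledArray`, its `mean` method, and the sample values are hypothetical stand-ins written with only the standard library, assuming a 2-D array stored as nested lists. It shows why naming a dimension is more robust than remembering its position.

```python
from dataclasses import dataclass


@dataclass
class LabelledArray:
    """Toy 2-D labelled array (hypothetical, not the xarray API)."""

    values: list  # list of rows
    dims: tuple   # one name per axis, e.g. ("time", "x")

    def mean(self, dim):
        """Average over the named dimension, whatever its position."""
        axis = self.dims.index(dim)  # look up the axis number from the name
        if axis == 0:
            # Reduce over rows: one mean per column.
            return [sum(col) / len(col) for col in zip(*self.values)]
        # Reduce over columns: one mean per row.
        return [sum(row) / len(row) for row in self.values]


# The same made-up data stored in two different axis orders.
by_time = LabelledArray([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]], dims=("time", "x"))
by_x = LabelledArray([[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]], dims=("x", "time"))

mean_a = by_time.mean("time")
mean_b = by_x.mean("time")
```

Both calls give the same result even though "time" is axis 0 in one array and axis 1 in the other; with positional axes the caller would have to track that difference by hand.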

Slide 7

Slide 7 text

Have you done this?
● Who here uses many separate xarray.Dataset objects?
● Issue: real use cases are too complicated for one Dataset
● I hope to banish this (anti-)pattern using DataTree!

Slide 8

Slide 8 text

Enter xarray-DataTree
● A “DataTree” is a hierarchical tree of xarray Datasets

Slide 9

Slide 9 text

Features 1: I/O
● Open any netCDF file (or Zarr store) containing multiple groups as a nested tree
● (Can save back out as a file with multiple groups too)

Slide 10

Slide 10 text

Features 2: Interactive HTML representation

Slide 11

Slide 11 text

Features 3: Node relationships
● Groups are connected as parent/children (& siblings/ancestors etc...)
● Access via filepath-like syntax
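A minimal sketch of what parent/child links and filepath-like access could look like. The `Node` class below is a hypothetical toy written with only the standard library, not the datatree API; the group names (`model_a`, `model_b`, `historical`) and the `tas` values are made up, and a plain dict stands in for each Dataset.

```python
class Node:
    """Toy tree node (hypothetical); `data` stands in for an xarray Dataset."""

    def __init__(self, name, data=None):
        self.name = name
        self.data = data
        self.parent = None
        self.children = {}

    def add(self, child):
        """Attach `child` below this node and return it."""
        child.parent = self
        self.children[child.name] = child
        return child

    @property
    def path(self):
        """Absolute path from the root, like a path in a file system."""
        if self.parent is None:
            return "/"
        return f"{self.parent.path.rstrip('/')}/{self.name}"

    @property
    def siblings(self):
        """Other children of this node's parent."""
        if self.parent is None:
            return {}
        return {k: v for k, v in self.parent.children.items() if v is not self}

    def __getitem__(self, path):
        """Walk a relative path; '..' steps up to the parent."""
        node = self
        for part in path.strip("/").split("/"):
            node = node.parent if part == ".." else node.children[part]
        return node


root = Node("root")
model_a = root.add(Node("model_a"))
model_b = root.add(Node("model_b"))
hist = model_a.add(Node("historical", data={"tas": [14.1, 14.3]}))

found = root["model_a/historical"]  # walk down from the root
cousin = hist["../../model_b"]      # walk up two levels, then down
```

Storing the parent on each node is what makes relative paths such as `../../model_b` possible; a plain nested dict would only support walking downwards.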

Slide 12

Slide 12 text

Features 4: Map computations over tree
● Xarray’s computation methods are automatically mapped over entire tree below
● Can also map custom computation
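The idea of mapping a computation over every node can be sketched as follows. This is a toy model using only the standard library, not the datatree implementation: the tree is assumed to be a flat dict from group path to a dict of variables, and `map_over_tree`, `time_mean`, the paths, and the numbers are all hypothetical.

```python
def map_over_tree(func, tree):
    """Apply `func` to each node's dataset; the result keeps the same paths."""
    return {path: func(dataset) for path, dataset in tree.items()}


def time_mean(dataset):
    """Custom computation: average every variable in one dataset."""
    return {name: sum(vals) / len(vals) for name, vals in dataset.items()}


# Nodes differ in length and in which variables they hold, so they
# would not fit together in a single rectangular dataset.
tree = {
    "/model_a/historical": {"tas": [1.0, 2.0, 3.0]},
    "/model_a/ssp585": {"tas": [2.0, 4.0]},
    "/model_b/historical": {"tas": [0.0, 1.0], "pr": [2.0, 2.0]},
}

means = map_over_tree(time_mean, tree)
```

The function only ever sees one node's data at a time, so the same analysis code runs unchanged across groups with different shapes, and the output mirrors the structure of the input tree.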

Slide 13

Slide 13 text

Example 1: CMIP6 data
● Julius Busecke and I did a multi-scenario, multi-model analysis
● See Pangeo blog post + Julius’ talk (3:45 pm tomorrow!)
  ○ “Reproducible IPCC Science Using Pangeo Tools in the Cloud”

Slide 14

Slide 14 text

Example 2: Ocean Glider data
● Guilherme Castelao’s ocean glider datasets
● Multiple sensors
  ○ Different time sampling rates

Slide 15

Slide 15 text

Example 3: Multi-resolution maps
● (by Joe Hamman / CarbonPlan)

Slide 16

Slide 16 text

Future plans
● Integrate datatree upstream into xarray main
● Automatic “tree broadcasting” of operations
● Attribute-like access to child nodes
  ○ (e.g. dt.model. --> dt.model.experiment_a)
● Give me Your Ideas!
github.com/xarray-contrib/datatree
P.S. I am looking for my next big project 😁

Slide 17

Slide 17 text

Example 4 (Bonus): Forecast data
● Alex Kerney’s Forecast model run collections

Slide 18

Slide 18 text

Example 5 (Bonus): xRadar