xarray-Datatree: Hierarchical Data Structures for Multi-Model Science

Talk from the AMS 2023 conference.

Abstract:

Real scientific workflows often require working with many heterogeneous but related datasets. Examples in geoscience include: (1) scenario simulations by many different climate models in the same intercomparison project, (2) simulation data at multiple resolutions from a convergence scan or sub-grid-scale study, and (3) observational + simulation data of the same region.

There is a need for a general high-level data structure that can organize such data accessibly while remaining flexible enough to adapt to the user’s mental model of their data. It should be intuitive, so that simple operations such as calculating average climatologies remain simple to express, and it should serialize to a commonly used data format, so as not to create backwards-compatibility problems.

The new xarray-datatree [1] package solves these problems by providing a tree-like hierarchical data structure that is general enough to be useful in a wide variety of cases. Datatree extends xarray by generalizing xarray.Dataset, so it builds upon an interface that many geoscientists are already familiar with. Analysis operations can be mapped over a whole tree, allowing simple operations to be expressed intuitively, even over complex heterogeneous datasets.

Datatree is inspired by netCDF: Xarray’s highest-level object is currently an xarray.Dataset, which stores collections of arrays with a shared coordinate system and corresponds to a single group in a netCDF file. A DataTree object is instead a structured hierarchical collection of Datasets, and would map to multiple netCDF groups. Therefore serialization to and from netCDF files is possible with datatree, so backwards compatibility is maintained.
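The relationship between a tree, its nodes, and netCDF-style groups can be illustrated with a small pure-Python toy. This is a conceptual sketch only and is not the datatree API: the `ToyNode` class, its methods, the group names, and the numbers are all hypothetical, and a plain dictionary stands in for an xarray.Dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class ToyNode:
    """Hypothetical stand-in for one tree node, i.e. one netCDF-style group."""

    name: str
    # Stand-in for an xarray.Dataset: variable name -> list of values.
    data: Dict[str, List[float]] = field(default_factory=dict)
    children: Dict[str, "ToyNode"] = field(default_factory=dict)
    parent: Optional["ToyNode"] = field(default=None, repr=False)

    def add_child(self, child: "ToyNode") -> "ToyNode":
        """Attach `child` below this node and return it."""
        child.parent = self
        self.children[child.name] = child
        return child

    @property
    def path(self) -> str:
        """Absolute, filepath-like location of this node in the tree."""
        if self.parent is None:
            return "/"
        return self.parent.path.rstrip("/") + "/" + self.name

    def __getitem__(self, path: str) -> "ToyNode":
        """Look up a descendant with a relative path such as 'cmip6/model_a'."""
        node = self
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node


# Invented example values; the group names are illustrative only.
root = ToyNode("root")
cmip6 = root.add_child(ToyNode("cmip6"))
cmip6.add_child(ToyNode("model_a", data={"tas": [287.1, 287.4]}))
cmip6.add_child(ToyNode("model_b", data={"tas": [286.9, 287.6, 288.0]}))

node = root["cmip6/model_a"]
print(node.path, node.data)
```

Each node holds its own dataset and knows its parent and children, which is why a tree of this shape can be written out as one file with nested groups.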

We will explain the model of datatree, its relation to netCDF and Zarr, and how to use the data structure to simplify your own work. We will also give examples of using datatree with real geoscience datasets, such as CMIP6 model data [2].

[1] https://github.com/xarray-contrib/datatree
[2] https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114

Tom Nicholas

January 09, 2023

Transcript

  1. Who am I?
     • Oceanographer (ex-plasma physicist)
     • Xarray core dev & Pangeo user
     • I like tools that all fields of science can use…
     • My first AMS! 😁
  2. Talk overview
     1. Problem of Data Complexity
     2. Xarray-DataTree can help!
     3. Examples of people using DataTree already
  3. Complexity of Big Data
     • Big data has many challenges:
       ◦ Compute / scalability … (use dask etc.)
       ◦ Storage / IO / archiving … (use Zarr etc.)
       ◦ But also complexity!
     • E.g. searching for some CMIP6 data… But so much going on!
  4. Science has lots of complicated data
     • Data at multiple resolutions
     • Many parameters
       ◦ e.g. CMIP models, scenarios, experiments, baselines...
     • Heterogeneous data
       ◦ e.g. observations + simulations
     • Anytime your netCDF file has multiple groups
  5. What is xarray?
     • Python package providing labelled data structures
       ◦ Wraps numpy-like arrays
     • axis=0 vs dim=’time’
     • In-memory representation of a netCDF group
     • Scales via dask to terabytes of data!
  6. Have you done this?
     • Who here uses many separate xarray.Dataset objects?
     • Issue: real use cases are too complicated for one Dataset
     • I hope to banish this (anti-)pattern using DataTree!
  7. Features 1: I/O
     • Open any netCDF file (/ Zarr store) containing multiple groups as a nested tree
     • (Can save back out as file with multiple groups too)
  8. Features 3: Node relationships
     • Groups are connected as parent/children (& siblings/ancestors etc...)
     • Access via filepath-like syntax
  9. Features 4: Map computations over tree
     • Xarray’s computation methods are automatically mapped over entire tree below
     • Can also map custom computation
  10. Example 1: CMIP6 data
     • Julius Busecke and I did a multi-scenario, multi-model analysis
     • See Pangeo blog post + Julius’ talk (3:45 pm tomorrow!)
       ◦ “Reproducible IPCC Science Using Pangeo Tools in the Cloud”
  11. Example 2: Ocean Glider data
     • Guilherme Castelao’s ocean glider datasets
     • Multiple sensors
       ◦ Different time sampling rates
  12. Future plans
     • Integrate datatree upstream into xarray main
     • Automatic “tree broadcasting” of operations
     • Attribute-like access to child nodes
       ◦ (e.g. dt.model.<click tab> --> dt.model.experiment_a)
     • Give me Your Ideas! github.com/xarray-contrib/datatree
     • P.S. I am looking for my next big project 😁
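The idea of mapping one computation over every node of a tree, as described in the "Map computations over tree" slide, can be sketched in plain Python. This is a toy under stated assumptions and not datatree's implementation: the tree is modelled as a dictionary from group path to a dictionary of variables, the helper names `time_mean` and `map_over_tree` are hypothetical, and the sensor names and values are invented to mimic glider sensors with different sampling rates.

```python
from statistics import fmean


def time_mean(dataset):
    """Average every variable in one node's dataset (a dict of value lists)."""
    return {name: fmean(values) for name, values in dataset.items()}


def map_over_tree(func, tree):
    """Apply `func` to the dataset at every path, keeping the tree layout."""
    return {path: func(dataset) for path, dataset in tree.items()}


# Invented data: two sensors sampled at different rates, so the nodes
# hold different variables with different lengths.
tree = {
    "/glider/ctd": {"temperature": [10.0, 12.0, 14.0, 16.0]},
    "/glider/optics": {"chlorophyll": [0.5, 1.5]},
}

means = map_over_tree(time_mean, tree)
for path, result in means.items():
    print(path, result)
```

Because the function is applied node by node, the nodes do not need to share variables or lengths, which is what makes the pattern suitable for heterogeneous data.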