Talk from the AMS 2023 conference.
Abstract:
Real scientific workflows often require working with many heterogeneous but related datasets. Examples in geoscience include: (1) scenario simulations by many different climate models in the same intercomparison project, (2) simulation data at multiple resolutions from a convergence scan or sub-grid-scale study, and (3) observational and simulation data of the same region.
There is a need for a general, high-level data structure that organizes such data in an accessible way while remaining flexible enough to adapt to the user’s mental model of their data. It should be intuitive, so that simple operations such as calculating average climatologies remain simple to express. It should also serialize to a commonly used data format, so as not to create backwards-compatibility problems.
The new xarray-datatree [1] package solves these problems by providing a tree-like hierarchical data structure that is general enough to be useful in a wide variety of cases. Datatree extends xarray by generalizing xarray.Dataset, building upon an interface that many geoscientists are already familiar with. Analysis operations can be mapped over a whole tree, allowing simple operations to be expressed intuitively, even over complex heterogeneous datasets.
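The idea of mapping one operation over a whole tree can be illustrated with a minimal pure-Python sketch. This is not the datatree API: nested dicts stand in for tree nodes, lists of numbers stand in for data arrays, and the names `map_over_tree`, `tree`, `model_a`, `model_b`, `tas` and `pr` are hypothetical, chosen only for illustration.

```python
from statistics import fmean

# Hypothetical stand-in for a tree of datasets: nested dicts are nodes,
# lists of numbers are the "arrays" stored at the leaves.
tree = {
    "model_a": {"tas": [14.0, 15.0, 16.0]},
    "model_b": {"tas": [13.0, 14.0], "pr": [2.0, 4.0]},
}


def map_over_tree(func, node):
    """Apply func to every leaf array, preserving the tree structure."""
    if isinstance(node, dict):
        return {name: map_over_tree(func, child) for name, child in node.items()}
    return func(node)


# One call computes the mean of every array in every node,
# even though the nodes hold different variables and array lengths.
means = map_over_tree(fmean, tree)
```

The result has the same shape as the input tree, which is the property that lets a single expression act on heterogeneous datasets without explicit loops over models.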
Datatree is inspired by netCDF. Xarray’s highest-level object is currently an xarray.Dataset, which stores a collection of arrays with a shared coordinate system and corresponds to a single group in a netCDF file. A DataTree object is instead a structured hierarchical collection of Datasets, and maps to multiple netCDF groups. Serialization to and from netCDF files is therefore possible with datatree, so backwards compatibility is maintained.
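The correspondence between tree nodes and netCDF groups can be sketched in plain Python. Again, this is an illustration under stated assumptions rather than the datatree implementation: nested dicts stand in for nodes, non-dict values stand in for variables, and `iter_groups` together with the node and variable names is hypothetical.

```python
def iter_groups(node, path="/"):
    """Yield (group_path, variable_names) for a node and all its descendants."""
    # Non-dict values are treated as variables stored in this group.
    variables = sorted(k for k, v in node.items() if not isinstance(v, dict))
    yield path, variables
    # Dict values are treated as child groups, addressed by a slash-separated path.
    for name, child in node.items():
        if isinstance(child, dict):
            yield from iter_groups(child, path.rstrip("/") + "/" + name)


# Hypothetical tree mixing observations with simulations at two resolutions.
tree = {
    "observations": {"sst": [18.2, 18.4]},
    "simulations": {
        "coarse": {"sst": [18.0, 18.1]},
        "fine": {"sst": [18.3, 18.5]},
    },
}

groups = dict(iter_groups(tree))
```

Each node ends up with a path such as `/simulations/coarse`, mirroring how groups are addressed inside a single netCDF file, while each individual group holds what a single Dataset would hold.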
We will explain the model of datatree, its relation to netCDF and Zarr, and how to use the data structure to simplify your own work. We will also give examples of using datatree with real geoscience datasets, such as CMIP6 model data [2].
[1] https://github.com/xarray-contrib/datatree
[2] https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114