
xarray-Datatree: Hierarchical Data Structures for Multi-Model Science

Talk from the AMS 2023 conference.

Abstract:

Real scientific workflows often require working with many heterogeneous but related datasets. Examples in geoscience include: (1) scenario simulations by many different climate models in the same intercomparison project, (2) simulation data at multiple resolutions from a convergence scan or sub-grid-scale study, and (3) observational + simulation data of the same region.

There is a need for a general high-level data structure which can organize such data in an accessible way, whilst still being flexible enough to adapt to the user’s mental model of their data. It should be intuitive, so that simple operations such as calculating average climatologies remain simple to express, and it should serialize to a commonly used data format, so as not to create backwards-compatibility problems.

The new xarray-datatree [1] package addresses these needs by providing a tree-like hierarchical data structure that is general enough to be useful in a wide variety of cases. Datatree extends xarray, generalizing xarray.Dataset and building on an interface that many geoscientists are already familiar with. Analysis operations can be mapped over a whole tree, allowing simple operations to be expressed intuitively, even over complex heterogeneous datasets.
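
As a rough illustration of what "mapping an operation over a whole tree" means, here is a toy sketch in plain Python. It does not use xarray or the datatree API: nested dictionaries stand in for the tree, lists of numbers stand in for data variables, and the names `tree`, `is_dataset` and `map_over_tree`, along with the model labels, scenario labels and values, are made up for this example.

```python
from statistics import mean

# Toy stand-in for a tree of datasets (NOT the datatree API):
# inner dicts are groups, leaf dicts map variable names to lists of values.
# All labels and numbers below are illustrative only.
tree = {
    "model_a": {
        "historical": {"tas": [14.1, 14.3, 14.2]},
        "ssp585": {"tas": [15.0, 15.6, 16.1]},
    },
    "model_b": {
        "historical": {"tas": [13.9, 14.0]},
    },
}


def is_dataset(node):
    """A leaf 'dataset' is a dict whose values are all lists of numbers."""
    return all(isinstance(value, list) for value in node.values())


def map_over_tree(func, node):
    """Apply func to every variable in every leaf, keeping the tree shape."""
    if is_dataset(node):
        return {name: func(values) for name, values in node.items()}
    return {key: map_over_tree(func, child) for key, child in node.items()}


# One call computes a mean for every variable in every group.
means = map_over_tree(mean, tree)
```

The result has the same nesting as the input, so `means["model_b"]["historical"]["tas"]` holds the mean for that one group. Unlike this toy, a real tree node can hold data and children at the same time.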

Datatree is inspired by netCDF: Xarray’s highest-level object is currently an xarray.Dataset, which stores collections of arrays with a shared coordinate system and corresponds to a single group in a netCDF file. A DataTree object is instead a structured hierarchical collection of Datasets, and would map to multiple netCDF groups. Therefore serialization to and from netCDF files is possible with datatree, so backwards compatibility is maintained.
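
To make the correspondence between tree nodes and netCDF groups more concrete, here is a toy sketch in plain Python. It is not the datatree API and performs no file I/O: the `Node` class, its `groups` method, and the group and variable names are hypothetical, chosen only to show how each node in a hierarchy can be addressed by a group-style path.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy tree node (NOT the datatree API): named variables plus child nodes."""

    data: dict = field(default_factory=dict)
    children: dict = field(default_factory=dict)

    def __getitem__(self, path):
        """Look up a descendant by a '/'-separated path, e.g. 'simulations/fine'."""
        node = self
        for part in path.strip("/").split("/"):
            if part:  # skip empty parts so that "/" returns this node
                node = node.children[part]
        return node

    def groups(self, prefix=""):
        """Yield the path of every node, parents before children."""
        yield prefix or "/"
        for name, child in self.children.items():
            yield from child.groups(f"{prefix}/{name}")


# Illustrative hierarchy: one observational group, two simulation resolutions.
root = Node(
    children={
        "observations": Node(data={"sst": [18.2, 18.4]}),
        "simulations": Node(
            children={
                "coarse": Node(data={"sst": [18.0]}),
                "fine": Node(data={"sst": [18.1, 18.3, 18.2]}),
            }
        ),
    }
)

group_paths = list(root.groups())
fine = root["simulations/fine"]
```

Here `group_paths` lists one path per node, which is the sense in which a single tree corresponds to several groups in one file, while each individual node plays the role of a single group.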

We will explain the model of datatree, its relation to netCDF & Zarr, and how to use the data structure to simplify your own work. We will also give examples of using datatree with real geoscience datasets, such as CMIP6 model data [2].

[1] https://github.com/xarray-contrib/datatree
[2] https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114

Tom Nicholas

January 09, 2023

Transcript

  1. Xarray-DataTree: Hierarchical Data Structures for Multi-Model Science
    Or What Trees Can Do for You
    Thomas Nicholas*, Julius Busecke
    *github.com/TomNicholas

  2. Who am I?
    ● Oceanographer (ex-plasma physicist)
    ● Xarray core dev & Pangeo user
    ● I like tools that all fields of science can use…
    ● My first AMS! 😁

  3. Talk overview
    1. Problem of Data Complexity
    2. Xarray-DataTree can help!
    3. Examples of people using DataTree already

  4. Complexity of Big Data
    ● Big data has many challenges:
    ○ Compute / scalability … (use dask etc.)
    ○ Storage / IO / archiving … (use Zarr etc.)
    ○ But also complexity!
    E.g. searching for some CMIP6 data… But so much going on!

  5. Science has lots of complicated data
    ● Data at multiple resolutions
    ● Many parameters
    ○ e.g. CMIP models, scenarios, experiments, baselines...
    ● Heterogeneous data
    ○ e.g. observations + simulations
    ● Anytime your netCDF file has multiple groups

  6. What is xarray?
    ● Python package providing labelled data structures
    ○ Wraps numpy-like arrays
    ● axis=0 vs dim='time'
    ● In-memory representation of a netCDF group
    ● Scales via dask to terabytes of data!

  7. Have you done this?
    ● Who here uses many separate xarray.Dataset objects?
    ● Issue: real use cases are too complicated for one Dataset
    ● I hope to banish this (anti-)pattern using DataTree!

  8. Enter xarray-DataTree
    ● A “DataTree” is a hierarchical tree of xarray Datasets

  9. Features 1: I/O
    ● Open any netCDF file (or Zarr store) containing multiple groups as a nested tree
    ● (Can save back out as a file with multiple groups too)

  10. Features 2: Interactive HTML representation

  11. Features 3: Node relationships
    ● Groups are connected as parent/children (& siblings/ancestors etc...)
    ● Access via filepath-like syntax

  12. Features 4: Map computations over tree
    ● Xarray’s computation methods are automatically mapped over entire tree below
    ● Can also map custom computation

  13. Example 1: CMIP6 data
    ● Julius Busecke and I did a multi-scenario, multi-model analysis
    ● See pangeo blog post + Julius’ talk (3:45 pm tomorrow!)
    ○ “Reproducible IPCC Science Using Pangeo Tools in the Cloud”

  14. Example 2: Ocean Glider data
    ● Guilherme Castelao’s ocean glider datasets
    ● Multiple sensors
    ○ Different time sampling rates

  15. Example 3: Multi-resolution maps
    ● (by Joe Hamman / CarbonPlan)

  16. Future plans
    ● Integrate datatree upstream into xarray main
    ● Automatic “tree broadcasting” of operations
    ● Attribute-like access to child nodes
    ○ (e.g. dt.model. --> dt.model.experiment_a)
    ● Give me Your Ideas!
    github.com/xarray-contrib/datatree
    P.S. I am looking for my next big project 😁

  17. Example 4 (Bonus): Forecast data
    ● Alex Kerney’s Forecast model run collections


  18. Example 5 (Bonus): xRadar