Slide 1

Slide 1 text

PANGEO ANALYSIS-READY CLOUD-OPTIMIZED CMIP DATA: Lessons learned and future directions
JULIUS BUSECKE | NOV 21 2024 | ESGF WEBINAR

Slide 2

Slide 2 text

WHO AM I?
M²LInES | jbusecke | juliusbusecke.com | @JuliusBusecke | @CodeAndCurrents@hachyderm.io | @codeandcurrents.bsky.social
🌊 Climate Scientist: ocean transport of heat, carbon, oxygen; impact of small-scale processes on global climate variability.
🤓 Developer/Data Nerd: Pangeo CMIP6 Cloud Data, xMIP/xGCM.
🤝 Open Science Advocate: Manager for Data and Computation at NSF LEAP; Lead of Open Research at M²LInES.
🙂↔💀 Screw you, Elon!

Slide 3

Slide 3 text

- ESGF "Power User" + Maintainer of Pangeo Cloud Data - Trying to integrate the lessons learned in the past years into the new ESGF infrastructure - Representative of many di ff erent CMIP6 Users and their struggles - Many people fi nd it incredibly hard to work with CMIP6 data, and I believe we should do as much as we can to make it as easy as possible for users. - If work is shifted downstream to each user, toil is ampli fi ed, and results become harder to reproduce. This hurts science. - This can often be overlooked when we have privileged access to the data and are focused on deeply technical details - I learned a lot about CMIP data, ESGF and how users work with that data in the past years! - Thank you for all your work! WHAT HAT AM I WEARING TODAY?

Slide 4

Slide 4 text

INTRO
- CMIP data is used more and more broadly
- "Traditional users" at large orgs like universities and labs
- "New users" are increasingly interested in accessing CMIP data:
  - Industry (insurance, climate service providers, ...)
  - Local government, non-profit, defense, ...
  - The "random person" not belonging to any of the above
"Universal Declaration of Human Rights." United Nations. Accessed August 19, 2024. https://www.un.org/en/about-us/universal-declaration-of-human-rights.

Slide 5

Slide 5 text

Download → Clean and Combine → Crunch the data → Interpret Results

Slide 6

Slide 6 text

Download → Clean and Combine → Crunch the data → Interpret Results ⏳💸🚫

Slide 7

Slide 7 text

Download → Clean and Combine → Crunch the data → Interpret Results ❌

Slide 8

Slide 8 text

Privileged Institutions create "Data Fortresses*"
*Coined by Chelle Gentemann
❌ Results not reproducible outside fortress
❌ Barrier to collaboration
❌ Inefficient / duplicative
❌ Can't scale to future data needs
❌ Limits inclusion and knowledge transfer
Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons

Slide 10

Slide 10 text

THE PANGEO / ESGF CLOUD DATA WORKING GROUP: Centralize this cache in the cloud!
- Build prototypes of Analysis-Ready Cloud-Optimized CMIP6 "caches"
- Experiment with different approaches:
  - Convert NetCDFs to Zarr (LDEO), see the sketch below
    - Google Cloud Storage
    - Rewrites data but enhances user experience and performance
  - Replicate NetCDFs on cloud storage (GFDL)
    - AWS
    - Works as an ESGF replica node
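In essence, the LDEO-style conversion is a small xarray pattern. A minimal sketch, assuming hypothetical file names, an illustrative chunk size, and a placeholder bucket (the real pipeline does much more; see the recipe anatomy on the following slides):

```python
import xarray as xr

# Combine the per-time-range NetCDF files of one CMIP6 dataset along time
# (file names are hypothetical placeholders).
ds = xr.open_mfdataset(
    [
        "tas_Amon_SOMEMODEL_historical_r1i1p1f1_gn_185001-189912.nc",
        "tas_Amon_SOMEMODEL_historical_r1i1p1f1_gn_190001-201412.nc",
    ],
    combine="by_coords",
)

# Rewrite as a single consolidated Zarr store on cloud object storage
# (bucket/prefix hypothetical; gs:// URLs require gcsfs).
ds.chunk({"time": 600}).to_zarr(
    "gs://some-bucket/CMIP6/tas_Amon_SOMEMODEL_historical.zarr",
    mode="w",
    consolidated=True,
)
```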

Slide 11

Slide 11 text

Zarr Ingestion via Pangeo-Forge: The anatomy of the CMIP6 recipe

Slide 19

Slide 19 text

Zarr Ingestion via Pangeo-Forge: The anatomy of the CMIP6 recipe
• Recipe Steps
  • Start with a list of unique identifiers of a dataset (made up of .nc files)
  • Dynamically request download URLs
  • Dynamic chunking (based on the size of each dataset)
  • Cataloging and testing inline
• Recipe Execution
  • Orchestrating thousands of single recipes in a dictionary
  • Currently runs on Google Dataflow but could run on any Beam runner! (See the sketch below.)
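For orientation, a minimal sketch of the shape of such a recipe as an Apache Beam pipeline with pangeo-forge-recipes (URLs, store name, and chunk size are hypothetical; the production CMIP6 recipes add the dynamic URL lookup, dynamic chunking, and inline testing/cataloging stages listed above, and the target storage location is injected at deployment):

```python
import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

# One dataset = an ordered sequence of .nc files concatenated along time.
urls = [
    "https://esgf-node.example/tas_185001-189912.nc",  # placeholder URLs
    "https://esgf-node.example/tas_190001-201412.nc",
]
pattern = pattern_from_file_sequence(urls, concat_dim="time")

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()                        # fetch/cache the source files
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(                               # write one combined Zarr store
        store_name="tas_example.zarr",
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 600},             # the real recipes set this dynamically
    )
)
```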

Slide 20

Slide 20 text

CMIP6 CLOUD DATA - WHAT WORKED WELL
[Diagram: instead of every user group (university, lab, industry) rolling their own custom code ❌✋🚫, an ESGF ingestion pipeline fills a single data repository in the cloud that serves all use cases; storage provided by Google as a Public Dataset.]

Slide 21

Slide 21 text

CMIP6 CLOUD DATA
A single data repository in the cloud serves all use cases:
- Collaborative and agile research
- Inclusive education on real climate data
- Fast iteration - lower barrier of entry
- Portable methods and results, not just for academia

Slide 26

Slide 26 text

CMIP6 CLOUD DATA - WHAT COULD BE IMPROVED
- Cataloging
  - Simple CSV file with facets powering an intake catalog (see the sketch below)
  - BUT this is workable, and cataloging demands are very community specific
  - Focusing on maximum performance and flexibility could enable many different downstream cataloging approaches!
- Usage Statistics
  - We have no usage statistics from Google
  - Could likely be remedied with a different agreement in the future
- "Yet another way to access CMIP"
  - Users are already confused by the many options; +1 adds to the burden if docs are fragmented
  - Consolidated docs and/or exposing the ARCO cache data in the ESGF catalog would improve this!
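That CSV-backed catalog is what intake-esm consumes today; opening and querying it takes a few lines (uses the public Pangeo catalog URL; requires intake-esm and gcsfs):

```python
import intake

# The facet CSV behind this JSON powers the whole Pangeo CMIP6 collection.
cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

# Search by CMIP6 facets, then load all matches as a dict of xarray datasets.
subset = cat.search(
    source_id="GFDL-ESM4",
    experiment_id="historical",
    table_id="Amon",
    variable_id="tas",
)
dset_dict = subset.to_dataset_dict()
```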

Slide 27

Slide 27 text

WHAT WE LEARNED AND WHERE TO GO NEXT
- People love the dataset/datacube access (see the sketch below)
  - Promote this entry point to the same level as file download in the base API and instructions/documentation!
- Rewriting the data is really hard!
  - REST API docs unclear + synchronous clients slow.
  - Lots of QC issues to work around.
  - So let's not do it!
- Creating virtual (Zarr) references preserves a lot of the user experience that people love
  - Will make it easier to enable high-performance (cloud storage based) and/or specialized (rechunked) 'caches' of the official data!
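For context, the datacube access people love is one URL and one call; a sketch against the Pangeo Google Cloud store (the exact store path is illustrative and would normally come from the catalog; requires gcsfs):

```python
import xarray as xr

# One URL, one call: lazy, chunked access to a whole CMIP6 dataset.
ds = xr.open_zarr(
    "gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/"
    "r1i1p1f1/Amon/tas/gn/v20190726/",
    consolidated=True,
)

# Only the chunks needed for this computation are actually downloaded.
monthly_mean = ds.tas.sel(time="2000-01").mean().compute()
```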

Slide 28

Slide 28 text

FILE AGGREGATION AND VIRTUAL ZARR REFS
- Different flavors of the same basic idea:
  - Expose multiple logically connected files (e.g. timesteps) as a single dataset by combining metadata and referencing data chunks from each file → the user accesses the smallest scientifically useful unit of data!
- Kerchunk (requires fsspec/Python)
  - References can be stored as JSON or Parquet files
  - Works right now with s3- and http-served CMIP6 files
- Zarr V3 (e.g. implemented via Icechunk)
  - Works with development branches
  - Would allow access
- Aggregated NetCDF
- ...
- Each one of these approaches requires creating and hosting a few small additional files per dataset and pointing to a single file/URL
Kerchunk JSON example / xarray usage example (see the sketch below)
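To make the two captions above concrete: a kerchunk reference set maps Zarr chunk keys to byte ranges in the original files, and xarray can open it through fsspec's reference filesystem. A sketch (reference file name, bucket, offsets, and options are placeholders):

```python
import xarray as xr

# A (v1) kerchunk JSON maps zarr keys to (url, offset, length) triples, e.g.:
# {"version": 1,
#  "refs": {".zgroup": "{\"zarr_format\": 2}",
#           "tas/0.0.0": ["s3://bucket/tas_185001-189912.nc", 20412, 913840]}}

# Open the references as if they were a Zarr store; the chunk bytes are
# still read from the original NetCDF files on s3.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "refs.json",            # the reference file (placeholder)
            "remote_protocol": "s3",      # where the .nc bytes actually live
            "remote_options": {"anon": True},
        },
    },
)
```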

Slide 29

Slide 29 text

FILE AGGREGATION AND VIRTUAL ZARR REFS
- Generation of kerchunk JSON references has been demonstrated by CEDA for ESGF-NG and will likely be part of the new STAC catalog 🎉
- We can reference NetCDFs over http and s3
- Open questions:
  - When to generate/update references? As part of publishing or as a separate event?
    - A separate event could enable rebuilding different refs after publishing!
  - Can only publishers generate refs to 'their' files, or can anyone publish just references?
- Some caveats! Additional QC/QA requirements:
  - Required: consistent compression codecs and chunk size between files (a sketch of such a check follows)
  - Nice to have: chunk size range checks
[Diagram: reference generation at the publish node vs. at a separate reference node]
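The "required" QC above can be phrased as a small consistency check over the per-file encodings. A sketch, assuming files readable by xarray's netCDF4 backend (the function and its names are hypothetical, not an existing ESGF tool):

```python
import xarray as xr

def check_aggregation_encoding(paths: list[str], var: str) -> None:
    """Virtual references can only share one set of Zarr metadata if every
    file uses the same compression codec and on-disk chunk shape."""
    reference = None
    for path in paths:
        with xr.open_dataset(path) as ds:
            enc = ds[var].encoding
            signature = (
                enc.get("chunksizes"),   # on-disk HDF5 chunk shape
                enc.get("zlib"),         # compression codec settings ...
                enc.get("complevel"),
                enc.get("shuffle"),
            )
        if reference is None:
            reference = signature
        elif signature != reference:
            raise ValueError(
                f"{path} breaks the aggregation: {signature} != {reference}"
            )
```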

Slide 30

Slide 30 text

ONGOING WORK
- Using VirtualiZarr to produce references (supports writing kerchunk and icechunk)
- Demonstrate workflow and test performance (see the sketch below)
- Fully working examples of virtual Zarr references from NetCDF files on s3 and http
  - Using VirtualiZarr to combine files using xarray
  - Currently producing kerchunk references, but experimenting with icechunk
- Please raise issues and reach out if you are interested.
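A sketch of that VirtualiZarr workflow (URLs are placeholders, and the library's API is still evolving, so treat this as the shape of the workflow rather than a final interface):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

urls = [
    "s3://esgf-replica/tas_185001-189912.nc",  # placeholder locations
    "s3://esgf-replica/tas_190001-201412.nc",
]

# Each call reads only metadata and byte ranges, never the chunk data itself.
virtual_datasets = [open_virtual_dataset(u, indexes={}) for u in urls]

# Combine along time exactly like regular xarray datasets ...
combined = xr.concat(
    virtual_datasets, dim="time", coords="minimal", compat="override"
)

# ... and persist the combined references, here as kerchunk JSON.
combined.virtualize.to_kerchunk("combined_refs.json", format="json")
```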

Slide 31

Slide 31 text

SUMMARY
- I think with small changes we can integrate many of our lessons into the new ESGF infrastructure
- We can reproduce much of the UX of the Pangeo CMIP6 Cloud Data without rewriting to Zarr 🎉
- Requires:
  - ✅ Support for dataset-level references/aggregations in the base data access API
  - ❓ Support for multiple reference/aggregation methods?
  - ✅ Support for replication to S3 storage
  - ❓ QC on compression and chunking
- Producing references using VirtualiZarr can work as part of publishing
  - This will significantly enhance the user experience and enable downstream efforts:
    - Orgs can run cloud-enabled replica nodes to enhance performance for parallel access (no commitment from ESGF)
    - Makes efforts to create specialized caches (rechunked native Zarr, icechunk, ...) easier
    - Depending on the QC/QA procedure, these could be fully verified against published files too
- Consolidated docs 🙏!
  - Aside from data access: a single website with info for new users, for users who want to build downstream tools, and for maintainers of core infrastructure would go a looooong way towards making access to climate data more inclusive and reducing time demands on ESGF maintainers.

Slide 32

Slide 32 text

I ❤ QUESTIONS AND COMMENTS