"How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!"

Julius Busecke
November 30, 2023

Slides for "How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!" presented at Pangeo Showcase on November 29, 2023 by Julius Busecke, Charles Stern.

Additional Material:
Recording of the presentation: https://www.youtube.com/watch?v=vZKlcsYNNbU

Transcript

  1. How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!

    Julius Busecke and Charles Stern | Nov 28th 2023 | Pangeo Showcase
    Award 8434 Awards 2026932, 2019625
  2. Who am I?

    • Physical Oceanographer
    • Senior Staff Associate at Columbia University
    • Manager of Data and Computing for LEAP
    • Lead of Open Science for M2LInES
    • Core Developer of xGCM, xMIP
    • Pangeo Fan, User, and Member
    • Open Source / Open Science Advocate
    • Maintainer of the Pangeo CMIP6 zarr stores
  3. Because it's fast and easy! Reproducible IPCC Science in Minutes

    IPCC Chapter 9
    2-10 minutes on LEAP-Pangeo JupyterHub
    Code Repository: https://github.com/jbusecke/presentation_wcrp_open_science_conference
    SciPy Talk: https://www.youtube.com/watch?v=7niNfs3ZpfQ
  4. Because more and more people need this data. Collaboration on open data: Everybody wins!

    Diagram labels:
    • Participants: Academia, Private Sector / Public Sector, Public Stakeholder
    • Contributions: Fixes naming, Corrects units, Tunes compression, Uploads new data
    • Outcomes: Science Paper, Derived Proprietary Data Product
    • Everyone gets to explore the clean dataset!
  5. Challenges

    • ~13 PB 👀
    • How to prioritize the most desired datasets?
    • Millions of datasets with variable size and naming need to be converted to Analysis-Ready Cloud-Optimized (ARCO) Zarr stores 🏗
    • Each dataset consists of possibly many nc files, some of which are not always available
    • Many different grids and timesteps -> vastly different array shapes
    • Ingestion is not just one way. Some datasets are retracted, and any mirror needs to reflect that. ⚠
  6. The Proof of concept

    • Public Storage Buckets from Google and Amazon
    • Collaborative effort in the Pangeo / ESGF Cloud Data Working Group
    • Zarr stores manually processed by Naomi Henderson
    • ~150k datasets uploaded based on user requests, with tons of manual labor in Jupyter notebooks
    • Cataloging: intake-esm collection based on a large csv file
    • Retraction: Datasets were removed from the catalog but never deleted.
    • Early upload enabled downstream work to use and improve the usability of the cloud data early on.
    • But then Naomi retired 😱

    https://github.com/jbusecke/xMIP
    https://pangeo-data.github.io/pangeo-cmip6-cloud/
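
The CSV-backed catalog and the retraction rule on this slide can be sketched with the standard library alone. This is a minimal sketch, not the working group's actual code: the column names (`instance_id`, `zstore`), the sample rows, the bucket paths, and the helper `drop_retracted` are all assumptions made for illustration. As on the slide, a retracted dataset is only removed from the catalog; the store itself is left untouched.

```python
import csv
import io

# Hypothetical catalog rows; a real catalog has many more columns and rows.
CATALOG_CSV = """instance_id,zstore
CMIP6.CMIP.InstA.ModelA.historical.r1i1p1f1.Omon.tos.gn.v1,gs://example-bucket/a
CMIP6.CMIP.InstB.ModelB.historical.r1i1p1f1.Omon.tos.gn.v1,gs://example-bucket/b
CMIP6.CMIP.InstC.ModelC.historical.r1i1p1f1.Omon.tos.gn.v1,gs://example-bucket/c
"""


def drop_retracted(catalog_text, retracted_ids):
    """Return catalog CSV text without the rows whose instance_id was retracted."""
    reader = csv.DictReader(io.StringIO(catalog_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only the catalog entry is dropped; nothing is deleted from storage.
        if row["instance_id"] not in retracted_ids:
            writer.writerow(row)
    return out.getvalue()


retracted = {"CMIP6.CMIP.InstB.ModelB.historical.r1i1p1f1.Omon.tos.gn.v1"}
cleaned = drop_retracted(CATALOG_CSV, retracted)
```

Users who open the cleaned catalog no longer see the retracted entry, while anyone holding the direct store path could still reach the data.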
  7. Building a robot Naomi

    • Pangeo-Forge
    • Open source platform for data Extraction, Transformation, Loading (ETL)
    • Encodes all information needed to recreate an ARCO copy
    • Originally designed for a few massive datasets
    • CMIP6 is a unique use case: a massive number of small-ish datasets
    • Massive refactor to Apache Beam finally enabled us to scale this ingestion
    • Big shout-out to Charles Stern and all Pangeo-Forge maintainers.

    https://github.com/pangeo-forge/pangeo-forge-recipes
    https://pangeo-forge.readthedocs.io/en/latest/index.html
  8. The anatomy of the CMIP6 recipe

    Recipe Steps
    • Start with a list of unique identifiers of a dataset (made up of .nc files)
    • Dynamically request download urls
    • Dynamic chunking (based on the size of each dataset)
    • Cataloging and testing inline

    Recipe Execution
    • Orchestrating thousands of single recipes in a dictionary
    • Currently runs on Google Dataflow but could run on any Beam runner!
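
The "thousands of single recipes in a dictionary" idea can be illustrated without Beam. The sketch below is an assumption-laden stand-in, not the actual recipe code: `make_recipe`, the placeholder identifiers, and the `unavailable` set are hypothetical, and each "recipe" is a plain callable rather than a Beam pipeline. It only shows the orchestration pattern: one entry per dataset, so a failure in one dataset does not stop the others.

```python
def make_recipe(instance_id, unavailable=frozenset()):
    """Hypothetical stand-in for one recipe; in practice this would be a Beam pipeline."""

    def run():
        # Real steps would be: request urls, open files, rechunk, write Zarr,
        # then catalog and test. Here we only simulate success or failure.
        if instance_id in unavailable:
            raise RuntimeError(f"source files unavailable for {instance_id}")
        return f"{instance_id}.zarr"

    return run


# Placeholder identifiers; real ones are full CMIP6 instance_ids.
instance_ids = ["dataset_a", "dataset_b", "dataset_c"]

# One recipe per dataset, keyed by its identifier.
recipes = {iid: make_recipe(iid, unavailable={"dataset_b"}) for iid in instance_ids}

written, failed = {}, {}
for iid, recipe in recipes.items():
    try:
        written[iid] = recipe()
    except RuntimeError as err:
        # Record the failure and keep going with the remaining datasets.
        failed[iid] = str(err)
```

Keying by identifier makes it straightforward to retry only the entries in `failed` once the missing files become available again.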
  9. ESGF query: pangeo-forge-esgf

    • Python client for the ESGF API
    • Parse single instance_ids from wildcards
    • Return http download urls for a given instance_id
    • This is my first async code, so it can probably be vastly improved, but it seems reasonably fast.
    • If there are better packages I should use, I'd love the feedback
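
The two tasks on this slide, expanding a wildcard into single instance_ids and then fetching urls concurrently, can be sketched with the standard library. This does not use or reproduce the pangeo-forge-esgf API: `expand_wildcard`, `fetch_urls`, `urls_for`, the sample ids, and the `example.invalid` urls are all hypothetical, and the network request is replaced by a stub so the sketch runs offline.

```python
import asyncio
from fnmatch import fnmatchcase

# Made-up instance_ids standing in for what a search would return.
KNOWN_IDS = [
    "CMIP6.CMIP.InstA.ModelA.historical.r1i1p1f1.Omon.tos.gn.v1",
    "CMIP6.CMIP.InstA.ModelA.historical.r1i1p1f1.Omon.so.gn.v1",
    "CMIP6.ScenarioMIP.InstB.ModelB.ssp585.r1i1p1f1.Omon.tos.gn.v1",
]


def expand_wildcard(pattern, known_ids):
    """Return every known instance_id that matches a '*' wildcard pattern."""
    return [iid for iid in known_ids if fnmatchcase(iid, pattern)]


async def fetch_urls(instance_id):
    """Stub: real code would await an HTTP request to an ESGF search node."""
    await asyncio.sleep(0)  # yield control, as a real request would
    return [f"https://example.invalid/{instance_id}/file_0.nc"]


async def urls_for(pattern, known_ids):
    """Resolve a wildcard to instance_ids, then fetch their urls concurrently."""
    ids = expand_wildcard(pattern, known_ids)
    url_lists = await asyncio.gather(*(fetch_urls(iid) for iid in ids))
    return dict(zip(ids, url_lists))


urls = asyncio.run(urls_for("CMIP6.CMIP.*.*.historical.*.Omon.tos.gn.*", KNOWN_IDS))
```

`asyncio.gather` returns results in the order the awaitables were passed, which is why zipping them back onto `ids` is safe here.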
  10. Dynamic Chunking: Bring your own!

    • Preserve monthly chunks for some fancy calendar once zarr allows unequal chunks.
    • Chunk one dimension only if a certain other dimension is available
    • ...
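
A size-based chunking rule of the kind a user might bring can be sketched as follows. This is one possible rule under stated assumptions, not the algorithm used in the recipe: the function name `dynamic_chunks`, the 100 MB target, and the example dimension sizes are made up. It chunks along a single dimension only and picks the chunk length so each chunk stays at or below a target size in bytes.

```python
def dynamic_chunks(dims, itemsize, target_bytes, chunk_dim="time"):
    """Chunk only along chunk_dim, keeping each chunk at or below target_bytes.

    dims maps dimension name to length; itemsize is bytes per array element.
    """
    # Bytes needed for one step along chunk_dim (all other dims kept whole).
    bytes_per_step = itemsize
    for name, length in dims.items():
        if name != chunk_dim:
            bytes_per_step *= length

    # At least one step per chunk, and never more than the dimension length.
    steps = max(1, min(dims[chunk_dim], target_bytes // bytes_per_step))

    chunks = dict(dims)
    chunks[chunk_dim] = steps
    return chunks


# Hypothetical monthly ocean surface field: 165 years on a 1-degree grid, float32.
large = dynamic_chunks({"time": 1980, "y": 180, "x": 360}, itemsize=4, target_bytes=100_000_000)

# A small dataset fits in a single chunk.
small = dynamic_chunks({"time": 12, "y": 10, "x": 10}, itemsize=4, target_bytes=100_000_000)
```

Because the chunk length is computed per dataset, small datasets end up as one chunk while large ones are split, which is the behavior the "based on the size of each dataset" step calls for.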
  11. Ok, so what's up with the "why we should never do it again"?

    Lower barrier of entry + Quick exploration of ideas = New CMIPers!
    Teaching with the real data: Student ➡ Scientist 🚀
    Not just for academics. Public/private sector uses the cloud data!