"How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!"

Julius Busecke
November 30, 2023

Slides for "How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!" presented at Pangeo Showcase on November 29, 2023 by Julius Busecke, Charles Stern.

Additional Material:
Recording of the presentation: https://www.youtube.com/watch?v=vZKlcsYNNbU

Transcript

  1. How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!

    Julius Busecke and Charles Stern | Nov 28th 2023 | Pangeo Showcase
    Award 8434 Awards 2026932, 2019625
  2. Who am I?

    • Physical Oceanographer
    • Senior Staff Associate at Columbia University
    • Manager of Data and Computing for LEAP
    • Lead of Open Science for M2LInES
    • Core Developer of xGCM, xMIP
    • Pangeo Fan, User, and Member
    • Open Source / Open Science Advocate
    • Maintainer of the Pangeo CMIP6 zarr stores
  3. Because it's fast and easy! Reproducible IPCC Science in Minutes

    • IPCC Chapter 9: 2-10 minutes on LEAP-Pangeo JupyterHub
    • Code Repository: https://github.com/jbusecke/presentation_wcrp_open_science_conference
    • SciPy Talk: https://www.youtube.com/watch?v=7niNfs3ZpfQ
  4. Because more and more people need this data

    Collaboration on open data: Everybody wins!

    Diagram labels:
    • Participants: Academia, Private Sector / Public Sector, Public Stakeholder
    • Contributions: fixes naming, corrects units, tunes compression, uploads new data
    • Outcomes: Science Paper 🎓, Derived Proprietary Data Product 💵

    Everyone gets to explore the clean dataset!
  5. Challenges

    • ~13 PB 👀
    • How to prioritize the most desired datasets?
    • Millions of datasets with variable size and naming need to be converted to Analysis-Ready Cloud-Optimized Zarr stores 🏗
    • Each dataset consists of possibly many nc files, some of which are not always available
    • Many different grids and timesteps -> vastly different array shapes
    • Ingestion is not just one way. Some datasets are retracted, and any mirror needs to reflect that. ⚠
  6. The Proof of Concept

    • Public Storage Buckets from Google and Amazon
    • Collaborative effort in the Pangeo / ESGF Cloud Data Working Group
    • Zarr stores manually processed by Naomi Henderson
    • ~150k datasets uploaded based on user requests, with tons of manual labor in Jupyter notebooks
    • Cataloging: intake-esm collection based on a large CSV file
    • Retraction: datasets were removed from the catalog but never deleted
    • Early upload enabled downstream work to use and improve the usability of the cloud data early on
    • But then Naomi retired 😱

    https://github.com/jbusecke/xMIP
    https://pangeo-data.github.io/pangeo-cmip6-cloud/
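
The catalog-only retraction described on this slide can be sketched in a few lines. This is a minimal illustration using only the standard library, not the working group's actual tooling: the column names (`instance_id`, `zstore`), the bucket paths, and the helper `drop_retracted` are assumptions made up for the example, and the dataset ids are illustrative.

```python
import csv
import io

# Hypothetical miniature catalog. Column names and bucket paths are
# assumptions for illustration, not the real catalog schema.
CATALOG_CSV = """instance_id,zstore
CMIP6.CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.tos.gn.v20190726,gs://example-bucket/a
CMIP6.CMIP.NCAR.CESM2.historical.r1i1p1f1.Omon.tos.gn.v20190308,gs://example-bucket/b
"""


def drop_retracted(catalog_text, retracted_ids):
    """Return the catalog rows whose instance_id is not retracted.

    Only the catalog entry disappears; the store it points to is untouched,
    which mirrors the "remove from catalog, never delete" approach.
    """
    reader = csv.DictReader(io.StringIO(catalog_text))
    return [row for row in reader if row["instance_id"] not in retracted_ids]


# Pretend the second dataset was retracted upstream.
retracted = {"CMIP6.CMIP.NCAR.CESM2.historical.r1i1p1f1.Omon.tos.gn.v20190308"}
kept = drop_retracted(CATALOG_CSV, retracted)
```

Users who only discover data through the catalog no longer see the retracted entry, while anyone holding the direct store path can still open it.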
  7. Building a robot Naomi

    • Pangeo-Forge
    • Open source platform for data Extraction, Transformation, Loading (ETL)
    • Encodes all information needed to recreate an ARCO copy
    • Originally designed for few massive datasets
    • CMIP6 is a unique use-case: massive amount of small-ish datasets
    • Massive refactor to Apache Beam finally enabled us to scale this ingestion
    • Big shout-out to Charles Stern and all Pangeo-Forge maintainers

    https://github.com/pangeo-forge/pangeo-forge-recipes
    https://pangeo-forge.readthedocs.io/en/latest/index.html
  8. The anatomy of the CMIP6 recipe

    • Recipe Steps
      • Start with a list of unique identifiers of a dataset (made up of .nc files)
      • Dynamically request download URLs
      • Dynamic chunking (based on the size of each dataset)
      • Cataloging and testing inline
    • Recipe Execution
      • Orchestrating thousands of single recipes in a dictionary
      • Currently runs on Google Dataflow but could run on any Beam runner!
  9. ESGF query: pangeo-forge-esgf

    • Python client for the ESGF API
    • Parse single instance_ids from wildcards
    • Return http download URLs for a given instance_id
    • This is my first async code, so it can probably be vastly improved, but it seems reasonably fast.
    • If there are better packages I should use, I'd love the feedback
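
To make "parse single instance_ids from wildcards" concrete, here is a small offline sketch. It is not the pangeo-forge-esgf API and makes no ESGF requests: `FACETS`, `parse_instance_id`, `expand_wildcard`, and `KNOWN_IDS` are hypothetical names for this example, the ids are illustrative, and the facet order assumes the usual ten-part CMIP6 dataset id layout.

```python
from fnmatch import fnmatchcase

# Assumed facet order of a CMIP6 instance_id (dot-separated, ten parts).
FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def parse_instance_id(iid):
    """Split an instance_id into a facet-name -> value dictionary."""
    parts = iid.split(".")
    if len(parts) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} facets, got {len(parts)}: {iid}")
    return dict(zip(FACETS, parts))


def expand_wildcard(pattern, known_ids):
    """Return the ids from known_ids that match a per-facet wildcard pattern.

    Matching is done facet by facet so that '*' never spans a dot.
    """
    pattern_parts = pattern.split(".")
    if len(pattern_parts) != len(FACETS):
        raise ValueError(f"pattern must have {len(FACETS)} facets: {pattern}")
    matches = []
    for iid in known_ids:
        parts = iid.split(".")
        if len(parts) == len(FACETS) and all(
            fnmatchcase(part, pat) for part, pat in zip(parts, pattern_parts)
        ):
            matches.append(iid)
    return matches


# Illustrative ids standing in for what a real ESGF search would return.
KNOWN_IDS = [
    "CMIP6.CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.tos.gn.v20190726",
    "CMIP6.CMIP.NCAR.CESM2.historical.r1i1p1f1.Omon.tos.gn.v20190308",
    "CMIP6.ScenarioMIP.NCAR.CESM2.ssp585.r1i1p1f1.Omon.tos.gn.v20200528",
]

matches = expand_wildcard("CMIP6.*.*.*.historical.*.Omon.tos.gn.*", KNOWN_IDS)
```

In a real client the candidate ids would come from an ESGF search rather than a fixed list, and each resolved instance_id would then be used to request its download URLs.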
  10. Dynamic Chunking

    Bring your own!

    • Preserve monthly chunks for some fancy calendar once zarr allows unequal chunks.
    • Chunk one dimension only if a certain other dimension is available
    • ...
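
One simple "bring your own" rule is to choose the chunk length along a single dimension so that each chunk lands near a target size in bytes. The sketch below shows that idea only; it is not the chunking algorithm used in the CMIP6 recipe, and the function name `time_chunk_length`, its parameters, and the 100 MB default are assumptions for illustration.

```python
import math


def time_chunk_length(shape, dims, itemsize, target_bytes=100_000_000, chunk_dim="time"):
    """Pick a chunk length along chunk_dim so one chunk is close to target_bytes.

    All other dimensions are kept whole, so a chunk is a stack of complete
    slices along chunk_dim.
    """
    if chunk_dim not in dims:
        raise ValueError(f"{chunk_dim!r} not found in dims {dims}")
    axis = dims.index(chunk_dim)
    # Bytes occupied by a single step along the chunked dimension.
    bytes_per_step = itemsize * math.prod(
        size for i, size in enumerate(shape) if i != axis
    )
    # At least one step per chunk, and never more steps than the array has.
    steps = max(1, target_bytes // bytes_per_step)
    return min(shape[axis], steps)


# 165 years of monthly float32 data on a 1-degree grid.
monthly = time_chunk_length((1980, 180, 360), ("time", "lat", "lon"), itemsize=4)

# A short dataset fits in a single chunk.
short = time_chunk_length((12, 180, 360), ("time", "lat", "lon"), itemsize=4)
```

Because the result depends on each dataset's shape and dtype, the same function yields different chunk lengths for coarse and fine grids, which is the point of deciding chunking per dataset rather than hard-coding it.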
  11. OK, so what's up with the "why we should never do it again"?

    • Lower barrier of entry + Quick exploration of ideas = New CMIPers!
    • Teaching with the real data: Student ➡ Scientist 🚀
    • Not just for academics. Public/private sector uses the cloud data!