Slide 1

Slide 1 text

How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge
And why we should never do this again!
Julius Busecke and Charles Stern | Nov 28th 2023 | Pangeo Showcase
Award 8434 Awards 2026932, 2019625

Slide 2

Slide 2 text

Who am I?
• Physical Oceanographer
• Senior Staff Associate at Columbia University
• Manager of Data and Computing for LEAP
• Lead of Open Science for M2LInES
• Core Developer of xGCM, xMIP
• Pangeo Fan, User, and Member
• Open Source / Open Science Advocate
• Maintainer of the Pangeo CMIP6 Zarr stores

Slide 3

Slide 3 text

CMIP Climate Science: A global, distributed effort

Slide 4

Slide 4 text

Why move CMIP6 data to the cloud?

Slide 5

Slide 5 text

Because it's fast and easy!
Reproducible IPCC science in minutes: IPCC Chapter 9, 2-10 minutes on the LEAP-Pangeo JupyterHub
Code repository: https://github.com/jbusecke/presentation_wcrp_open_science_conference
SciPy talk: https://www.youtube.com/watch?v=7niNfs3ZpfQ

Slide 6

Slide 6 text

Because it's inclusive! Everyone can learn climate science from real climate data.

Slide 7

Slide 7 text

Because more and more people need this data
Collaboration on open data: Everybody wins!
[Diagram: users from academia and the private/public sector each improve the shared dataset (fixes naming, corrects units, tunes compression, uploads new data). Outcomes shown: science paper 🎉🎓, derived proprietary data product 💵🎊, and public stakeholders: everyone gets to explore the clean dataset!]

Slide 8

Slide 8 text

Challenges
• ~13 PB 👀
• How to prioritize the most desired datasets?
• Millions of datasets with variable size and naming need to be converted to Analysis-Ready Cloud-Optimized Zarr stores 🏗
• Each dataset consists of possibly many nc files, and some are not always available
• Many different grids and timesteps -> vastly different array shapes
• Ingestion is not just one way. Some datasets are retracted, and any mirror needs to reflect that. ⚠

Slide 9

Slide 9 text

The Proof of Concept
• Public storage buckets from Google and Amazon
• Collaborative effort in the Pangeo / ESGF Cloud Data Working Group
• Zarr stores manually processed by Naomi Henderson
• ~150k datasets uploaded based on user requests, with tons of manual labor in Jupyter notebooks
• Cataloging: intake-esm collection based on a large CSV file
• Retraction: retracted datasets were removed from the catalog but never deleted.
• Early upload enabled downstream work to use and improve the usability of the cloud data early on.
• But then Naomi retired 😱
https://github.com/jbusecke/xMIP
https://pangeo-data.github.io/pangeo-cmip6-cloud/
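The catalog-based retraction described on this slide can be pictured with a small sketch. It assumes a CSV catalog with `instance_id` and `zstore` columns; those column names, the bucket paths, and the dataset ids are hypothetical and only serve the illustration. Retracted rows are dropped from the catalog text while the stores themselves are left alone.

```python
import csv
import io

# Hypothetical catalog excerpt; column names and paths are assumptions.
CATALOG_CSV = """instance_id,zstore
CMIP6.CMIP.A.model-x.historical.r1i1p1f1.Omon.tos.gn.v20200101,gs://example-bucket/a.zarr
CMIP6.CMIP.B.model-y.historical.r1i1p1f1.Omon.tos.gn.v20200101,gs://example-bucket/b.zarr
"""


def drop_retracted(catalog_text, retracted_ids):
    """Return catalog CSV text without rows whose instance_id is retracted."""
    reader = csv.DictReader(io.StringIO(catalog_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only the catalog entry disappears; the Zarr store is not touched.
        if row["instance_id"] not in retracted_ids:
            writer.writerow(row)
    return out.getvalue()


retracted = {"CMIP6.CMIP.B.model-y.historical.r1i1p1f1.Omon.tos.gn.v20200101"}
cleaned = drop_retracted(CATALOG_CSV, retracted)
print(cleaned)
```

Because users discover data through the catalog, removing the row is enough to hide a retracted dataset, at the cost of leaving orphaned stores in the bucket.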

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Building a robot Naomi
• Pangeo-Forge
  • Open source platform for data Extraction, Transformation, Loading (ETL)
  • Encodes all information needed to recreate an ARCO copy
  • Originally designed for a few massive datasets
• CMIP6 is a unique use case: a massive number of small-ish datasets
• A massive refactor to Apache Beam finally enabled us to scale this ingestion
• Big shout-out to Charles Stern and all Pangeo-Forge maintainers.
https://github.com/pangeo-forge/pangeo-forge-recipes
https://pangeo-forge.readthedocs.io/en/latest/index.html

Slide 12

Slide 12 text

The anatomy of the CMIP6 recipe
• Recipe Steps
  • Start with a list of unique identifiers, each naming a dataset (made up of .nc files)
  • Dynamically request download URLs
  • Dynamic chunking (based on the size of each dataset)
  • Cataloging and testing inline
• Recipe Execution
  • Orchestrating thousands of single recipes in a dictionary
  • Currently runs on Google Dataflow but could run on any Beam runner!
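The "thousands of single recipes in a dictionary" idea can be sketched roughly as follows. This is not the actual recipe: the real steps are Apache Beam transforms, whereas here each recipe is a plain Python callable, and `fake_urls`, `make_recipe`, the ids, and the example.org URLs are hypothetical stand-ins.

```python
from typing import Callable, Dict, List


def fake_urls(instance_id: str) -> List[str]:
    """Hypothetical stand-in for the dynamic download-URL lookup."""
    return [f"https://example.org/{instance_id}/part{i}.nc" for i in range(2)]


def make_recipe(instance_id: str) -> Callable[[], dict]:
    """Build one self-contained recipe for a single dataset."""

    def run() -> dict:
        urls = fake_urls(instance_id)      # dynamically request download URLs
        store = f"{instance_id}.zarr"      # placeholder for writing the Zarr store
        return {"instance_id": instance_id, "n_files": len(urls), "store": store}

    return run


# Illustrative identifiers, not real datasets.
instance_ids = [
    "CMIP6.CMIP.A.model-x.historical.r1i1p1f1.Omon.tos.gn.v20200101",
    "CMIP6.CMIP.B.model-y.historical.r1i1p1f1.Omon.tos.gn.v20200101",
]

# One recipe per dataset, keyed by its unique identifier.
recipes: Dict[str, Callable[[], dict]] = {iid: make_recipe(iid) for iid in instance_ids}

# A runner would execute these in parallel; here they run one after another.
results = {iid: recipe() for iid, recipe in recipes.items()}
```

Keying by identifier keeps every dataset independent, so a failure in one recipe does not block the others.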

Slide 13

Slide 13 text

ESGF query: pangeo-forge-esgf
• Python client for the ESGF API
• Parse single instance_ids from wildcards
• Return HTTP download URLs for a given instance_id
• This is my first async code, so it can probably be vastly improved, but it seems reasonably fast.
• If there are better packages I should use, I'd love the feedback.
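To make "parse single instance_ids from wildcards" concrete, here is an offline sketch. It does not use pangeo-forge-esgf or the ESGF API: it matches a wildcard pattern against a local list, facet by facet, assuming the ten-facet dot-separated CMIP6 layout. The helper names and the version strings in the sample ids are illustrative assumptions.

```python
from fnmatch import fnmatchcase

# Assumed facet order of a CMIP6 instance_id.
FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def parse_instance_id(iid):
    """Split an instance_id into a {facet: value} dict."""
    parts = iid.split(".")
    if len(parts) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} facets, got {len(parts)}: {iid}")
    return dict(zip(FACETS, parts))


def expand_wildcard(pattern, known_ids):
    """Return the ids whose facets each match the corresponding pattern facet."""
    pattern_parts = pattern.split(".")
    matches = []
    for iid in known_ids:
        parts = iid.split(".")
        if len(parts) == len(pattern_parts) and all(
            fnmatchcase(value, pat) for value, pat in zip(parts, pattern_parts)
        ):
            matches.append(iid)
    return sorted(matches)


# Illustrative ids; a real client would get candidates from an ESGF search.
known_ids = [
    "CMIP6.CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.tos.gn.v20190726",
    "CMIP6.CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.thetao.gn.v20190726",
    "CMIP6.ScenarioMIP.NOAA-GFDL.GFDL-ESM4.ssp585.r1i1p1f1.Omon.tos.gn.v20180701",
]

selected = expand_wildcard("CMIP6.*.*.*.historical.*.Omon.tos.*.*", known_ids)
facets = parse_instance_id(selected[0])
```

Matching per facet keeps a `*` from swallowing a dot and accidentally spanning two facets.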

Slide 14

Slide 14 text

Dynamic Chunking
Pangeo Forge `StoreToZarr` accepts a custom function
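A custom chunking function of this kind might look like the sketch below. It assumes the function receives a dataset-like object exposing `.sizes` (dimension name to length) and `.nbytes`, as an `xarray.Dataset` does, and returns a `{dim: chunk_length}` dict. The function name, the 100 MB target, and the choice to split only along `time` are illustrative assumptions, and how the function is handed to `StoreToZarr` is not shown here.

```python
from types import SimpleNamespace


def chunk_along_time(ds, target_chunk_bytes=100_000_000):
    """Return {dim: chunk length}, splitting only along 'time'.

    Simplification: the whole dataset is treated as one array, so the
    result is only approximate for datasets with several variables.
    """
    chunks = dict(ds.sizes)            # default: one chunk spans each dimension
    if "time" not in chunks:
        return chunks
    n_time = chunks["time"]
    bytes_per_step = ds.nbytes / n_time
    steps = int(target_chunk_bytes // bytes_per_step)
    chunks["time"] = max(1, min(n_time, steps))
    return chunks


# Stand-in for a dataset: 1200 monthly steps of float32 data on a 1-degree grid.
fake_ds = SimpleNamespace(
    sizes={"time": 1200, "lat": 180, "lon": 360},
    nbytes=1200 * 180 * 360 * 4,
)
chunks = chunk_along_time(fake_ds)
```

Because the chunk length is derived from each dataset's own size, small and large datasets both end up with chunks near the same target size.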

Slide 15

Slide 15 text

Dynamic Chunking
`dynamic-chunks` implements algorithms to hit a target `chunk ratio`
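The idea of hitting a target ratio can be illustrated with a brute-force sketch. This is not the `dynamic-chunks` implementation. It assumes "chunk ratio" means the relative number of chunks along each dimension, and it searches all even divisors of each dimension for the combination closest to both a target chunk size and that ratio. The function names and the scoring are assumptions made for the example.

```python
import itertools
import math


def divisors(n):
    """All positive integers that divide n evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def pick_chunks(sizes, itemsize, target_bytes, target_ratio):
    """Pick evenly dividing chunks close to a target size and chunk-count ratio."""
    dims = list(sizes)
    ratio_total = sum(target_ratio[d] for d in dims)
    best, best_score = None, math.inf
    for combo in itertools.product(*(divisors(sizes[d]) for d in dims)):
        chunk_bytes = itemsize * math.prod(combo)
        # Size error in log space, so 2x too big and 2x too small cost the same.
        size_err = abs(math.log(chunk_bytes / target_bytes))
        # Number of chunks along each dimension, compared as normalized shares.
        counts = [sizes[d] // c for d, c in zip(dims, combo)]
        count_total = sum(counts)
        ratio_err = sum(
            abs(n / count_total - target_ratio[d] / ratio_total)
            for d, n in zip(dims, counts)
        )
        score = size_err + ratio_err
        if score < best_score:
            best, best_score = dict(zip(dims, combo)), score
    return best


sizes = {"time": 120, "lat": 90, "lon": 180}
best_chunks = pick_chunks(
    sizes,
    itemsize=4,
    target_bytes=777_600,                       # 4 bytes * 12 * 90 * 180
    target_ratio={"time": 10, "lat": 1, "lon": 1},
)
```

The exhaustive search is only practical for dimensions with few divisors; it is meant to show the trade-off between size and ratio, not to scale.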

Slide 16

Slide 16 text

Dynamic Chunking: Bring your own!
• Preserve monthly chunks for some fancy calendar once Zarr allows unequal chunks.
• Chunk one dimension only if a certain other dimension is available
• ...
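As one example of a bring-your-own rule, the sketch below chunks one dimension only when another is present. The dimension names (`time`, `lev`), the default chunk lengths, and the function name are hypothetical choices for illustration.

```python
def conditional_chunks(sizes, time_chunk=120, lev_chunk=1):
    """Return {dim: chunk length} from a {dim: length} mapping.

    'lev' is split only when 'time' is also a dimension; every other
    dimension stays in a single chunk.
    """
    chunks = dict(sizes)
    if "time" in sizes:
        chunks["time"] = min(time_chunk, sizes["time"])
        # Split vertical levels only for time-varying fields.
        if "lev" in sizes:
            chunks["lev"] = min(lev_chunk, sizes["lev"])
    return chunks


with_time = conditional_chunks({"time": 1980, "lev": 35, "y": 100, "x": 200})
without_time = conditional_chunks({"lev": 35, "y": 100, "x": 200})
```

Rules like this stay small because they only need the dimension names and lengths, not the data itself.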

Slide 17

Slide 17 text

Ok, so what's up with the "why we should never do it again"?
Lower barrier of entry + Quick exploration of ideas = New CMIPers!
Teaching with the real data: Student ➡ Scientist 🚀
Not just for academics. The public and private sectors use the cloud data!

Slide 18

Slide 18 text

How to get involved

Slide 19

Slide 19 text

I ❤ Feedback, questions, contributions. jbusecke juliusbusecke.com @[email protected]

Slide 20

Slide 20 text

No content