Slide 1

Slide 1 text

Analysis-Ready Cloud-Optimized Data for your community and the entire world with Pangeo-Forge
Julius Busecke | April 29th 2024 | NCAR ESDS Forum
M²LInES

Slide 2

Slide 2 text

Who am I?
• Physical Oceanographer
• Senior Staff Associate at Columbia University
• Manager of Data and Computing for LEAP
• Lead of Open Science for M2LInES
• Core Developer of xGCM, xMIP
• Pangeo Fan, User, and Member
• Open Source/Open Science Advocate
• Maintainer of the Pangeo CMIP6 zarr stores

Slide 3

Slide 3 text

Why even bother about open science?

Slide 4

Slide 4 text

Agile Science - Speed counts! Idea 💡 Result ✅

Slide 6

Slide 6 text

Agile Science - Speed counts! Idea 💡 Result ✅ Tech/Infrastructure limited Understanding limited

Slide 7

Slide 7 text

Agile Science - Speed counts!
Idea 💡 Result ✅
Tech/Infrastructure limited
How to speed up:
- Open/Fast Access to data
- Community OSS tools
- Infrastructure Support from Private Sector?
Understanding limited

Slide 8

Slide 8 text

Agile Science - Speed counts!
Idea 💡 Result ✅
Tech/Infrastructure limited
How to speed up:
- Open/Fast Access to data
- Community OSS tools
- Infrastructure Support from Private Sector?
Understanding limited
How to speed up:
- Collaboration
- More Stakeholders
- Reproducibility

Slide 9

Slide 9 text

[Diagram: research cycle with the nodes Idea 💡, Test on additional data 💽 ⚙, Collaborate with other researchers 💬, Revised Hypothesis 💡 (shown twice), and Publish Results]

Slide 10

Slide 10 text

Open Data is the bedrock of Open Science

Slide 11

Slide 11 text

Because it's fast and easy!
Reproducible IPCC Science in Minutes
IPCC Chapter 9: 2-10 minutes on LEAP-Pangeo JupyterHub
Code Repository: https://github.com/jbusecke/presentation_wcrp_open_science_conference
SciPy Talk: https://www.youtube.com/watch?v=7niNfs3ZpfQ

Slide 12

Slide 12 text

Because it's inclusive!
Everyone can learn Climate Science from real climate data

Slide 13

Slide 13 text

Because more and more people need climate data
Collaboration on open data: Everybody wins!
[Diagram: contributors from Academia and the Private Sector / Public Sector work on a shared dataset. Contributions: Fixes naming, Corrects units, Tunes Compression, Uploads new data. Outcomes: Science Paper 🎉🎓, Derived Proprietary Data Product 💵🎊. Public Stakeholder: Everyone gets to explore the clean dataset!]

Slide 14

Slide 14 text

People love this way of accessing (CMIP) data in the cloud!
Lower barrier to entry + Quick exploration of ideas = New CMIPers!
Teaching with the real data: Student ➡ Scientist 🚀
Not just for academics. The public and private sectors use the cloud data!

Slide 15

Slide 15 text

The key ingredients
Storage: Publicly accessible raw data via S3 interface
Ingestion: Reproducible methods to transform data into Analysis-Ready Cloud-Optimized Formats
Discovery: A reliable way for researchers to discover, cite, and learn more about datasets

Slide 16

Slide 16 text

Collaborative Science means Community Hopping
Portability of tools is key
[Diagram: communities of researchers with compute 🖥 and data 💽; marker: 😀]
Data maintained by the community but publicly accessible from outside
Data maintained by the community but only accessible from community resources

Slide 17

Slide 17 text

Collaborative Science means Community Hopping
Portability of tools is key
[Diagram: communities of researchers with compute 🖥 and data 💽; marker: 😡 🚫]
Data maintained by the community but publicly accessible from outside
Data maintained by the community but only accessible from community resources

Slide 18

Slide 18 text

Collaborative Science means Community Hopping
Portability of tools is key
[Diagram: communities of researchers with compute 🖥 and data 💽; marker: 💽 💵⏳]
Data maintained by the community but publicly accessible from outside
Data maintained by the community but only accessible from community resources

Slide 19

Slide 19 text

Collaborative Science means Community Hopping
Portability of tools is key
[Diagram: communities of researchers with compute 🖥 and data 💽; marker: 😀]
Data maintained by the community but publicly accessible from outside
Data maintained by the community but only accessible from community resources

Slide 21

Slide 21 text

Collaborative Science means Community Hopping
Portability of tools is key
[Diagram: communities of researchers with compute 🖥 and data 💽; markers: 😀 😀 😀]

Slide 23

Slide 23 text

Future Vision
Community maintained/curated data
• Separation between Data and Compute!
  - We don't know what the demands in the future will be!
• Public access more important than fastest compute IMO!
  - The time saved in migrating/downloading data for many researchers is huge!
• The boundary of communities becomes more permeable for collaboration
  - I can run the same code on the same data even though I am in another part of the world

Slide 24

Slide 24 text

LEAP-Pangeo

Slide 25

Slide 25 text

ARCO Ingestion - The community Bakery
Metadata
Recipe
Beam Runners:
- Local
- Google Dataflow
- Amazon Flink
- PySpark (coming soon)
- Dask (coming soon)

Slide 27

Slide 27 text

Pangeo/ESGF CMIP6 Zarr Data Request some new data!

Slide 29

Slide 29 text

The anatomy of the CMIP6 recipe

Slide 30

Slide 30 text

The anatomy of the CMIP6 recipe
• Recipe Steps
  • Start with a list of unique identifiers of a dataset (made up of .nc files)

Slide 31

Slide 31 text

The anatomy of the CMIP6 recipe
• Recipe Steps
  • Start with a list of unique identifiers of a dataset (made up of .nc files)
  • Dynamically request download URLs

Slide 32

Slide 32 text

The anatomy of the CMIP6 recipe
• Recipe Steps
  • Start with a list of unique identifiers of a dataset (made up of .nc files)
  • Dynamically request download URLs
  • Dynamic chunking (based on the size of each dataset)

Slide 33

Slide 33 text

The anatomy of the CMIP6 recipe
• Recipe Steps
  • Start with a list of unique identifiers of a dataset (made up of .nc files)
  • Dynamically request download URLs
  • Dynamic chunking (based on the size of each dataset)
  • Cataloging and Testing inline

Slide 34

Slide 34 text

The anatomy of the CMIP6 recipe
• Recipe Steps
  • Start with a list of unique identifiers of a dataset (made up of .nc files)
  • Dynamically request download URLs
  • Dynamic chunking (based on the size of each dataset)
  • Cataloging and Testing inline
• Recipe Execution

Slide 35

Slide 35 text

The anatomy of the CMIP6 recipe
• Recipe Steps
  • Start with a list of unique identifiers of a dataset (made up of .nc files)
  • Dynamically request download URLs
  • Dynamic chunking (based on the size of each dataset)
  • Cataloging and Testing inline
• Recipe Execution
  • Orchestrating thousands of single recipes in a dictionary

Slide 37

Slide 37 text

The anatomy of the CMIP6 recipe
• Recipe Steps
  • Start with a list of unique identifiers of a dataset (made up of .nc files)
  • Dynamically request download URLs
  • Dynamic chunking (based on the size of each dataset)
  • Cataloging and Testing inline
• Recipe Execution
  • Orchestrating thousands of single recipes in a dictionary
  • Currently runs on Google Dataflow but could run on any Beam runner!
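
The recipe steps and the dictionary-based orchestration listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code of the CMIP6 feedstock: `lookup_urls`, `build_recipe`, the step labels, and the example identifiers are all hypothetical, and the URL lookup is a stub that makes no network request.

```python
from typing import Callable, Dict, List


def lookup_urls(iid: str) -> List[str]:
    """Hypothetical stub for the dynamic URL request; a real recipe would ask ESGF."""
    return [f"https://example.org/{iid}/file_{i}.nc" for i in range(2)]


def build_recipe(iid: str, url_lookup: Callable[[str], List[str]]) -> Dict[str, object]:
    """Describe one recipe as plain data: its inputs and its ordered steps."""
    return {
        "iid": iid,
        "urls": url_lookup(iid),
        # Step names are illustrative labels, not real transform names.
        "steps": ["open_files", "choose_chunks", "write_zarr", "test", "catalog"],
    }


# Made-up identifiers; a real run would list thousands of datasets here.
iids = [
    "CMIP6.CMIP.INST-A.MODEL-A.historical.r1i1p1f1.Omon.tos.gn.v20200101",
    "CMIP6.CMIP.INST-B.MODEL-B.historical.r1i1p1f1.Omon.tos.gn.v20200101",
]

# One recipe per dataset, keyed by its identifier.
recipes = {iid: build_recipe(iid, lookup_urls) for iid in iids}
```

Keying the dictionary by identifier means a single dataset can be rerun or skipped without touching the others.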

Slide 38

Slide 38 text

ESGF query pangeo-forge-esgf

Slide 39

Slide 39 text

ESGF query
pangeo-forge-esgf
• Python client for the ESGF API

Slide 40

Slide 40 text

ESGF query
pangeo-forge-esgf
• Python client for the ESGF API
• Parse single instance_ids from wildcards

Slide 41

Slide 41 text

ESGF query
pangeo-forge-esgf
• Python client for the ESGF API
• Parse single instance_ids from wildcards
• Return http download urls for a given instance_id

Slide 42

Slide 42 text

ESGF query
pangeo-forge-esgf
• Python client for the ESGF API
• Parse single instance_ids from wildcards
• Return http download urls for a given instance_id
• This is my first async code, so it can probably be vastly improved, but it seems reasonably fast.

Slide 43

Slide 43 text

ESGF query
pangeo-forge-esgf
• Python client for the ESGF API
• Parse single instance_ids from wildcards
• Return http download urls for a given instance_id
• This is my first async code, so it can probably be vastly improved, but it seems reasonably fast.
• If there are better packages I should use, I'd love the feedback
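
To make "parse single instance_ids from wildcards" concrete, here is a rough sketch of facet-wise wildcard matching. It does not use or reproduce the pangeo-forge-esgf API: `FACETS`, `split_iid`, `matches`, `expand_wildcard`, and the identifiers in `KNOWN_IIDS` are hypothetical, and the matching runs against a small local list instead of querying ESGF. The facet names assume the ten-part, dot-separated CMIP6 identifier layout.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Assumed facet order of a dot-separated CMIP6 instance_id.
FACETS = [
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
]

# Made-up identifiers standing in for what a search service would return.
KNOWN_IIDS = [
    "CMIP6.CMIP.INST-A.MODEL-A.historical.r1i1p1f1.Omon.tos.gn.v20200101",
    "CMIP6.CMIP.INST-B.MODEL-B.historical.r1i1p1f1.Omon.tos.gn.v20200101",
    "CMIP6.ScenarioMIP.INST-A.MODEL-A.ssp585.r1i1p1f1.Omon.tos.gn.v20200101",
]


def split_iid(iid: str) -> Dict[str, str]:
    """Split an instance_id into a facet-name -> value mapping."""
    parts = iid.split(".")
    if len(parts) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} facets, got {len(parts)}: {iid}")
    return dict(zip(FACETS, parts))


def matches(pattern: str, iid: str) -> bool:
    """Compare facet by facet so that '*' never spans more than one facet."""
    p_parts, i_parts = pattern.split("."), iid.split(".")
    return len(p_parts) == len(i_parts) and all(
        fnmatchcase(value, pat) for value, pat in zip(i_parts, p_parts)
    )


def expand_wildcard(pattern: str, known_iids: List[str]) -> List[str]:
    """Return every known instance_id that the wildcard pattern selects."""
    return sorted(iid for iid in known_iids if matches(pattern, iid))


historical_tos = expand_wildcard("CMIP6.*.*.*.historical.*.Omon.tos.gn.*", KNOWN_IIDS)
```

Matching per facet keeps a pattern such as `CMIP6.*.*.*.historical.*.Omon.tos.gn.*` from accidentally selecting identifiers with a different number of facets.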

Slide 44

Slide 44 text

Dynamic Chunking
Pangeo Forge `StoreToZarr` accepts a custom function
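
One way such a custom function could pick chunks from the size of each dataset is sketched below. This is an assumption-laden illustration and not the function used in the CMIP6 recipe: `chunks_along_time`, the 100 MiB target, and the choice to chunk only along `time` are hypothetical, and the function takes a plain dictionary of dimension sizes rather than the dataset object a real chunking function would receive.

```python
import math
from typing import Dict


def chunks_along_time(
    dim_sizes: Dict[str, int],
    itemsize: int,
    target_bytes: int = 100 * 2**20,  # assumed target of roughly 100 MiB per chunk
    time_dim: str = "time",
) -> Dict[str, int]:
    """Pick how many time steps fit into one chunk of about target_bytes."""
    # Bytes needed for a single time step across all other dimensions.
    bytes_per_step = itemsize * math.prod(
        size for dim, size in dim_sizes.items() if dim != time_dim
    )
    # At least one step per chunk, at most the full time axis.
    steps = max(1, min(dim_sizes[time_dim], target_bytes // bytes_per_step))
    return {time_dim: steps}


# A coarse 2D field gets many time steps per chunk ...
coarse = chunks_along_time({"time": 1980, "lat": 180, "lon": 360}, itemsize=4)
# ... while a large 3D field falls back to one time step per chunk.
fine = chunks_along_time({"time": 12, "lev": 75, "lat": 1080, "lon": 1440}, itemsize=4)
```

The point of deriving the chunk length from the data is that thousands of datasets with very different grids can share one recipe without hand-tuned chunk sizes.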

Slide 46

Slide 46 text

Lessons learned:
• Data ingestion for a community or the public is hard work, but it saves many people time and adds new members to the research discourse.
• It should be funded/acknowledged appropriately!
• Relying on short-term grant funding is risky.
• Pangeo-Forge has come a long way, in part due to these efforts! It is a good time to get involved! https://github.com/pangeo-forge
• This vision does not depend on commercial cloud storage, just on the idea of exposing data to the public in a 'cloud-like' fashion.

Slide 47

Slide 47 text

How to get involved
https://github.com/leap-stc/data-management
https://github.com/leap-stc/cmip6-leap-feedstock

Slide 48

Slide 48 text

I ❤ questions + comments
jbusecke
juliusbusecke.com
@JuliusBusecke
@[email protected]
@codeandcurrents.bsky.social

Slide 49

Slide 49 text

Discussion
[Diagram: communities of researchers with compute 🖥 and data 💽; markers: 😀 😀 😀]