Slide 1

Slide 1 text

Open Climate Data for Agile and Inclusive Science Communities Julius Busecke | Sep 5 2024 | US Clivar P

Slide 2

Slide 2 text

Who am I?
🌊 Climate Scientist
Ocean transport of heat, carbon, and oxygen. Impact of small-scale processes on global climate variability.
🤓 Open Science Nerd/Advocate
Maintainer: Pangeo CMIP6 Cloud Data, xMIP/xGCM
⚙ Integration Engineer
Manager for Data and Computation - NSF-LEAP
Lead of Open Research - m2lines (M²LInES)
jbusecke | juliusbusecke.com | @JuliusBusecke | @[email protected] | @codeandcurrents.bsky.social

Slide 3

Slide 3 text

How do we advance science faster? Idea 💡 Result ✅

Slide 4

Slide 4 text

How do we advance science faster? Idea 💡 Result ✅ Tech/Infrastructure limited Understanding limited

Slide 5

Slide 5 text

How do we advance science faster?
Idea 💡 Result ✅
Tech/Infrastructure limited
Understanding limited
How to speed up:
- Collaboration
- More Stakeholders
- Reproducibility

Slide 6

Slide 6 text

How do we advance science faster?
Idea 💡 Result ✅
Tech/Infrastructure limited
Understanding limited
How to speed up:
- Collaboration
- More Stakeholders
- Reproducibility
How to speed up:
- Open/Fast Access to data
- Community OSS tools

Slide 7

Slide 7 text

How do we advance science faster?
Truly open access to ARCO data!
Idea 💡 Result ✅
Tech/Infrastructure limited
Understanding limited
How to speed up:
- Collaboration
- More Stakeholders
- Reproducibility
How to speed up:
- Open/Fast Access to data
- Community OSS tools

Slide 8

Slide 8 text

Analysis-Ready Cloud-Optimized (ARCO) data

Slide 9

Slide 9 text

Analysis-Ready Cloud-Optimized (ARCO) data
Analysis-Ready:
• Think in “Datasets/Datacubes”, not “files” and “folders”
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged
Chunked appropriately for analysis
Rich metadata
Everything in one dataset object

Slide 10

Slide 10 text

Analysis-Ready Cloud-Optimized (ARCO) data
Analysis-Ready:
• Think in “Datasets/Datacubes”, not “files” and “folders”
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged
Cloud Optimized:
• Compatible with object storage (access via HTTP)
• Supports lazy access, intelligent subsetting, and streaming access
• Integrates with high-level analysis libraries and distributed frameworks for high parallel throughput
Abernathey et al., "Cloud-Native Repositories for Big Scientific Data," 2021, doi: 10.1109/MCSE.2021.3059437
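To make "intelligent subsetting" concrete, here is a minimal sketch (not taken from the talk) of why chunked storage lets a client fetch only part of a dataset. It assumes a hypothetical daily field of shape (time, lat, lon) = (3650, 180, 360) stored in chunks of (365, 90, 180), and it names chunks with dot-separated indices in the style of Zarr v2 chunk keys. The function names `chunks_for_selection` and `total_chunks` are illustrative only and are not part of any library.

```python
from itertools import product
from math import ceil


def chunks_for_selection(shape, chunks, selection):
    """Return keys of the chunks that overlap a selection.

    shape, chunks: array size and chunk size per dimension.
    selection: one slice per dimension (step 1 is assumed).
    """
    per_dim = []
    for size, chunk, sel in zip(shape, chunks, selection):
        # Resolve open-ended slices such as slice(None) against the size.
        start, stop, _ = sel.indices(size)
        first = start // chunk
        last = (stop - 1) // chunk
        per_dim.append(range(first, last + 1))
    # Dot-separated chunk indices, e.g. "0.1.0" (assumed key style).
    return [".".join(map(str, idx)) for idx in product(*per_dim)]


def total_chunks(shape, chunks):
    """Count all chunks in the array."""
    n = 1
    for size, chunk in zip(shape, chunks):
        n *= ceil(size / chunk)
    return n


# Hypothetical dataset layout (an assumption for illustration).
shape = (3650, 180, 360)
chunks = (365, 90, 180)

# First 30 days, a 20-row latitude band, all longitudes.
selection = (slice(0, 30), slice(100, 120), slice(None))

needed = chunks_for_selection(shape, chunks, selection)
print(f"fetch {len(needed)} of {total_chunks(shape, chunks)} chunks: {needed}")
```

With object storage, each of these keys would map to one object that can be requested over HTTP, so the selection above touches only a small fraction of the stored data. How well this works in practice depends on choosing chunk shapes that match the typical analysis pattern.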

Slide 11

Slide 11 text

Pangeo CMIP6 Cloud Data ESGF

Slide 12

Slide 12 text

Pangeo CMIP6 Cloud Data ESGF Everybody rolls their own Custom Code Custom Code Custom Code University Lab Industry ❌ ✋🚫

Slide 13

Slide 13 text

Pangeo CMIP6 Cloud Data ESGF Ingestion Pipeline A single data repository in the cloud serves all use cases Everybody rolls their own Custom Code Custom Code Custom Code University Lab Industry ❌ ✋🚫

Slide 14

Slide 14 text

People love this way of accessing (CMIP) data in the cloud!
Lower barrier of entry + Quick exploration of ideas = New CMIPers!
Teaching with the real data
Student ➡ Scientist 🚀
Not just for academics. Public/private sector uses the cloud data!

Slide 15

Slide 15 text

Pangeo/ESGF CMIP6 Zarr Data Request some new data! https://github.com/leap-stc/cmip6-leap-feedstock

Slide 16

Slide 16 text

Pangeo/ESGF CMIP6 Zarr Data Request some new data! https://github.com/leap-stc/cmip6-leap-feedstock

Slide 17

Slide 17 text

Why do process studies need ARCO data?
Planning: Improve observational design, resource allocation, and the targeting of simulations by giving researchers across institutions easy and fast access to data.
Leverage Outside Expertise: Cross-discipline collaboration across scientific fields, engaging the ML community, industry, and the non-profit sector.
Legacy: Reuse of observational data and modeling beyond the initial study.

Slide 18

Slide 18 text

Collaborative Science means Community Hopping
Portability of tools is key
🧑‍💻 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 😀 🧑‍💻 💽 💽
Data maintained by the community but publicly accessible from outside
Data maintained by the community but only accessible from community resources

Slide 19

Slide 19 text

Collaborative Science means Community Hopping
Portability of tools is key
🧑‍💻 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 🧑‍💻 💽 💽
Data maintained by the community but publicly accessible from outside
Data maintained by the community but only accessible from community resources
😡 🚫

Slide 20

Slide 20 text

Collaborative Science means Community Hopping
Portability of tools is key
🧑‍💻 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 🧑‍💻 💽 💽
Data maintained by the community but publicly accessible from outside
Data maintained by the community but only accessible from community resources
💽 💵⏳

Slide 21

Slide 21 text

Collaborative Science means Community Hopping
Portability of tools is key
👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 🧑‍💻 💽 💽
Data maintained by the community but publicly accessible from outside
Data maintained by the community but only accessible from community resources
😀

Slide 22

Slide 22 text

Collaborative Science means Community Hopping
Portability of tools is key
👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 🧑‍💻 💽 💽
Data maintained by the community but publicly accessible from outside
Data maintained by the community but only accessible from community resources
😀

Slide 23

Slide 23 text

Collaborative Science means Community Hopping
Portability of tools is key
👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 🧑‍💻 💽 💽 😀 😀 😀

Slide 24

Slide 24 text

Collaborative Science means Community Hopping
Portability of tools is key
👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 👩‍💻 👩‍💻 👨‍💻 🧑‍💻 🖥 💽 🧑‍💻 💽 💽 😀 😀 😀

Slide 25

Slide 25 text

Maintaining a community ARCO dataset
How we do it at LEAP/m2lines
Data Expert: Knows about the dataset (where files are stored, how they are named, how to add metadata to make the data more useful). Encodes this knowledge into a recipe.
Infrastructure Expert: Knows how to execute the recipe on pipelines and how to populate cloud storage.
Science Expert: Finds ARCO data in the catalog and uses the data for science. Provides feedback.
More info on the LEAP Data Ingestion: https://leap-stc.github.io/guides/data_guide.html#ingesting-datasets-into-cloud-storage

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Challenges
We have to admit that we are not even close yet!
• The reality still looks very different!
• The majority of our science is not accessible beyond institutional barriers.
• There is near-zero enforcement of meaningful requirements for open data by publishers.
• It is hard work to produce and maintain ARCO data!
• This should not be done by students and postdocs alone! Researchers should not have to become DevOps Engineers to do their work!
• But we need better ways of acknowledging the data engineering work within science!

Slide 29

Slide 29 text

Challenges
⚠ Opinionated Take ⚠
• The world around us is rapidly changing. Doing 'science the way it was always done' is not sufficient anymore!
• Academia and government labs are not the only players anymore.
• Working together as much as possible is imperative to deal with the climate crisis and interrelated crises.
• Access to science is not just a human right, it will also reduce the amount of toil in the scientific community!
• Investing in truly public open data for climate science is a long-term investment in science in general, and in our collective mental health in particular.
"Everyone has the right to freely participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits." (UN Declaration of Human Rights)

Slide 30

Slide 30 text

Challenges (continued)
⚠ Opinionated Take ⚠
• Let's put our money and time where our mouth is!
• We should all lobby our employers, funding agencies, and colleagues to embrace open and cloud-native access to the datasets we produce!

Slide 31

Slide 31 text

We can do this, together!

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

23

Slide 34

Slide 34 text

LEAP-Pangeo https://leap-stc.github.io/intro.html