Julius Busecke
September 16, 2024

Open Climate Data for Agile and Inclusive Science Communities

Slides for "2024-09-05_Open Climate Data for Agile and Inclusive Science Communities", presented at the 2024 US CLIVAR PSMI Panel on September 5, 2024 by Julius Busecke.

Transcript

  1. Who am I?

    🌊 Climate Scientist: Ocean transport of heat, carbon, and oxygen. Impact of small-scale processes on global climate variability. 🤓 Open Science Nerd / Advocate. Maintainer: Pangeo CMIP6 Cloud Data, xMIP/xGCM. ⚙ Integration Engineer. Manager for Data and Computation - NSF-LEAP. Lead of Open Research - m2lines (M²LInES). jbusecke, juliusbusecke.com, @JuliusBusecke, @[email protected], @codeandcurrents.bsky.social
  2. How do we advance science faster? Idea 💡 Result ✅

    Tech/Infrastructure limited Understanding limited
  3. How do we advance science faster? Idea 💡 Result ✅

    Tech/Infrastructure limited Understanding limited How to speed up: - Collaboration - More Stakeholders - Reproducibility
  4. How do we advance science faster? Idea 💡 Result ✅

    Tech/Infrastructure limited Understanding limited How to speed up: - Collaboration - More Stakeholders - Reproducibility How to speed up: - Open/Fast Access to data - Community OSS tools
  5. How do we advance science faster?

    Truly open access to ARCO data! Idea 💡 Result ✅ Tech/Infrastructure limited Understanding limited How to speed up: - Collaboration - More Stakeholders - Reproducibility How to speed up: - Open/Fast Access to data - Community OSS tools
  6. Analysis-Ready Cloud-Optimized (ARCO) data

    Analysis-Ready: • Think in “Datasets/Datacubes”, not “files” and “folders” • No need for tedious homogenizing / cleaning steps • Curated and cataloged. Chunked appropriately for analysis. Rich metadata. Everything in one dataset object.
  7. Analysis-Ready Cloud-Optimized (ARCO) data

    Analysis-Ready: • Think in “Datasets/Datacubes”, not “files” and “folders” • No need for tedious homogenizing / cleaning steps • Curated and cataloged. Cloud-Optimized: • Compatible with object storage (access via HTTP) • Supports lazy access, intelligent subsetting, and streaming access • Integrates with high-level analysis libraries and distributed frameworks for high parallel throughput. Abernathey et al., "Cloud-Native Repositories for Big Scientific Data," 2021, doi: 10.1109/MCSE.2021.3059437
  8. Pangeo CMIP6 Cloud Data

    ESGF. Everybody rolls their own: Custom Code, Custom Code, Custom Code. University, Lab, Industry ❌ ✋🚫
  9. Pangeo CMIP6 Cloud Data

    ESGF, Ingestion Pipeline. A single data repository in the cloud serves all use cases. Everybody rolls their own: Custom Code, Custom Code, Custom Code. University, Lab, Industry ❌ ✋🚫
  10. People love this way of accessing (CMIP) data in the cloud!

    Lower barrier of entry + Quick exploration of ideas = New CMIPers! Teaching with the real data: Student ➡ Scientist 🚀 Not just for academics. Public/private sector uses the cloud data!
  11. Why do process studies need ARCO data?

    Planning: Improve observational design, resource allocation, and more targeted simulations by having easy and fast access to data for researchers across institutions. Leverage Outside Expertise: Cross-discipline collaboration across scientific fields, engaging the ML community, industry, and the non-profit sector. Legacy: Reuse of observational data and modeling beyond the initial study.
  12. Collaborative Science means Community Hopping

    Portability of tools is key. [Diagram: researchers, compute, and data stores across several communities] 😀 Data maintained by the community but publicly accessible from outside. Data maintained by the community but only accessible from community resources.
  13. Collaborative Science means Community Hopping

    Portability of tools is key. [Diagram as on slide 12] Data maintained by the community but publicly accessible from outside. Data maintained by the community but only accessible from community resources. 😡 🚫
  14. Collaborative Science means Community Hopping

    Portability of tools is key. [Diagram as on slide 12] Data maintained by the community but publicly accessible from outside. Data maintained by the community but only accessible from community resources. 💽 💵⏳
  15. Collaborative Science means Community Hopping

    Portability of tools is key. [Diagram as on slide 12] Data maintained by the community but publicly accessible from outside. Data maintained by the community but only accessible from community resources. 😀
  16. Collaborative Science means Community Hopping

    Portability of tools is key. [Diagram as on slide 12] Data maintained by the community but publicly accessible from outside. Data maintained by the community but only accessible from community resources. 😀
  17. Collaborative Science means Community Hopping

    Portability of tools is key. [Diagram as on slide 12] 😀 😀 😀
  18. Collaborative Science means Community Hopping

    Portability of tools is key. [Diagram as on slide 12] 😀 😀 😀
  19. Maintaining a community ARCO dataset

    How we do it at LEAP/m2lines. Data Expert: Knows about the dataset (where files are stored, how they are named, how to add metadata to make the data more useful). Encodes this knowledge into a recipe. Infrastructure Expert: Knows how to execute the recipe on pipelines and how to populate cloud storage. Science Expert: Finds ARCO data in the catalog and uses the data for science. Provides feedback. More info on the LEAP Data Ingestion: https://leap-stc.github.io/guides/data_guide.html#ingesting-datasets-into-cloud-storage
  20. Challenges

    We have to admit that we are not even close yet! • The reality still looks very different! • The majority of our science is not accessible beyond institutional barriers • There is near-zero enforcement of meaningful requirements for open data by publishers. • It is hard work to produce and maintain ARCO data! • This should not be done by students and postdocs alone! Researchers should not have to become DevOps Engineers to do their work! • But we need better ways of acknowledging the data engineering work within science!
  21. Challenges

    ⚠ Opinionated Take ⚠ • The world around us is rapidly changing. Doing 'science the way it was always done' is not sufficient anymore! • Academia and government labs are not the only players anymore. • Working together as much as possible is imperative to deal with the climate crisis and interrelated crises • Access to science is not just a human right, it will also reduce the amount of toil in the scientific community! • Investing in truly public open data for climate science is a long-term investment in science in general, and our collective mental health in particular. "Everyone has the right to freely participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits." UN Declaration of Human Rights
  22. Challenges (continued)

    ⚠ Opinionated Take ⚠ • Let's put our money and time where our mouth is! • We should all lobby our employers, funding agencies, and colleagues to embrace open and cloud-native access to the datasets we produce!
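
Slide 7 lists lazy access and intelligent subsetting among the cloud-optimized properties. The following is a minimal sketch of the idea behind them, assuming an array stored as a regular grid of chunks in object storage: a reader works out which chunks overlap the requested region and fetches only those. The function names, array shape, and chunk sizes are hypothetical illustrations, not taken from the talk or from any particular library.

```python
from itertools import product
from math import prod


def chunks_for_selection(shape, chunk_shape, selection):
    """Return the grid indices of every chunk that overlaps `selection`.

    shape: full array shape, e.g. (time, lat, lon)
    chunk_shape: size of one stored chunk along each dimension
    selection: one half-open (start, stop) index range per dimension
    """
    per_dim = []
    for size, chunk, (start, stop) in zip(shape, chunk_shape, selection):
        if not 0 <= start < stop <= size:
            raise ValueError(f"selection ({start}, {stop}) outside 0..{size}")
        first = start // chunk        # chunk holding the first requested element
        last = (stop - 1) // chunk    # chunk holding the last requested element
        per_dim.append(range(first, last + 1))
    # Every combination of per-dimension chunk indices is one chunk to fetch.
    return list(product(*per_dim))


def total_chunks(shape, chunk_shape):
    """Number of chunks in the whole array (ceiling division per dimension)."""
    return prod(-(-size // chunk) for size, chunk in zip(shape, chunk_shape))


# Hypothetical monthly field: 1980 time steps on a 180 x 360 grid,
# stored in chunks of 120 time steps x 90 x 90 grid points.
shape = (1980, 180, 360)
chunk_shape = (120, 90, 90)

# Request: the first 120 months of a 30 x 60 grid-point region.
selection = ((0, 120), (60, 90), (200, 260))

needed = chunks_for_selection(shape, chunk_shape, selection)
print(f"{len(needed)} of {total_chunks(shape, chunk_shape)} chunks needed")
```

Because each chunk can be requested independently over HTTP, a small regional request like this one touches only a small fraction of the stored data. How well that works in practice depends on the chunk layout matching the typical access pattern, which is why slide 6 lists "chunked appropriately for analysis" as part of being analysis-ready.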