Going both far AND fast as a community - Climate Science with Pangeo

Climate Science with Pangeo: Going both far AND fast as a community
Julius Busecke | LANL COSIM Climate and Ocean Group Webinar | 05/18/2022

Who am I?
Physical Oceanographer: studies the role of ocean currents for the climate and ecosystems.
Associate Research Scientist - Columbia University
Core-Developer of xgcm
Core-Developer of cmip6_preprocessing
Maintainer of the Pangeo CMIP6 Cache
Pangeo Fan, User, and Member
Open Source/Open Science Advocate

Outline
• Why should we change the way we do science?
• How can we do it?
• It already works!

Source: https://www.un.org/en/global-issues/climate-change

Credit: NASA's Goddard Space Flight Center https://earthdata.nasa.gov/eosdis/cloud-evolution
SWOT NISAR

Climate Data is not a niche product anymore

How do we get to work with all this data?
Download files via FTP / OPeNDAP / etc.

MB 😀 FTP / OPeNDAP / etc.

GB 😐 FTP / OPeNDAP / etc.

TB 😖 FTP / OPeNDAP / etc.

PB 😱 FTP / OPeNDAP / etc.

So what should people do?

Get Frustrated?

Work with only a few models/ members/variables

Privileged Institutions create “Data Fortresses*”
❌ Results not reproducible outside fortress
❌ Barrier to collaboration
❌ Inefficient / duplicative
❌ Can’t scale to future data needs
❌ Limits inclusion and knowledge transfer
*Coined by Chelle Gentemann
Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons

Bring the compute to the data! But how?

Use a “Platform”

The Trouble with “Platforms”
Scientists’ creativity often exceeds pre-baked capabilities.
  Desire to go under the hood
What if you want to access data that isn’t included?
  Data catalog is determined by provider, not users
Platforms are “single instance”:
  Fear of lock-in, possibility platform will disappear

OPEN Cloud Architecture
Data Provider’s $ | Data Consumer’s $
Interactive Computing
Parallel Computing
Analysis Ready Data, Cloud Optimized Formats

Open Ocean Cloud Will be a “Data Watering Hole*”
Data Library | Compute Environment
👩💻👨💻👩💻 Group A: Air-Sea Interaction
👩💻👨💻👩💻 Group B: Seasonal Forecasting
Research | Education | Industry
✅ Faster science, more discoveries
✅ Inherently reproducible
✅ Allows seamless global collaboration
✅ Unleashes creativity
✅ Cost effective
✅ Accessible to all
✅ Connects with industry
*Coined by Fernando Perez

https://openocean.cloud
Material adapted from Ryan Abernathey

What is Pangeo?
Community obsessed with efficient data processing.
  Founded in 2017. Scientists and software developers coming together. http://pangeo.io/
  Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
Interoperable Software
  Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr.
  Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
Data and Computing Infrastructure
  Deployment recipes for cloud and HPC.
  Open, public, cloud-based JupyterHubs and Binders for data-proximate computing.
  PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.

Inspiration: Stephan Hoyer, Jake Vanderplas (SciPy 2015)

Depth Coordinates Density Coordinates

CMIP6 Cloud Dataset
• Pangeo partnered with ESGF and Google Cloud to provide a new public dataset
• > 1 PB and counting
• Data stored in Zarr format
• Google provides free hosting in GCS
• Mirrored on AWS
https://pangeo-data.github.io/pangeo-cmip6-cloud/

Analyzing Petabyte scale climate data in your browser with Pangeo
Custom Analysis applied to each model and member

What we want to do
💡 Have an idea
Write some code
Rock some science

Time to science?

No! Time to homogenize data!
Different dimension names in the CMIP data.
  Not quite analysis-ready
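The naming problem can be illustrated with a minimal plain-Python sketch. The model names, the dimension names, and the rename table below are hypothetical placeholders, not taken from real CMIP6 output; packages such as cmip6_preprocessing aim to do this kind of homogenization on actual datasets.

```python
# Hypothetical rename table: maps model-specific dimension names
# to one shared convention. Real CMIP6 output uses other variants too.
RENAME = {
    "lon": "x", "longitude": "x", "nlon": "x", "i": "x",
    "lat": "y", "latitude": "y", "nlat": "y", "j": "y",
    "lev": "lev", "olevel": "lev", "deptht": "lev",
    "time": "time",
}


def homogenize_dims(dims):
    """Translate dimension names to the shared convention.

    Unknown names pass through unchanged so nothing is silently dropped.
    """
    return tuple(RENAME.get(name, name) for name in dims)


# Made-up examples of how three models might label the same 4D ocean field.
models = {
    "model_a": ("time", "lev", "lat", "lon"),
    "model_b": ("time", "olevel", "j", "i"),
    "model_c": ("time", "deptht", "nlat", "nlon"),
}

homogenized = {name: homogenize_dims(dims) for name, dims in models.items()}
for name, dims in homogenized.items():
    print(name, dims)
```

Once every model uses the same names, one analysis function can be applied to all models and members without per-model special cases.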

🤔 💡 No! Time to clean data!

🤔 💡 Competition for brain power

🤔 Isn’t there a better way?

There is!
Analysis Ready Data in the cloud + Crowd-Sourced Data Cleaning (peer-to-peer learning) = Less data wrangling, more 💡

What powers this workflow?
Pangeo cloud deployments + ARCO CMIP6 cache

ARCO Data: Analysis Ready, Cloud Optimized
What is “Analysis Ready”?
• Think in “Datasets” not “data files”
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged
How do data scientists spend their time? CrowdFlower Data Science Report (2016)
  Cleaning and organizing data: 60%
  Collecting data sets: 19%
  Mining data for patterns: 9%
  Refining algorithms: 4%
  Building training sets: 3%
  Other: 5%

Chunked appropriately for analysis
Rich metadata
Everything in one dataset object
https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

ARCO Data: Analysis Ready, Cloud Optimized
What is “Cloud Optimized”?
• Compatible with object storage (access via HTTP)
• Supports lazy access and intelligent subsetting
• Integrates with high-level analysis libraries and distributed frameworks
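To make "lazy access and intelligent subsetting" concrete, here is a toy sketch in plain Python. It is not the Zarr API: the class name, the in-memory dict standing in for object storage, and the chunk size are all assumptions chosen for illustration. The point is only that a request for a slice touches the chunks that overlap it and nothing else.

```python
import math


class ChunkedArray1D:
    """Toy stand-in for a chunked array in object storage (not the Zarr API)."""

    def __init__(self, values, chunk_size):
        self.length = len(values)
        self.chunk_size = chunk_size
        # One entry per chunk, like one object per chunk in a bucket.
        self.store = {
            i: values[i * chunk_size:(i + 1) * chunk_size]
            for i in range(math.ceil(self.length / chunk_size))
        }
        self.fetched = []  # records which chunk keys were actually read

    def _fetch(self, key):
        # In a real object store this would be a single HTTP GET.
        self.fetched.append(key)
        return self.store[key]

    def read(self, start, stop):
        """Return values[start:stop], reading only the overlapping chunks."""
        start = max(0, start)
        stop = min(stop, self.length)
        if start >= stop:
            return []
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        out = []
        for key in range(first, last + 1):
            chunk = self._fetch(key)
            offset = key * self.chunk_size
            lo = max(start - offset, 0)
            hi = min(stop - offset, len(chunk))
            out.extend(chunk[lo:hi])
        return out


arr = ChunkedArray1D(list(range(1000)), chunk_size=100)
subset = arr.read(250, 420)
print(len(subset), "values read from chunks", arr.fetched)
```

Only three of the ten chunks are read for this request, which is why chunking that matches the typical analysis pattern matters for performance.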

ARCO Data is Fast!
https://doi.org/10.1109/MCSE.2021.3059437

Making ARCO Data is Hard!
To produce useful ARCO data, you must have:
Domain Expertise: How to find, clean, and homogenize data
Tech Knowledge: How to efficiently produce cloud-optimized formats
Compute Resources: A place to stage and upload the ARCO data
Communication Skills: To explain to others how to use the data
Data Scientist 😩

Let’s democratize the production of ARCO data!
Domain Expertise: How to find, clean, and homogenize data
🤓 Data Scientist

Pangeo Forge Recipes: Open source Python package for describing and running data pipelines (“recipes”)
https://github.com/pangeo-forge/pangeo-forge-recipes
Pangeo Forge Cloud: Cloud platform for automatically executing recipes stored in GitHub repos.
https://pangeo-forge.org/
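The idea of a "recipe" (a description of a pipeline kept separate from its execution) can be sketched in plain Python. This is not the pangeo-forge-recipes API: `ToyRecipe`, `run`, the fake file names, and the unit conversion step are hypothetical and exist only to show the Files -> Dataset pattern.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRecipe:
    """Hypothetical recipe: describes inputs and steps, but does not run them."""
    inputs: list                                # identifiers of source files
    steps: list = field(default_factory=list)   # functions applied to each input


def run(recipe, open_input):
    """Execute a recipe: open each input, apply each step, combine the results."""
    parts = []
    for name in recipe.inputs:
        data = open_input(name)
        for step in recipe.steps:
            data = step(data)
        parts.append(data)
    # Concatenate the per-file results into one "dataset" (here: one flat list).
    return [value for part in parts for value in part]


# Placeholder "files": two monthly pieces of made-up temperatures in Kelvin.
fake_files = {"month_01": [273.15, 274.15], "month_02": [275.15]}


def kelvin_to_celsius(values):
    return [round(v - 273.15, 2) for v in values]


recipe = ToyRecipe(inputs=["month_01", "month_02"], steps=[kelvin_to_celsius])
dataset = run(recipe, fake_files.__getitem__)
print(dataset)
```

Because the recipe is just data plus functions, a domain expert can write it once and a separate service can execute it wherever compute is available.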

A concrete example: Our ARCO CMIP6 cache
Datasets are directly downloaded from ESGF nodes and minimally processed
Files -> Dataset
Carry complete metadata, handle_id etc. for provenance
We regularly filter our catalog for retracted datasets
WIP: Fully automate workflow with pangeo-forge
Balaji, V. et al. Requirements for a global data infrastructure in support of CMIP6. Geosci. Model Dev. 11, 3659–3680 (2018).
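Filtering a catalog for retracted datasets can be sketched as follows. The column names (`dataset_id`, `zstore`), the identifiers, and the bucket paths are made up for illustration, and the set of retracted identifiers is assumed to come from the data provider; the real catalog layout may differ.

```python
import csv
import io

# Made-up catalog rows; a real catalog has more columns and real identifiers.
CATALOG_CSV = """dataset_id,zstore
example.model_a.historical.r1i1p1f1.Omon.o2,gs://hypothetical-bucket/a
example.model_b.historical.r1i1p1f1.Omon.o2,gs://hypothetical-bucket/b
example.model_c.historical.r1i1p1f1.Omon.o2,gs://hypothetical-bucket/c
"""

# Identifiers flagged as retracted (assumed to be supplied by the provider).
retracted_ids = {"example.model_b.historical.r1i1p1f1.Omon.o2"}


def filter_retracted(rows, retracted):
    """Split catalog rows into those to keep and those that were retracted."""
    kept = [row for row in rows if row["dataset_id"] not in retracted]
    dropped = [row for row in rows if row["dataset_id"] in retracted]
    return kept, dropped


rows = list(csv.DictReader(io.StringIO(CATALOG_CSV)))
kept, dropped = filter_retracted(rows, retracted_ids)
print("kept:", [row["dataset_id"] for row in kept])
print("dropped:", [row["dataset_id"] for row in dropped])
```

Keeping the dropped rows around, rather than deleting them silently, makes it possible to tell users which datasets disappeared from the catalog and why.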

Questions? Suggestions? Join the Pangeo / ESGF Cloud Data Working Group biweekly calls!
https://pangeo-data.github.io/pangeo-cmip6-cloud/

I ❤ Feedback, questions, contributions.
http://pangeo.io
https://discourse.pangeo.io/
https://github.com/pangeo-data/
https://medium.com/pangeo
@pangeo_data @JuliusBusecke
jbusecke [email protected]
CMIP data in the cloud
Pangeo Cloud Deployments - https://pangeo.io/cloud.html

Oxygen Minimum Zones in the Tropical Pacific: Will they expand or shrink?
Ocean Science Meeting 2022 | Julius Busecke (@JuliusBusecke) | Laure Resplandy | Sam Ditkovsky | Jasmin John
