Slide 1

Slide 1 text

Climate Science with Pangeo: Going both far AND fast as a community
Julius Busecke | LANL COSIM Climate and Ocean Group Webinar | 05/18/2022

Slide 2

Slide 2 text

Who am I? Physical Oceanographer Studies the role of ocean currents for the climate and ecosystems. Associate Research Scientist - Columbia University Core-Developer of xgcm Core-Developer of cmip6_preprocessing Maintainer of the Pangeo CMIP6 Cache Pangeo Fan, User, and Member Open Source/Open Science Advocate

Slide 3

Slide 3 text

Outline

Slide 4

Slide 4 text

Outline Why should we change the way we do science?

Slide 5

Slide 5 text

Outline Why should we change the way we do science? How can we do it?

Slide 6

Slide 6 text

Outline Why should we change the way we do science? How can we do it? It already works!

Slide 7

Slide 7 text

Source: https://www.un.org/en/global-issues/climate-change

Slide 8

Slide 8 text

Credit: NASA's Goddard Space Flight Center

Slide 9

Slide 9 text

Credit: NASA's Goddard Space Flight Center

Slide 10

Slide 10 text

Credit: NASA's Goddard Space Flight Center https://earthdata.nasa.gov/eosdis/cloud-evolution SWOT NISAR

Slide 11

Slide 11 text

Credit: NASA's Goddard Space Flight Center

Slide 12

Slide 12 text

Credit: NASA's Goddard Space Flight Center

Slide 13

Slide 13 text

Climate Data is not a niche product anymore

Slide 14

Slide 14 text

How do we get to work with all this data?

Slide 15

Slide 15 text

How do we get to work with all this data? FTP / OPeNDAP / etc. Download Files

Slide 16

Slide 16 text

MB 😀 FTP / OPeNDAP / etc.

Slide 17

Slide 17 text

GB 😐 FTP / OPeNDAP / etc.

Slide 18

Slide 18 text

TB 😖 FTP / OPeNDAP / etc.

Slide 19

Slide 19 text

PB 😱 FTP / OPeNDAP / etc.

Slide 20

Slide 20 text

So what should people do?

Slide 21

Slide 21 text

Get Frustrated?

Slide 22

Slide 22 text

Get Frustrated?

Slide 23

Slide 23 text

Work with only a few models/ members/variables

Slide 24

Slide 24 text

Work with only a few models/ members/variables

Slide 25

Slide 25 text

Work with only a few models/ members/variables

Slide 26

Slide 26 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons *Coined by Chelle Gentemann

Slide 27

Slide 27 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons *Coined by Chelle Gentemann

Slide 28

Slide 28 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons Data *Coined by Chelle Gentemann

Slide 29

Slide 29 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons Data *Coined by Chelle Gentemann

Slide 30

Slide 30 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons Data ❌ Results not reproducible outside fortress ❌ Barrier to collaboration ❌ Inefficient / duplicative ❌ Can’t scale to future data needs ❌ Limits inclusion and knowledge transfer *Coined by Chelle Gentemann

Slide 31

Slide 31 text

Bring the compute to the data! But how?

Slide 32

Slide 32 text

Use a “Platform”

Slide 33

Slide 33 text

The Trouble with “Platforms”

Slide 34

Slide 34 text

The Trouble with “Platforms” Scientists’ creativity often exceeds pre-baked capabilities. 
 Desire to go under the hood

Slide 35

Slide 35 text

The Trouble with “Platforms” Scientists’ creativity often exceeds pre-baked capabilities. 
 Desire to go under the hood What if you want to access data that isn’t included? 
 Data catalog is determined by provider, not users

Slide 36

Slide 36 text

The Trouble with “Platforms” Scientists’ creativity often exceeds pre-baked capabilities. 
 Desire to go under the hood What if you want to access data that isn’t included? 
 Data catalog is determined by provider, not users Platforms are “single instance”: 
 Fear of lock-in, possibility platform will disappear

Slide 37

Slide 37 text

OPEN Cloud Architecture Data Provider’s $ Data Consumer’s $

Slide 38

Slide 38 text

Interactive Computing Data Provider’s $ Data Consumer’s $ OPEN Cloud Architecture

Slide 39

Slide 39 text

Data Provider’s $ Data Consumer’s $ Interactive Computing Parallel Computing OPEN Cloud Architecture

Slide 40

Slide 40 text

Data Provider’s $ Data Consumer’s $ Interactive Computing Parallel Computing Analysis Ready Data
 Cloud Optimized Formats OPEN Cloud Architecture

Slide 41

Slide 41 text

Data Provider’s $ Data Consumer’s $ Interactive Computing Parallel Computing Analysis Ready Data
 Cloud Optimized Formats OPEN Cloud Architecture

Slide 42

Slide 42 text

Open Ocean Cloud Will be a “Data Watering Hole*” *Coined by Fernando Perez

Slide 43

Slide 43 text

Open Ocean Cloud Will be a “Data Watering Hole*” Data Library Compute Environment *Coined by Fernando Perez

Slide 44

Slide 44 text

Open Ocean Cloud Will be a “Data Watering Hole*” Data Library Compute Environment
👩💻👨💻👩💻 Group A: Air-Sea Interaction
👩💻👨💻👩💻 Group B: Seasonal Forecasting
Research Education Industry
*Coined by Fernando Perez

Slide 45

Slide 45 text

Open Ocean Cloud Will be a “Data Watering Hole*” Data Library Compute Environment
👩💻👨💻👩💻 Group A: Air-Sea Interaction
👩💻👨💻👩💻 Group B: Seasonal Forecasting
Research Education Industry
✅ Faster science, more discoveries ✅ Inherently reproducible ✅ Allows seamless global collaboration ✅ Unleashes creativity ✅ Cost effective ✅ Accessible to all ✅ Connects with industry
*Coined by Fernando Perez

Slide 46

Slide 46 text

https://openocean.cloud Material adopted from Ryan Abernathey

Slide 47

Slide 47 text

What is Pangeo?

Slide 48

Slide 48 text

What is Pangeo?
Community obsessed with efficient data processing: Founded in 2017. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
Interoperable Software: Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
Data and Computing Infrastructure: Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.

Slide 49

Slide 49 text

What is Pangeo?
Community obsessed with efficient data processing: Founded in 2017. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
Interoperable Software: Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
Data and Computing Infrastructure: Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.

Slide 50

Slide 50 text

😍

Slide 51

Slide 51 text

Inspiration: Stephan Hoyer, Jake Vanderplas (SciPy 2015) SciPy

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

Depth Coordinates Density Coordinates

Slide 67

Slide 67 text

Depth Coordinates Density Coordinates

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

CMIP6 Cloud Dataset • Pangeo partnered with ESGF and Google Cloud to provide a new public dataset • > 1 PB and counting • Data stored in Zarr format • Google provides free hosting in GCS • Mirrored on AWS https://pangeo-data.github.io/pangeo-cmip6-cloud/
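As a rough illustration of how a cataloged cloud dataset can be searched, the sketch below filters a tiny, made-up catalog table by variable and experiment using only the Python standard library. The column names (source_id, experiment_id, member_id, table_id, variable_id, zstore) are assumed to resemble those of the public catalog, and the rows and bucket paths are invented for illustration; the documentation linked above describes the real catalog and access tools.

```python
import csv
import io

# Hypothetical three-row excerpt. Column names are an assumption about the
# catalog layout; the model names and store paths are made up.
CATALOG_CSV = """source_id,experiment_id,member_id,table_id,variable_id,zstore
MODEL-A,historical,r1i1p1f1,Omon,tos,gs://example-bucket/model-a/historical/tos
MODEL-A,ssp585,r1i1p1f1,Omon,tos,gs://example-bucket/model-a/ssp585/tos
MODEL-B,historical,r1i1p1f1,Omon,o2,gs://example-bucket/model-b/historical/o2
"""


def search_catalog(csv_text, **criteria):
    """Return catalog rows whose columns match every key=value criterion."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in rows
        if all(row[key] == value for key, value in criteria.items())
    ]


# Find every store holding sea surface temperature from the historical runs.
matches = search_catalog(CATALOG_CSV, variable_id="tos", experiment_id="historical")
stores = [row["zstore"] for row in matches]
print(stores)
```

Each matching row points at one Zarr store, so the search result is a list of datasets rather than a list of files to download.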

Slide 70

Slide 70 text

Analyzing Petabyte scale climate data in your browser with Pangeo Custom Analysis applied to each model and member

Slide 71

Slide 71 text

What we want to do

Slide 72

Slide 72 text

What we want to do 💡 Have an idea

Slide 73

Slide 73 text

What we want to do Write some code

Slide 74

Slide 74 text

What we want to do Rock some science

Slide 75

Slide 75 text

What we want to do Rock some science

Slide 76

Slide 76 text

Time to science?

Slide 77

Slide 77 text

Different dimension names in the CMIP data. Not quite analysis-ready. No! Time to homogenize data!
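The naming problem can be sketched with plain Python. The model names, dimension names, and rename table below are hypothetical and chosen only to illustrate the idea; on real data this kind of renaming would be applied to xarray datasets, for example with a package such as cmip6_preprocessing mentioned at the start of this deck.

```python
# Hypothetical rename table: model-specific dimension names on the left,
# one shared vocabulary on the right.
RENAME = {
    "nlon": "x", "i": "x",
    "nlat": "y", "j": "y",
    "olevel": "lev", "deptht": "lev",
}


def homogenize_dims(dims):
    """Map a tuple of dimension names onto the shared vocabulary.

    Names that are not in the table are passed through unchanged.
    """
    return tuple(RENAME.get(name, name) for name in dims)


# Three made-up models that store the same variable under different names.
model_dims = {
    "MODEL-A": ("time", "lev", "nlat", "nlon"),
    "MODEL-B": ("time", "olevel", "j", "i"),
    "MODEL-C": ("time", "deptht", "y", "x"),
}

cleaned = {model: homogenize_dims(dims) for model, dims in model_dims.items()}
for model, dims in cleaned.items():
    print(model, dims)
```

After renaming, every model exposes the same dimension names, so one analysis function can be applied to all of them without per-model special cases.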

Slide 78

Slide 78 text

🤔 💡 No! Time to clean data!

Slide 79

Slide 79 text

🤔 💡 Competition for brain power

Slide 80

Slide 80 text

🤔 Isn’t there a better way?

Slide 81

Slide 81 text

There is! + Analysis Ready Data in the cloud Crowd-Sourced Data Cleaning 
 (peer-to-peer learning)

Slide 82

Slide 82 text

There is! + Analysis Ready Data in the cloud Crowd-Sourced Data Cleaning 
 (peer-to-peer learning) Less data wrangling, more 💡 =

Slide 83

Slide 83 text

Demo

Slide 84

Slide 84 text

What powers this workflow?

Slide 85

Slide 85 text

What powers this workflow? Pangeo cloud deployments + ARCO CMIP6 cache

Slide 86

Slide 86 text

ARCO Data: Analysis Ready, Cloud Optimized
• Think in “Datasets” not “data files” • No need for tedious homogenizing / cleaning steps • Curated and cataloged
[Figure: excerpt from a data science survey report, section “How a Data Scientist Spends Their Day”. The legible text says the surveyed data scientists spend most of their time cleaning and organizing data.]

Slide 87

Slide 87 text

Chunked appropriately for analysis Rich metadata Everything in one dataset object https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

Slide 88

Slide 88 text

ARCO Data: Analysis Ready, Cloud Optimized
What is “Cloud Optimized”?
• Compatible with object storage (access via HTTP) • Supports lazy access and intelligent subsetting • Integrates with high-level analysis libraries and distributed frameworks
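One way to picture “lazy access and intelligent subsetting”: when an array is stored in chunks, a reader only needs to fetch the chunks that overlap the requested slice. The sketch below is plain Python, not the API of any particular library, and the array length and chunk size are hypothetical.

```python
def chunks_for_slice(start, stop, chunk_size):
    """Return indices of the chunks overlapping the half-open range [start, stop)."""
    if stop <= start:
        return []
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


# Hypothetical monthly time axis: 1980 steps stored in chunks of 120 steps.
n_time, chunk_size = 1980, 120
total_chunks = -(-n_time // chunk_size)  # ceiling division

# Reading steps 600..719 touches a single chunk out of all of them.
needed = chunks_for_slice(600, 720, chunk_size)
print(f"fetch {len(needed)} of {total_chunks} chunks: {needed}")
```

With object storage, each chunk is an independent object retrievable over HTTP, so only the needed chunks have to travel over the network.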

Slide 89

Slide 89 text

ARCO Data is Fast! https://doi.org/10.1109/MCSE.2021.3059437

Slide 90

Slide 90 text

Making ARCO Data is Hard!
To produce useful ARCO data, you must have:
Domain Expertise: How to find, clean, and homogenize data
Tech Knowledge: How to efficiently produce cloud-optimized formats
Compute Resources: A place to stage and upload the ARCO data
Communication Skills: To explain to others how to use the data
Data Scientist 😩

Slide 91

Slide 91 text

Let’s democratize the production of ARCO data!
Domain Expertise: How to find, clean, and homogenize data
🤓 Data Scientist

Slide 92

Slide 92 text

Pangeo Forge Recipes: Open source Python package for describing and running data pipelines (“recipes”). https://github.com/pangeo-forge/pangeo-forge-recipes
Pangeo Forge Cloud: Cloud platform for automatically executing recipes stored in GitHub repos. https://pangeo-forge.org/

Slide 93

Slide 93 text

A concrete example: Our ARCO CMIP6 cache
• Datasets are directly downloaded from ESGF nodes and minimally processed (Files -> Dataset)
• Datasets carry complete metadata, handle_id, etc. for provenance
• We regularly filter our catalog for retracted datasets
• WIP: Fully automate workflow with pangeo-forge
Balaji, V. et al. Requirements for a global data infrastructure in support of CMIP6. Geosci. Model Dev. 11, 3659–3680 (2018).
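Filtering a catalog for retracted datasets might look roughly like the sketch below. This is an assumption about the general shape of such a step, not the actual cache maintenance code: the `dataset_id` field, the example identifiers, and the handle values are all hypothetical.

```python
def drop_retracted(catalog, retracted_ids):
    """Return the catalog entries whose id is not in the retracted set."""
    retracted = set(retracted_ids)
    return [entry for entry in catalog if entry["dataset_id"] not in retracted]


# Made-up catalog entries; each keeps a handle so provenance can be traced.
catalog = [
    {"dataset_id": "model-a.historical.tos", "handle_id": "hdl:example/1"},
    {"dataset_id": "model-a.ssp585.tos", "handle_id": "hdl:example/2"},
    {"dataset_id": "model-b.historical.o2", "handle_id": "hdl:example/3"},
]

# Hypothetical list of ids reported as retracted by the data provider.
retracted_ids = ["model-a.ssp585.tos"]

clean_catalog = drop_retracted(catalog, retracted_ids)
kept = [entry["dataset_id"] for entry in clean_catalog]
print(kept)
```

Running a step like this on a schedule keeps users from building analyses on datasets that the modeling centers have withdrawn.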

Slide 94

Slide 94 text

A concrete example: Our ARCO CMIP6 cache Questions? Suggestions? Join the Pangeo / ESGF Cloud Data Working Group biweekly calls! https://pangeo-data.github.io/pangeo-cmip6-cloud/

Slide 95

Slide 95 text

http://pangeo.io https://discourse.pangeo.io/ https://github.com/pangeo-data/ https://medium.com/pangeo @pangeo_data @JuliusBusecke jbusecke [email protected] I ❤ Feedback, questions, contributions. CMIP data in the cloud Pangeo Cloud Deployments - https://pangeo.io/cloud.html

Slide 96

Slide 96 text

No content

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

Ocean Science Meeting 2022 | Julius Busecke (@JuliusBusecke) | Laure Resplandy | Sam Ditkovsky | Jasmin John Oxygen Minimum Zones in the Tropical Pacific Will they expand or shrink?

Slide 99

Slide 99 text

Ocean Science Meeting 2022 | Julius Busecke (@JuliusBusecke) | Laure Resplandy | Sam Ditkovsky | Jasmin John Oxygen Minimum Zones in the Tropical Pacific Will they expand or shrink?