Slide 1

Slide 1 text

Open Science with Pangeo Julius Busecke | UCLA IDRE Open Science Workshop Mar 24 2022 | Contains slides adapted from Ryan Abernathey From community to climate science in the cloud

Slide 2

Slide 2 text

Who am I? Physical Oceanographer Studies the role of ocean currents for the climate and ecosystems. Associate Research Scientist - Columbia University Core-Developer of xgcm Core-Developer of cmip6_preprocessing Pangeo Fan, User, and Member Open Source/Open Science Advocate

Slide 3

Slide 3 text

Source: https://www.un.org/en/global-issues/climate-change

Slide 4

Slide 4 text

Credit: NASA's Goddard Space Flight Center

Slide 5

Slide 5 text

Credit: NASA's Goddard Space Flight Center

Slide 6

Slide 6 text

Credit: NASA's Goddard Space Flight Center https://earthdata.nasa.gov/eosdis/cloud-evolution SWOT NISAR

Slide 7

Slide 7 text

Credit: NASA's Goddard Space Flight Center

Slide 8

Slide 8 text

Credit: NASA's Goddard Space Flight Center

Slide 9

Slide 9 text

How do we get to work with all this data?

Slide 10

Slide 10 text

How do we get to work with all this data? FTP / OPeNDAP / etc. Download Files

Slide 11

Slide 11 text

MB 😀 FTP / OPeNDAP / etc.

Slide 12

Slide 12 text

GB 😐 FTP / OPeNDAP / etc.

Slide 13

Slide 13 text

TB 😖 FTP / OPeNDAP / etc.

Slide 14

Slide 14 text

PB 😱 FTP / OPeNDAP / etc.

Slide 15

Slide 15 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons *Coined by Chelle Gentemann

Slide 16

Slide 16 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons *Coined by Chelle Gentemann

Slide 17

Slide 17 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons Data *Coined by Chelle Gentemann

Slide 18

Slide 18 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons Data *Coined by Chelle Gentemann

Slide 19

Slide 19 text

Privileged Institutions create “Data Fortresses*” Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons Data ❌ Results not reproducible outside fortress ❌ Barrier to collaboration ❌ Inefficient / duplicative ❌ Can’t scale to future data needs ❌ Limits inclusion and knowledge transfer *Coined by Chelle Gentemann

Slide 20

Slide 20 text

Bring the compute to the data! But how?

Slide 21

Slide 21 text

Use a “Platform”

Slide 22

Slide 22 text

The Trouble with “Platforms”

Slide 23

Slide 23 text

The Trouble with “Platforms” Scientists’ creativity often exceeds pre-baked capabilities. 
 Desire to go under the hood

Slide 24

Slide 24 text

The Trouble with “Platforms” Scientists’ creativity often exceeds pre-baked capabilities. 
 Desire to go under the hood What if you want to access data that isn’t included? 
 Data catalog is determined by provider, not users

Slide 25

Slide 25 text

The Trouble with “Platforms” Scientists’ creativity often exceeds pre-baked capabilities. 
 Desire to go under the hood What if you want to access data that isn’t included? 
 Data catalog is determined by provider, not users Platforms are “single instance”: 
 Fear of lock-in, possibility platform will disappear

Slide 26

Slide 26 text

OPEN Cloud Architecture Data Provider’s $ Data Consumer’s $

Slide 27

Slide 27 text

Interactive Computing Data Provider’s $ Data Consumer’s $ OPEN Cloud Architecture

Slide 28

Slide 28 text

Data Provider’s $ Data Consumer’s $ Interactive Computing Parallel Computing OPEN Cloud Architecture

Slide 29

Slide 29 text

Data Provider’s $ Data Consumer’s $ Interactive Computing Parallel Computing Analysis Ready Data
 Cloud Optimized Formats OPEN Cloud Architecture

Slide 30

Slide 30 text

Data Provider’s $ Data Consumer’s $ Interactive Computing Parallel Computing Analysis Ready Data
 Cloud Optimized Formats OPEN Cloud Architecture

Slide 31

Slide 31 text

Open Ocean Cloud Will be a “Data Watering Hole*” *Coined by Fernando Perez

Slide 32

Slide 32 text

Open Ocean Cloud Will be a “Data Watering Hole*” Data Library Compute Environment *Coined by Fernando Perez

Slide 33

Slide 33 text

👩💻👨💻👩💻 Group A: Air-Sea Interaction 👩💻👨💻👩💻 Group B: Seasonal Forecasting Research Education Industry *Coined by Fernando Perez Open Ocean Cloud Will be a “Data Watering Hole*” Data Library Compute Environment

Slide 34

Slide 34 text

👩💻👨💻👩💻 Group A: Air-Sea Interaction 👩💻👨💻👩💻 Group B: Seasonal Forecasting Research Education Industry *Coined by Fernando Perez Open Ocean Cloud Will be a “Data Watering Hole*” Data Library Compute Environment ✅ Faster science, more discoveries ✅ Inherently reproducible ✅ Allows seamless global collaboration ✅ Unleashes creativity ✅ Cost effective ✅ Accessible to all ✅ Connects with industry

Slide 35

Slide 35 text

https://openocean.cloud Material adapted from Ryan Abernathey

Slide 36

Slide 36 text

What is Pangeo?

Slide 37

Slide 37 text

What is Pangeo?
Community obsessed with efficient data processing. Founded in 2017. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
Interoperable Software Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
Data and Computing Infrastructure Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for Data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.

Slide 38

Slide 38 text

What is Pangeo?
Community obsessed with efficient data processing. Founded in 2017. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
Interoperable Software Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
Data and Computing Infrastructure Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for Data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.

Slide 39

Slide 39 text

😍

Slide 40

Slide 40 text

CMIP6 Cloud Dataset • Pangeo partnered with ESGF and Google Cloud to provide a new public dataset • > 1 PB and counting • Data stored in Zarr format • Google provides free hosting in GCS • Mirrored on AWS https://pangeo-data.github.io/pangeo-cmip6-cloud/
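
A minimal sketch of how a cloud dataset like this can be searched, assuming the catalog is a CSV table with one row per Zarr store. The column names follow my understanding of the public Pangeo CMIP6 catalog; the rows, model names, and `gs://example-bucket/...` paths are invented for illustration and are not real stores.

```python
import csv
import io

# Illustrative catalog rows (made up for this sketch, not real stores).
SAMPLE_CATALOG = """source_id,experiment_id,member_id,table_id,variable_id,zstore
MODEL-A,historical,r1i1p1f1,Omon,tos,gs://example-bucket/MODEL-A/historical/r1i1p1f1/tos/
MODEL-A,ssp585,r1i1p1f1,Omon,tos,gs://example-bucket/MODEL-A/ssp585/r1i1p1f1/tos/
MODEL-B,historical,r1i1p1f1,Omon,tos,gs://example-bucket/MODEL-B/historical/r1i1p1f1/tos/
MODEL-B,historical,r2i1p1f1,Omon,so,gs://example-bucket/MODEL-B/historical/r2i1p1f1/so/
"""


def search_catalog(catalog_text, **facets):
    """Return the catalog rows whose columns match every given facet value."""
    reader = csv.DictReader(io.StringIO(catalog_text))
    return [
        row for row in reader
        if all(row[key] == value for key, value in facets.items())
    ]


# Find every store holding sea surface temperature from the historical runs.
matches = search_catalog(SAMPLE_CATALOG, variable_id="tos", experiment_id="historical")
stores = [row["zstore"] for row in matches]
for row in matches:
    print(row["source_id"], row["member_id"], row["zstore"])
```

Each selected store path would then be opened lazily with a Zarr-aware library such as Xarray; that step needs network access and is left out here.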

Slide 41

Slide 41 text

Analyzing Petabyte scale climate data in your browser with Pangeo Custom Analysis applied to each model and member

Slide 42

Slide 42 text

What we want to do

Slide 43

Slide 43 text

What we want to do 💡 Have an idea

Slide 44

Slide 44 text

What we want to do Write some code

Slide 45

Slide 45 text

What we want to do Rock some science

Slide 46

Slide 46 text

What we want to do Rock some science

Slide 47

Slide 47 text

Time to science?

Slide 48

Slide 48 text

Different dimension names in the CMIP data.
Not quite analysis-ready No! Time to homogenize data!
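
To make the dimension-name problem concrete, here is a minimal sketch in plain Python. The alias table is a hypothetical example of the naming variety between models, and the function names are made up; this is not how cmip6_preprocessing is implemented, only an illustration of the kind of homogenizing step it automates.

```python
# Hypothetical aliases: different models may name the same axis differently.
DIM_ALIASES = {
    "x": ["x", "lon", "longitude", "i", "nlon"],
    "y": ["y", "lat", "latitude", "j", "nlat"],
    "lev": ["lev", "depth", "olevel", "deptht"],
    "time": ["time", "t"],
}


def build_rename_map(dims):
    """Map each dimension name in `dims` to its canonical name, if it has one."""
    rename = {}
    for dim in dims:
        for canonical, aliases in DIM_ALIASES.items():
            if dim.lower() in aliases and dim != canonical:
                rename[dim] = canonical
    return rename


def homogenize(dims):
    """Return the dimension names with all known aliases replaced."""
    rename = build_rename_map(dims)
    return tuple(rename.get(dim, dim) for dim in dims)


# Two imaginary models describing the same 4D ocean field.
model_a_dims = ("time", "olevel", "nlat", "nlon")
model_b_dims = ("time", "lev", "j", "i")
print(homogenize(model_a_dims))
print(homogenize(model_b_dims))
```

With Xarray, a mapping like the one returned by `build_rename_map` could be passed to `Dataset.rename`, so that the same analysis code runs on every model.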

Slide 49

Slide 49 text

🤔 💡 No! Time to clean data!

Slide 50

Slide 50 text

🤔 💡 Competition for brain power

Slide 51

Slide 51 text

🤔 Isn’t there a better way?

Slide 52

Slide 52 text

There is! + Analysis Ready Data in the cloud Crowd-Sourced Data Cleaning 
 (peer-to-peer learning)

Slide 53

Slide 53 text

There is! + Analysis Ready Data in the cloud Crowd-Sourced Data Cleaning 
 (peer-to-peer learning) Less data wrangling, more 💡 =

Slide 54

Slide 54 text

Demo

Slide 55

Slide 55 text

ARCO Data: Analysis Ready, Cloud Optimized • Think in “Datasets” not “data files” • No need for tedious homogenizing / cleaning steps • Curated and cataloged [Figure: excerpt from a data science survey report, section “How a Data Scientist Spends Their Day”, stating that the surveyed data scientists spend most of their time cleaning and organizing data]

Slide 56

Slide 56 text

Example of ARCO Data Chunked appropriately for analysis Rich metadata Everything in one dataset object https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

Slide 57

Slide 57 text

ARCO Data: Analysis Ready, Cloud Optimized
What is “Cloud Optimized”?
• Compatible with object storage (access via HTTP)
• Supports lazy access and intelligent subsetting
• Integrates with high-level analysis libraries and distributed frameworks
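
A rough sketch of why chunked storage enables lazy access and subsetting. It assumes a one-dimensional array split into equal-size chunks, each stored as its own object that can be fetched over HTTP. The function names are hypothetical; formats such as Zarr do this bookkeeping internally.

```python
def chunks_for_slice(start, stop, chunk_size):
    """Return the indices of the chunks overlapping the half-open range [start, stop)."""
    if stop <= start:
        return []
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


def fraction_read(start, stop, chunk_size, array_length):
    """Fraction of all chunks that must be fetched to serve the requested range."""
    n_chunks_total = -(-array_length // chunk_size)  # ceiling division
    return len(chunks_for_slice(start, stop, chunk_size)) / n_chunks_total


# Assumed example: 1200 monthly time steps stored in chunks of 12 (one year each).
# Reading the first year touches one chunk out of 100.
needed = chunks_for_slice(0, 12, chunk_size=12)
share = fraction_read(0, 12, chunk_size=12, array_length=1200)
print(needed, share)
```

Only the chunks returned by `chunks_for_slice` need to travel over the network, which is what makes subsetting a very large dataset from a laptop or a cloud notebook practical.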

Slide 58

Slide 58 text

Problem: Making ARCO Data is Hard!
To produce useful ARCO data, you must have:
Domain Expertise: How to find, clean, and homogenize data
Tech Knowledge: How to efficiently produce cloud-optimized formats
Compute Resources: A place to stage and upload the ARCO data
Communication Skills: To explain to others how to use the data
Data Scientist 😩

Slide 59

Slide 59 text

Pangeo Forge Let’s democratize the production of ARCO data! Domain Expertise: How to find, clean, and homogenize data 🤓 Data Scientist

Slide 60

Slide 60 text

Pangeo Forge Recipes: Open source Python package for describing and running data pipelines (“recipes”) https://github.com/pangeo-forge/pangeo-forge-recipes
Pangeo Forge Cloud: Cloud platform for automatically executing recipes stored in GitHub repos. https://pangeo-forge.org/

Slide 61

Slide 61 text

Learn More http://pangeo.io https://discourse.pangeo.io/ https://github.com/pangeo-data/ https://medium.com/pangeo @pangeo_data @JuliusBusecke jbusecke [email protected]

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

ARCO Data is Fast! https://doi.org/10.1109/MCSE.2021.3059437