Slide 1

Slide 1 text

A cloud-native data library for ocean, weather, and climate science. 


Slide 2

Slide 2 text

Your Instructors
Ryan Abernathey, Columbia / LDEO
Rachel Wegener, U. Maryland
Charles Stern, Columbia / LDEO

Slide 3

Slide 3 text

Schedule
11:30-11:50  Intro to Pangeo Forge (Ryan)
11:50-12:15  Pangeo Forge Recipes Tutorial (Rachel)
12:15-12:30  Pangeo Forge Cloud Tutorial (Charles, video)
12:30-12:40  Break ➡ Breakouts
12:40-1:20  Work on recipes in breakouts
1:20-1:30  Reconvene / wrap up (Rachel & Ryan)

Slide 4

Slide 4 text

Learning Goals
Goal: Understand what Pangeo Forge Recipes does and how it relates to Pangeo Forge Cloud
Evidence: Explain when someone would need Pangeo Forge Recipes and when someone would need Pangeo Forge Cloud
Goal: Learn to use Pangeo Forge Recipes for simple (concat-only) recipes
Evidence: Write a simple (concat-only) recipe and run it in Binder
Goal: Learn to transition a recipe to Pangeo Forge Cloud
Evidence: Make a PR to staged recipes
Goal: Help other people to use Pangeo Forge on their own when the tutorial is done
Evidence: Generate ideas of how Pangeo Forge could benefit their work
Goal: Understand path to become a tool contributor
Evidence: Create actionable issues on pangeo-forge-recipes issue tracker

Slide 5

Slide 5 text

Cloud Native Data Analytics
1. Analysis-Ready, Cloud-Optimized Data
2. Data-Proximate Computing
3. Elastic Distributed Processing
Diagram: a Compute Environment containing multiple compute nodes

Slide 7

Slide 7 text

ARCO Data
Analysis Ready, Cloud Optimized
• Think in “Datasets” not “data files”
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged
Embedded report excerpt (cropped in the slide image, so some words and figures are cut off):
“As data science becomes more commonplace, and simultaneously a bit demystified, we expect this trend to continue as well. After all, last year’s respondents were just as excited about their work (about […] were “satisfied” or better).
How a Data Scientist Spends Their Day
Here’s where the popular view of data scientists diverges pretty significantly from reality. Gen[…] we think of data scientists building algorithms, exploring data, and doing predictive analysis. T[…] actually not what they spend most of their time doing, however.
As you can see from the chart above, […] out of every […] data scientists we surveyed actually spen[…] most time cleaning and organizing data.”

Slide 8

Slide 8 text

Example of ARCO Data
Chunked appropriately for analysis
Rich metadata
Everything in one dataset object
https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

Slide 9

Slide 9 text

ARCO Data
Analysis Ready, Cloud Optimized
What is “Cloud Optimized”?
• Compatible with object storage (access via HTTP)
• Supports lazy access and intelligent subsetting
• Integrates with high-level analysis libraries and distributed frameworks
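The "intelligent subsetting" point can be illustrated with a small sketch. It assumes a one-dimensional axis stored in fixed-size chunks and works out which chunks a reader would have to fetch for a given slice. The function name and the numbers are hypothetical, and the code uses only the Python standard library rather than any real cloud-optimized format.

```python
from typing import List


def chunks_for_slice(start: int, stop: int, chunk_size: int) -> List[int]:
    """Return the indices of the chunks overlapping the half-open range [start, stop)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if stop <= start:
        return []  # empty selection: nothing to fetch
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


# Hypothetical example: 1000 time steps stored in chunks of 100 steps each.
# Reading steps 250..419 touches only three of the ten chunks.
needed = chunks_for_slice(250, 420, chunk_size=100)
print(needed)  # [2, 3, 4]
```

Only the listed chunks need to be requested from object storage; the rest of the dataset is never transferred.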

Slide 11

Slide 11 text

ARCO Data is Fast!
https://doi.org/10.1109/MCSE.2021.3059437

Slide 12

Slide 12 text

CMIP6 Cloud Dataset
• Pangeo partnered with ESGF and Google Cloud to provide a new public dataset
• > 1 PB and counting
• Data stored in Zarr format
• Google provides free hosting in GCS
• Mirrored on AWS

Slide 13

Slide 13 text

Problem: Making ARCO Data is Hard!
To produce useful ARCO data, you must have:
Data Scientist 😩

Slide 14

Slide 14 text

Problem: Making ARCO Data is Hard!
To produce useful ARCO data, you must have:
Domain Expertise: How to find, clean, and homogenize data
Data Scientist 😩

Slide 15

Slide 15 text

Problem: Making ARCO Data is Hard!
To produce useful ARCO data, you must have:
Domain Expertise: How to find, clean, and homogenize data
Tech Knowledge: How to efficiently produce cloud-optimized formats
Data Scientist 😩

Slide 16

Slide 16 text

Problem: Making ARCO Data is Hard!
To produce useful ARCO data, you must have:
Domain Expertise: How to find, clean, and homogenize data
Tech Knowledge: How to efficiently produce cloud-optimized formats
Compute Resources: A place to stage and upload the ARCO data
Data Scientist 😩

Slide 17

Slide 17 text

Problem: Making ARCO Data is Hard!
To produce useful ARCO data, you must have:
Domain Expertise: How to find, clean, and homogenize data
Tech Knowledge: How to efficiently produce cloud-optimized formats
Compute Resources: A place to stage and upload the ARCO data
Communication Skills: To explain to others how to use the data
Data Scientist 😩

Slide 18

Slide 18 text

Pangeo Forge
Let’s democratize the production of ARCO data!
🤓 Data Scientist

Slide 19

Slide 19 text

Pangeo Forge
Let’s democratize the production of ARCO data!
Domain Expertise: How to find, clean, and homogenize data
🤓 Data Scientist

Slide 20

Slide 20 text

Inspiration: Conda Forge

Slide 21

Slide 21 text

Pangeo Forge Recipes: Open source Python package for describing and running data pipelines (“recipes”).
https://github.com/pangeo-forge/pangeo-forge-recipes
Pangeo Forge Cloud: Cloud platform for automatically executing recipes stored in GitHub repos.
https://pangeo-forge.org/

Slide 22

Slide 22 text

Pangeo Forge Recipes
FilePattern: Describes where to find the source files which are the inputs to the recipe
StorageConfig: Describes where to store the outputs of our recipe
Recipe: A complete, self-contained representation of the pipeline
Executor: Knows how to run the recipe.
https://pangeo-forge.readthedocs.io/

Slide 23

Slide 23 text

File Patterns
Describe where to find the source files which are the inputs to the recipe
ConcatDim (time)
MergeDim (variable): temperature, humidity
http://data-provider.org/data/humidity_03.txt
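The idea behind a file pattern can be sketched in plain Python: every combination of a variable name (the merge dimension) and a time index (the concat dimension) maps to one source URL. This sketch does not use the pangeo-forge-recipes API. `make_url` and `build_pattern` are hypothetical names, the three time steps are made up, and the URL layout is inferred from the single example URL on the slide.

```python
from itertools import product
from typing import Dict, List, Tuple

BASE_URL = "http://data-provider.org/data"  # host from the slide's example URL


def make_url(variable: str, time: int) -> str:
    """Build the URL of one source file (layout assumed from the slide example)."""
    return f"{BASE_URL}/{variable}_{time:02d}.txt"


def build_pattern(variables: List[str], times: List[int]) -> Dict[Tuple[str, int], str]:
    """Map every (variable, time) key to the URL of its source file."""
    return {(v, t): make_url(v, t) for v, t in product(variables, times)}


# Two variables to merge, three (hypothetical) time steps to concatenate.
pattern = build_pattern(["temperature", "humidity"], [1, 2, 3])
print(len(pattern))               # 6
print(pattern[("humidity", 3)])   # http://data-provider.org/data/humidity_03.txt
```

In the library itself the corresponding objects are `FilePattern`, `ConcatDim`, and `MergeDim`, as named on the slide; see the project documentation for their actual signatures.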

Slide 24

Slide 24 text

Recipes
A complete, self-contained representation of a data transformation pipeline
(Sensible defaults + lots of options to customize the transformation.)

Slide 25

Slide 25 text

Executors
Executors know how to run recipes.
Use whichever one makes sense for you!
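The separation between a recipe and its executor can be sketched with the standard library alone. Here a "recipe" is assumed to be just an ordered list of independent tasks, and two interchangeable runners execute the same recipe. All names below are hypothetical and are not the pangeo-forge-recipes executor API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical "recipe": an ordered list of zero-argument tasks.
Recipe = List[Callable[[], str]]


def make_recipe(inputs: List[str]) -> Recipe:
    # Bind each input as a default argument so every task keeps its own value.
    return [lambda name=name: f"processed {name}" for name in inputs]


def run_serial(recipe: Recipe) -> List[str]:
    """Run the tasks one after another in the current thread."""
    return [task() for task in recipe]


def run_threaded(recipe: Recipe, max_workers: int = 4) -> List[str]:
    """Run the tasks on a thread pool; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda task: task(), recipe))


recipe = make_recipe(["humidity_01.txt", "humidity_02.txt"])
print(run_serial(recipe) == run_threaded(recipe))  # True
```

Because the recipe only describes the work, the same object can be handed to either runner without modification.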

Slide 26

Slide 26 text

Pangeo Forge Cloud
Feedstock: Contains the code and metadata for one or more Recipes
Bakery: Runs the recipes in the cloud using elastic scaling clusters
Storage: GCS
https://pangeo-forge.org/

Slide 27

Slide 27 text

Vision: Collaborative Data Curation
Feedstock
🤓 Data User, 🤓 Data Producer, 🤓 Data Manager
“These data look weird…”
“…Oh the metadata need an update.”
“Ok I’ll make a PR to the recipe.”

Slide 28

Slide 28 text

Pangeo Forge Development
This is a 💯% open project!
https://github.com/pangeo-forge/roadmap

Slide 29

Slide 29 text

You are a Guinea Pig!
This is all brand new! You are the first people to try Pangeo Forge Cloud. It will almost certainly break. Your feedback will help us improve it. 🙏