Intro to Pangeo Forge (Ryan)
11:50-12:15: Pangeo Forge Recipes Tutorial (Rachel)
12:15-12:30: Pangeo Forge Cloud Tutorial (Charles, video)
12:30-12:40: Break ➡ Breakouts
12:40-1:20: Work on recipes in breakouts
1:20-1:30: Reconvene / wrap up (Rachel & Ryan)
Goals
• Goal: Understand what Pangeo Forge Recipes does and how it relates to Pangeo Forge Cloud. Evidence: Explain when someone would need Pangeo Forge Recipes and when someone would need Pangeo Forge Cloud.
• Goal: Learn to use Pangeo Forge Recipes for simple (concat-only) recipes. Evidence: Write a simple (concat-only) recipe and run it in Binder.
• Goal: Learn to transition a recipe to Pangeo Forge Cloud. Evidence: Make a PR to staged recipes.
• Goal: Help other people to use Pangeo Forge on their own when the tutorial is done. Evidence: Generate ideas of how Pangeo Forge could benefit their work.
• Goal: Understand the path to becoming a tool contributor. Evidence: Create actionable issues on the pangeo-forge-recipes issue tracker.
…ve Data Analytics
1. Analysis-Ready, Cloud-Optimized Data
2. Data-Proximate Computing
3. Elastic Distributed Processing
[Diagram: a compute environment made up of many compute nodes]
ARCO Data: Analysis Ready, Cloud Optimized
What is “Analysis Ready”?
• …need for tedious homogenizing / cleaning steps
• Curated and cataloged
[Figure: “How do data scientists spend their time?” Cleaning and organizing data: 60%; Collecting data sets: 19%; Mining data for patterns: 9%; Other: 5%; Refining algorithms: 4%; Building training sets: 3%. Source: CrowdFlower Data Science Report (2016)]
ARCO Data
• Chunked appropriately for analysis
• Rich metadata
• Everything in one dataset object
https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
ARCO Data: Analysis Ready, Cloud Optimized
What is “Cloud Optimized”?
• Supports lazy access and intelligent subsetting
• Integrates with high-level analysis libraries and distributed frameworks
CMIP6 Cloud Dataset
• …a new public dataset
• > 1 PB and counting
• Data stored in Zarr format
• Google provides free hosting in GCS
• Mirrored on AWS
ARCO Data is Hard!
To produce useful ARCO data, you must have:
• Domain Expertise: How to find, clean, and homogenize data
• Tech Knowledge: How to efficiently produce cloud-optimized formats
• Compute Resources: A place to stage and upload the ARCO data
• Communication Skills: To explain to others how to use the data
Data Scientist 😩
• …package for describing and running data pipelines (“recipes”): https://github.com/pangeo-forge/pangeo-forge-recipes
• Cloud platform for automatically executing recipes stored in GitHub repos: https://pangeo-forge.org/
Recipes
• FilePattern: Describes where to find the source files which are the inputs to the recipe
• StorageConfig: Describes where to store the outputs of our recipe
• Recipe: A complete, self-contained representation of the pipeline
• Executor: Knows how to run the recipe
https://pangeo-forge.readthedocs.io/
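How the four pieces fit together can be sketched with a toy pipeline in plain Python. None of the names below are the pangeo-forge-recipes API: `ToyRecipe`, `run_recipe`, the in-memory `FAKE_REMOTE` files, and the dict used as a storage target are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field

# Stand-in for files on a remote server (hypothetical URLs and contents).
FAKE_REMOTE = {
    "http://example.org/data_01.txt": [1.0, 2.0],
    "http://example.org/data_02.txt": [3.0, 4.0],
}


@dataclass
class ToyRecipe:
    """A self-contained description of the pipeline: inputs plus output location."""
    file_pattern: list                            # where to find the source files
    storage: dict = field(default_factory=dict)   # where to store the output
    target_name: str = "combined"


def run_recipe(recipe: ToyRecipe) -> None:
    """Toy executor: read each input in order and concatenate into the target."""
    combined = []
    for url in recipe.file_pattern:
        combined.extend(FAKE_REMOTE[url])   # "download" one source file
    recipe.storage[recipe.target_name] = combined


recipe = ToyRecipe(file_pattern=sorted(FAKE_REMOTE))
run_recipe(recipe)
print(recipe.storage)
```

In this sketch the recipe only describes the work, and the executor is the only part that performs it, which mirrors the division of roles listed on the slide.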
Describe where to find the source files which are the inputs to the recipe
[Diagram: a grid of source files, with ConcatDim (time) along one axis and MergeDim (variable: temperature, humidity) along the other. Example file: http://data-provider.org/data/humidity_03.txt]
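The grid in this diagram can be sketched in plain Python: every source file is identified by one key along the concatenation dimension (time) and one key along the merge dimension (variable). This is only an illustration of the idea using the standard library; it does not call pangeo-forge-recipes. The function name `make_url`, the number of time steps (1 to 4), and the URL template are assumptions extrapolated from the single example URL on the slide.

```python
from itertools import product

# Assumed keys for each dimension. The slide names the variables; the
# number of time steps is made up for this sketch.
merge_keys = ["temperature", "humidity"]   # MergeDim (variable)
concat_keys = [1, 2, 3, 4]                 # ConcatDim (time)


def make_url(variable: str, time: int) -> str:
    """Build a source URL; the template is inferred from the slide's example URL."""
    return f"http://data-provider.org/data/{variable}_{time:02d}.txt"


# A file pattern maps every (variable, time) key to the URL of one source file.
file_pattern = {
    (variable, time): make_url(variable, time)
    for variable, time in product(merge_keys, concat_keys)
}

for key, url in file_pattern.items():
    print(key, url)
```

Files that share a variable are the ones to concatenate along time, and files that share a time step are the ones to merge across variables.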
Cloud
https://pangeo-forge.org/
• Feedstock: Contains the code and metadata for one or more Recipes
• Bakery: Runs the recipes in the cloud using elastic scaling clusters
• Storage: GCS
Collaborative Data Curation
Feedstock
🤓 Data User, 🤓 Data Producer, 🤓 Data Manager
“These data look weird…”
“…Oh the metadata need an update.”
“Ok I’ll make a PR to the recipe.”
Guinea Pig!
This is all brand new! You are the first people to try Pangeo Cloud. It will almost certainly break. Your feedback will help us improve it. 🙏