Pangeo Forge Tutorial Introduction

Ryan Abernathey

March 07, 2022

Transcript

  1. A cloud-native data library for ocean, weather, and climate science.

  2. Your Instructors

    • Ryan Abernathey, Columbia / LDEO
    • Rachel Wegener, U. Maryland
    • Charles Stern, Columbia / LDEO
  3. Schedule

    • 11:30-11:50 Intro to Pangeo Forge (Ryan)
    • 11:50-12:15 Pangeo Forge Recipes Tutorial (Rachel)
    • 12:15-12:30 Pangeo Forge Cloud Tutorial (Charles, video)
    • 12:30-12:40 Break ➡ Breakouts
    • 12:40-1:20 Work on recipes in breakouts
    • 1:20-1:30 Reconvene / wrap up (Rachel & Ryan)
  4. Learning Goals

    • Goal: Understand what Pangeo Forge Recipes does and how it relates to Pangeo Forge Cloud
      Evidence: Explain when someone would need Pangeo Forge Recipes and when someone would need Pangeo Forge Cloud
    • Goal: Learn to use Pangeo Forge Recipes for simple (concat-only) recipes
      Evidence: Write a simple (concat-only) recipe and run it in Binder
    • Goal: Learn to transition a recipe to Pangeo Forge Cloud
      Evidence: Make a PR to staged recipes
    • Goal: Help other people to use Pangeo Forge on their own when the tutorial is done
      Evidence: Generate ideas of how Pangeo Forge could benefit their work
    • Goal: Understand path to become a tool contributor
      Evidence: Create actionable issues on pangeo-forge-recipes issue tracker
  5. Cloud Native Data Analytics

    1. Analysis-Ready, Cloud-Optimized Data
    2. Data-Proximate Computing
    3. Elastic Distributed Processing

    [Diagram: a Compute Environment made up of many compute nodes]
  7. ARCO Data: Analysis Ready, Cloud Optimized

    What is “Analysis Ready”?

    • Think in “Datasets” not “data files”
    • No need for tedious homogenizing / cleaning steps
    • Curated and cataloged

    [Figure: “How do data scientists spend their time?” CrowdFlower Data Science Report (2016). Cleaning and organizing data: 60%; Collecting data sets: 19%; Mining data for patterns: 9%; Refining algorithms: 4%; Other: 5%; Building training sets: 3%]
  8. Example of ARCO Data

    • Chunked appropriately for analysis
    • Rich metadata
    • Everything in one dataset object

    https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
  9. ARCO Data: Analysis Ready, Cloud Optimized

    What is “Cloud Optimized”?

    • Compatible with object storage (access via HTTP)
    • Supports lazy access and intelligent subsetting
    • Integrates with high-level analysis libraries and distributed frameworks
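
As a rough illustration of "lazy access and intelligent subsetting", the sketch below shows the bookkeeping that a chunked layout makes possible: given a requested index range along one axis, work out which chunks actually need to be fetched. The function name `chunks_for_slice` and the array and chunk sizes are hypothetical, chosen only for this example; they are not part of any Pangeo or Zarr API.

```python
def chunks_for_slice(start, stop, chunk_size):
    """Return the indices of the chunks that overlap the range [start, stop)."""
    if stop <= start:
        return []  # empty request: nothing to fetch
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


# Hypothetical array: 10,000 time steps stored in chunks of 100 steps each.
needed = chunks_for_slice(250, 420, chunk_size=100)
print(needed)  # [2, 3, 4]
```

Only three of the hundred chunks are touched, so a reader backed by object storage would need to issue only three HTTP requests for this subset.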
  11. ARCO Data is Fast!

    https://doi.org/10.1109/MCSE.2021.3059437
  12. CMIP6 Cloud Dataset

    • Pangeo partnered with ESGF and Google Cloud to provide a new public dataset
    • > 1 PB and counting
    • Data stored in Zarr format
    • Google provides free hosting in GCS
    • Mirrored on AWS
  17. Problem: Making ARCO Data is Hard!

    To produce useful ARCO data, you must have:

    • Domain Expertise: how to find, clean, and homogenize data
    • Tech Knowledge: how to efficiently produce cloud-optimized formats
    • Compute Resources: a place to stage and upload the ARCO data
    • Communication Skills: to explain to others how to use the data

    Data Scientist 😩
  19. Pangeo Forge

    Let’s democratize the production of ARCO data!

    • Domain Expertise: how to find, clean, and homogenize data

    🤓 Data Scientist
  20. Inspiration: Conda Forge
  21. Pangeo Forge Recipes and Pangeo Forge Cloud

    • Pangeo Forge Recipes: open source Python package for describing and running data pipelines (“recipes”)
      https://github.com/pangeo-forge/pangeo-forge-recipes
    • Pangeo Forge Cloud: cloud platform for automatically executing recipes stored in GitHub repos
      https://pangeo-forge.org/
  22. Pangeo Forge Recipes

    • FilePattern: describes where to find the source files which are the inputs to the recipe
    • StorageConfig: describes where to store the outputs of our recipe
    • Recipe: a complete, self-contained representation of the pipeline
    • Executor: knows how to run the recipe

    https://pangeo-forge.readthedocs.io/
  23. FilePatterns

    Describe where to find the source files which are the inputs to the recipe.

    • ConcatDim (time)
    • MergeDim (variable): temperature, humidity

    Example source file: http://data-provider.org/data/humidity_03.txt
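
The idea behind a file pattern can be sketched in plain Python: one function builds a URL from a (variable, time) key, and the pattern is the full grid of keys. This is an illustration only and does not use the pangeo-forge-recipes classes named on the slide. The list of time steps, the two-digit numbering, and the helper name `make_url` are assumptions extrapolated from the single example URL shown above.

```python
from itertools import product

# Assumed values: only the humidity_03 URL actually appears on the slide.
variables = ["temperature", "humidity"]  # merge dimension (one variable per file)
time_steps = [1, 2, 3]                   # concat dimension (files joined along time)


def make_url(variable, time):
    """Build the source URL for one (variable, time) combination."""
    return f"http://data-provider.org/data/{variable}_{time:02d}.txt"


# The pattern is the full grid of inputs: every variable at every time step.
pattern = {
    (variable, time): make_url(variable, time)
    for variable, time in product(variables, time_steps)
}

print(len(pattern))              # 6
print(pattern[("humidity", 3)])  # http://data-provider.org/data/humidity_03.txt
```

Files that share a variable are concatenated along time, and the resulting per-variable series are merged into one dataset.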
  24. Recipes

    A complete, self-contained representation of a data transformation pipeline.
    (Sensible defaults + lots of options to customize the transformation.)
  25. Executors

    Executors know how to run recipes. Use whichever one makes sense for you!
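
The split between a recipe (what to do) and an executor (how to run it) can be shown with a toy example. Everything below is hypothetical: `ToyRecipe`, `run_serial`, and `run_threaded` are names invented for this sketch and are not the pangeo-forge-recipes API. The point is only that the same recipe object can be handed to different runners and give the same result.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyRecipe:
    """Hypothetical stand-in for a recipe: inputs plus the work to do on each."""
    inputs: List[str]
    process: Callable[[str], str]


def run_serial(recipe):
    """Executor 1: process every input one after another in this process."""
    return [recipe.process(item) for item in recipe.inputs]


def run_threaded(recipe, max_workers=4):
    """Executor 2: process the same inputs with a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with run_serial.
        return list(pool.map(recipe.process, recipe.inputs))


recipe = ToyRecipe(
    inputs=["humidity_01.txt", "humidity_02.txt", "humidity_03.txt"],
    process=lambda name: name.replace(".txt", ".zarr"),
)

serial_result = run_serial(recipe)
threaded_result = run_threaded(recipe)
print(serial_result == threaded_result)  # True
```

Because the recipe carries no knowledge of how it is run, switching from a laptop to a cluster means switching executors, not rewriting the pipeline.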
  26. Pangeo Forge Cloud

    • Feedstock: contains the code and metadata for one or more Recipes
    • Bakery: runs the recipes in the cloud using elastic scaling clusters
    • Storage: GCS

    https://pangeo-forge.org/
  27. Vision: Collaborative Data Curation

    Feedstock, shared by: 🤓 Data User, 🤓 Data Producer, 🤓 Data Manager

    • “These data look weird…”
    • “…Oh the metadata need an update.”
    • “Ok I’ll make a PR to the recipe.”
  28. Pangeo Forge Development

    This is a 💯% open project!

    https://github.com/pangeo-forge/roadmap
  29. You are a Guinea Pig!

    This is all brand new! You are the first people to try Pangeo Forge Cloud. It will almost certainly break. Your feedback will help us improve it. 🙏