Open Science with Pangeo - From community to climate science in the cloud

Presentation at the UCLA IDRE Open Science Symposium

Julius Busecke

March 24, 2022

Transcript

  1. Open Science with Pangeo: From community to climate science in the cloud

    Julius Busecke | UCLA IDRE Open Science Workshop | Mar 24 2022 | Contains slides adopted from Ryan Abernathey
  2. Who am I?

    Physical Oceanographer. Studies the role of ocean currents for the climate and ecosystems. Associate Research Scientist, Columbia University. Core-Developer of xgcm. Core-Developer of cmip6_preprocessing. Pangeo Fan, User, and Member. Open Source/Open Science Advocate.
  3. How do we get to work with all this data?

    FTP / OPeNDAP / etc. Download Files
  4. Privileged Institutions create “Data Fortresses*”

    Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons. *Coined by Chelle Gentemann
  6. Privileged Institutions create “Data Fortresses*”

    Data. Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons. *Coined by Chelle Gentemann
  8. Privileged Institutions create “Data Fortresses*”

    Data. ❌ Results not reproducible outside fortress ❌ Barrier to collaboration ❌ Inefficient / duplicative ❌ Can’t scale to future data needs ❌ Limits inclusion and knowledge transfer. Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons. *Coined by Chelle Gentemann
  9. The Trouble with “Platforms”

    Scientists’ creativity often exceeds pre-baked capabilities: desire to go under the hood. What if you want to access data that isn’t included? Data catalog is determined by provider, not users.
  10. The Trouble with “Platforms”

    Scientists’ creativity often exceeds pre-baked capabilities: desire to go under the hood. What if you want to access data that isn’t included? Data catalog is determined by provider, not users. Platforms are “single instance”: fear of lock-in, possibility platform will disappear.
  11. OPEN Cloud Architecture

    Diagram labels: Data Provider’s $ | Data Consumer’s $ | Interactive Computing | Parallel Computing | Analysis Ready Data | Cloud Optimized Formats
  13. Open Ocean Cloud Will be a “Data Watering Hole*”

    *Coined by Fernando Perez
  14. Open Ocean Cloud Will be a “Data Watering Hole*”

    Data Library. Compute Environment. *Coined by Fernando Perez
  15. Open Ocean Cloud Will be a “Data Watering Hole*”

    Data Library. Compute Environment. 👩💻👨💻👩💻 Group A: Air-Sea Interaction. 👩💻👨💻👩💻 Group B: Seasonal Forecasting. Research. Education. Industry. *Coined by Fernando Perez
  16. Open Ocean Cloud Will be a “Data Watering Hole*”

    Data Library. Compute Environment. 👩💻👨💻👩💻 Group A: Air-Sea Interaction. 👩💻👨💻👩💻 Group B: Seasonal Forecasting. Research. Education. Industry. ✅ Faster science, more discoveries ✅ Inherently reproducible ✅ Allows seamless global collaboration ✅ Unleashes creativity ✅ Cost effective ✅ Accessible to all ✅ Connects with industry. *Coined by Fernando Perez
  17. What is Pangeo?

    Community obsessed with efficient data processing. Founded in 2017. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
    Interoperable Software: Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
    Data and Computing Infrastructure: Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.
  19. CMIP6 Cloud Dataset

    • Pangeo partnered with ESGF and Google Cloud to provide a new public dataset • > 1 PB and counting • Data stored in Zarr format • Google provides free hosting in GCS • Mirrored on AWS https://pangeo-data.github.io/pangeo-cmip6-cloud/
  20. Analyzing Petabyte scale climate data in your browser with Pangeo

    Custom Analysis applied to each model and member
  21. Different dimension names in the CMIP data.

    Not quite analysis-ready. No! Time to homogenize data!
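The homogenization step that slide 21 calls for can be illustrated with a minimal sketch. It assumes each model labels the same axes differently and translates those labels to one shared convention. The alias table and the two example models are hypothetical, chosen only for illustration; this is not the API of cmip6_preprocessing or any other package. With xarray, the analogous step would typically go through `Dataset.rename`.

```python
# Hypothetical alias table: raw dimension names grouped under a common name.
# The aliases listed here are illustrative assumptions, not a complete list.
DIM_ALIASES = {
    "x": ("x", "i", "ni", "nlon", "lon", "longitude"),
    "y": ("y", "j", "nj", "nlat", "lat", "latitude"),
    "lev": ("lev", "olevel", "deptht", "depth"),
    "time": ("time", "time_counter"),
}

# Invert the table once so that each raw name points at its common name.
RAW_TO_COMMON = {
    raw: common for common, raws in DIM_ALIASES.items() for raw in raws
}


def homogenize_dims(dims):
    """Translate dimension names to the common convention.

    Names without a known alias are passed through unchanged.
    """
    return tuple(RAW_TO_COMMON.get(name, name) for name in dims)


# Two made-up models describing the same kind of variable with different names.
model_a = ("time", "lev", "nlat", "nlon")
model_b = ("time", "olevel", "j", "i")

print(homogenize_dims(model_a))
print(homogenize_dims(model_b))
```

After translation both models expose the same dimension names, so one analysis function can be applied to each of them without per-model special cases.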
  22. There is!

    Analysis Ready Data in the cloud + Crowd-Sourced Data Cleaning (peer-to-peer learning)
  23. There is!

    Analysis Ready Data in the cloud + Crowd-Sourced Data Cleaning (peer-to-peer learning) = Less data wrangling, more 💡
  24. ARCO Data: Analysis Ready, Cloud Optimized

    What is “Analysis Ready”? • Think in “Datasets” not “data files” • No need for tedious homogenizing / cleaning steps • Curated and cataloged
    Figure: “How do data scientists spend their time?”, CrowdFlower Data Science Report (2016). Cleaning and organizing data: 60%; Collecting data sets: 19%; Mining data for patterns: 9%; Refining algorithms: 4%; Building training sets: 3%; Other: 5%
  25. Example of ARCO Data

    Chunked appropriately for analysis. Rich metadata. Everything in one dataset object. https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
  26. ARCO Data: Analysis Ready, Cloud Optimized

    What is “Cloud Optimized”? • Compatible with object storage (access via HTTP) • Supports lazy access and intelligent subsetting • Integrates with high-level analysis libraries and distributed frameworks
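To make “lazy access and intelligent subsetting” on slide 26 concrete, here is a minimal sketch of the index arithmetic a chunked store allows: given a requested index range and the chunk shape, only the overlapping chunks need to be fetched over HTTP. The function names, array shape, and chunk sizes are hypothetical. The dot-separated key layout follows the default convention of Zarr version 2 stores and is used here as an assumption, not as a description of any specific dataset.

```python
from itertools import product


def chunks_for_slice(start, stop, chunk_size):
    """Return indices of the chunks overlapping the half-open range [start, stop)."""
    if stop <= start:
        return []
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


def chunk_keys(selection, chunk_shape):
    """List dot-separated chunk keys needed for an N-dimensional selection.

    `selection` holds one (start, stop) pair per dimension.
    """
    per_dim = [
        chunks_for_slice(start, stop, size)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return [".".join(str(i) for i in idx) for idx in product(*per_dim)]


# Hypothetical array: (time=1000, lat=180, lon=360) stored in (100, 90, 180) chunks.
chunk_shape = (100, 90, 180)

# Request ten time steps over a small region that straddles one chunk boundary.
selection = ((250, 260), (0, 90), (170, 190))

print(chunk_keys(selection, chunk_shape))
```

In this made-up layout the array is split into 10 × 2 × 2 = 40 chunks, and the request touches two of them, so the rest never has to leave object storage.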
  27. Problem: Making ARCO Data is Hard!

    To produce useful ARCO data, you must have: Domain Expertise: how to find, clean, and homogenize data. Tech Knowledge: how to efficiently produce cloud-optimized formats. Compute Resources: a place to stage and upload the ARCO data. Communication Skills: to explain to others how to use the data. Data Scientist 😩
  28. Pangeo Forge

    Let’s democratize the production of ARCO data! Domain Expertise: how to find, clean, and homogenize data. 🤓 Data Scientist
  29. Pangeo Forge Recipes and Pangeo Forge Cloud

    Pangeo Forge Recipes: Open source Python package for describing and running data pipelines (“recipes”). https://github.com/pangeo-forge/pangeo-forge-recipes
    Pangeo Forge Cloud: Cloud platform for automatically executing recipes stored in GitHub repos. https://pangeo-forge.org/
  30. Learn More

    http://pangeo.io https://discourse.pangeo.io/ https://github.com/pangeo-data/ https://medium.com/pangeo @pangeo_data @JuliusBusecke jbusecke [email protected]
  31. ARCO Data is Fast!

    https://doi.org/10.1109/MCSE.2021.3059437