Going both far AND fast as a community - Climate Science with Pangeo

Julius Busecke
May 19, 2022

Talk slides from the LANL COSIM Climate and Ocean Group Webinar on 05/18/2022

Transcript

  1. Going both far AND fast as a community: Climate Science with Pangeo

    Julius Busecke | LANL COSIM Climate and Ocean Group Webinar | 05/18/2022
  2. Who am I?

    Physical Oceanographer. Studies the role of ocean currents for the climate and ecosystems. Associate Research Scientist, Columbia University. Core-Developer of xgcm. Core-Developer of cmip6_preprocessing. Maintainer of the Pangeo CMIP6 Cache. Pangeo Fan, User, and Member. Open Source/Open Science Advocate.
  3. Outline

    Why should we change the way we do science? How can we do it? It already works!
  4. How do we get to work with all this data?

    FTP / OPeNDAP / etc. Download Files
  9. Privileged Institutions create “Data Fortresses*”

    ❌ Results not reproducible outside fortress ❌ Barrier to collaboration ❌ Inefficient / duplicative ❌ Can’t scale to future data needs ❌ Limits inclusion and knowledge transfer
    *Coined by Chelle Gentemann. Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons
  11. The Trouble with “Platforms”

    Scientists’ creativity often exceeds pre-baked capabilities: desire to go under the hood. What if you want to access data that isn’t included? Data catalog is determined by provider, not users. Platforms are “single instance”: fear of lock-in, possibility platform will disappear.
  12. OPEN Cloud Architecture

    Data Provider’s $ | Data Consumer’s $ | Interactive Computing | Parallel Computing | Analysis Ready Data | Cloud Optimized Formats
  17. Open Ocean Cloud Will be a “Data Watering Hole*”

    Data Library | Compute Environment. 👩💻👨💻👩💻 Group A: Air-Sea Interaction. 👩💻👨💻👩💻 Group B: Seasonal Forecasting. Research | Education | Industry.
    ✅ Faster science, more discoveries ✅ Inherently reproducible ✅ Allows seamless global collaboration ✅ Unleashes creativity ✅ Cost effective ✅ Accessible to all ✅ Connects with industry
    *Coined by Fernando Perez
  18. What is Pangeo?

    Community obsessed with efficient data processing. Founded in 2017. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
    Interoperable Software: Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
    Data and Computing Infrastructure: Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.
  20. CMIP6 Cloud Dataset

    • Pangeo partnered with ESGF and Google Cloud to provide a new public dataset • > 1 PB and counting • Data stored in Zarr format • Google provides free hosting in GCS • Mirrored on AWS
    https://pangeo-data.github.io/pangeo-cmip6-cloud/
  21. Analyzing Petabyte scale climate data in your browser with Pangeo

    Custom Analysis applied to each model and member
  22. Different dimension names in the CMIP data.

    Not quite analysis-ready. No! Time to homogenize data!
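The homogenizing step on this slide can be illustrated with a minimal, library-free sketch. It assumes each model stores the same variable under different dimension names and maps them to one common convention through a lookup table. The alias table and the helper names `build_rename_map` and `homogenize` are hypothetical, chosen only for illustration; they are not the API of cmip6_preprocessing or Xarray, where a comparable rename would normally be applied to a dataset object.

```python
# Hypothetical lookup table: common name -> names that different models might use.
# The aliases are illustrative assumptions, not a complete or authoritative CMIP6 list.
DIM_ALIASES = {
    "x": ["x", "i", "ni", "nlon"],
    "y": ["y", "j", "nj", "nlat"],
    "lev": ["lev", "olevel", "deptht"],
}


def build_rename_map(dims):
    """Return {old_name: common_name} for every dimension that needs renaming."""
    reverse = {
        alias: common
        for common, aliases in DIM_ALIASES.items()
        for alias in aliases
    }
    return {d: reverse[d] for d in dims if d in reverse and reverse[d] != d}


def homogenize(dims):
    """Apply the rename map to a tuple of dimension names, keeping their order."""
    rename = build_rename_map(dims)
    return tuple(rename.get(d, d) for d in dims)


# Two made-up models that describe the same 4-D field with different names.
model_a = ("time", "lev", "j", "i")
model_b = ("time", "olevel", "nlat", "nlon")

print(homogenize(model_a))
print(homogenize(model_b))
```

Once both models share the same dimension names, one analysis function can be applied to each of them without model-specific branches.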
  24. There is!

    Analysis Ready Data in the cloud + Crowd-Sourced Data Cleaning (peer-to-peer learning) = Less data wrangling, more 💡
  25. ARCO Data: Analysis Ready, Cloud Optimized. What is “Analysis Ready”?

    • Think in “Datasets” not “data files” • No need for tedious homogenizing / cleaning steps • Curated and cataloged
    [Figure: How do data scientists spend their time? CrowdFlower Data Science Report (2016). Cleaning and organizing data: 60%; Collecting data sets: 19%; Mining data for patterns: 9%; Refining algorithms: 4%; Building training sets: 3%; Other: 5%]
  26. Chunked appropriately for analysis. Rich metadata. Everything in one dataset object.

    https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
  27. ARCO Data: Analysis Ready, Cloud Optimized. What is “Cloud Optimized”?

    • Compatible with object storage (access via HTTP) • Supports lazy access and intelligent subsetting • Integrates with high-level analysis libraries and distributed frameworks
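To make “intelligent subsetting” concrete, here is a minimal sketch of the bookkeeping behind chunked storage: given a selection and a chunk shape, work out which chunks have to be fetched at all. The function names and the array and chunk sizes are hypothetical, and this is not how Zarr or Xarray implement it internally; it only illustrates why a small selection touches a small part of a large array.

```python
from itertools import product


def chunks_for_slice(start, stop, chunk_size):
    """Indices of the chunks along one axis that overlap the half-open range [start, stop)."""
    if stop <= start:
        return []
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


def chunks_for_selection(selection, chunk_shape):
    """Chunk keys needed for an N-d selection given as one (start, stop) pair per axis."""
    per_axis = [
        chunks_for_slice(start, stop, size)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return list(product(*per_axis))


# Assumed layout: a (time=1200, y=180, x=360) array stored in chunks of (120, 180, 360),
# which gives 10 chunks in total, all of them along the time axis.
chunk_shape = (120, 180, 360)

# Select time steps 100..129 over the full horizontal domain.
needed = chunks_for_selection([(100, 130), (0, 180), (0, 360)], chunk_shape)
print(needed)
print(f"{len(needed)} of 10 chunks need to be read")
```

Under these assumptions only the chunks that overlap the requested time range are read; the rest of the array is never transferred.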
  28. ARCO Data is Fast!

    https://doi.org/10.1109/MCSE.2021.3059437
  29. Making ARCO Data is Hard!

    To produce useful ARCO data, you must have:
    Domain Expertise: How to find, clean, and homogenize data.
    Tech Knowledge: How to efficiently produce cloud-optimized formats.
    Compute Resources: A place to stage and upload the ARCO data.
    Communication Skills: To explain to others how to use the data.
    Data Scientist 😩
  30. Let’s democratize the production of ARCO data!

    Domain Expertise: How to find, clean, and homogenize data. 🤓 Data Scientist
  31. Pangeo Forge

    Pangeo Forge Recipes: Open source Python package for describing and running data pipelines (“recipes”). https://github.com/pangeo-forge/pangeo-forge-recipes
    Pangeo Forge Cloud: Cloud platform for automatically executing recipes stored in GitHub repos. https://pangeo-forge.org/
  32. A concrete example: Our ARCO CMIP6 cache

    Datasets are directly downloaded from ESGF nodes and minimally processed (Files -> Dataset). Carry complete metadata, handle_id etc. for provenance. We regularly filter our catalog for retracted datasets. WIP: Fully automate workflow with pangeo-forge.
    Balaji, V. et al. Requirements for a global data infrastructure in support of CMIP6. Geosci. Model Dev. 11, 3659–3680 (2018).
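The retraction filter mentioned on this slide could look roughly like the sketch below. It assumes the catalog is a list of records and that retracted datasets are known by identifier. The record layout, the field name `dataset_id`, the function `drop_retracted`, and all identifiers are placeholders invented for illustration; they do not reflect the actual catalog schema or any ESGF service.

```python
def drop_retracted(catalog, retracted_ids):
    """Split catalog records into (kept, removed) based on a collection of retracted ids."""
    retracted = set(retracted_ids)  # set membership keeps the filter fast for large catalogs
    kept = [entry for entry in catalog if entry["dataset_id"] not in retracted]
    removed = [entry for entry in catalog if entry["dataset_id"] in retracted]
    return kept, removed


# Placeholder records; ids and store paths are made up.
catalog = [
    {"dataset_id": "model-A.tos.v1", "store": "bucket/model-A/tos/v1"},
    {"dataset_id": "model-A.tos.v2", "store": "bucket/model-A/tos/v2"},
    {"dataset_id": "model-B.tos.v1", "store": "bucket/model-B/tos/v1"},
]
retracted_ids = ["model-A.tos.v1"]

kept, removed = drop_retracted(catalog, retracted_ids)
print([entry["dataset_id"] for entry in kept])
print([entry["dataset_id"] for entry in removed])
```

Returning the removed records as well makes it possible to log or review what was dropped instead of deleting entries silently.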
  33. A concrete example: Our ARCO CMIP6 cache

    Questions? Suggestions? Join the Pangeo / ESGF Cloud Data Working Group biweekly calls! https://pangeo-data.github.io/pangeo-cmip6-cloud/
  34. I ❤ Feedback, questions, contributions.

    http://pangeo.io https://discourse.pangeo.io/ https://github.com/pangeo-data/ https://medium.com/pangeo @pangeo_data @JuliusBusecke jbusecke [email protected]
    • CMIP data in the cloud • Pangeo Cloud Deployments - https://pangeo.io/cloud.html
  35. Oxygen Minimum Zones in the Tropical Pacific: Will they expand or shrink?

    Ocean Science Meeting 2022 | Julius Busecke (@JuliusBusecke) | Laure Resplandy | Sam Ditkovsky | Jasmin John