Pangeo OGC March 2022

Presentation given at the March 2022 OGC meeting: https://portal.ogc.org/meet/?p=default&mid=88

Ryan Abernathey

March 09, 2022

Transcript

  1. Pangeo: A Community Platform for Big Data Geoscience

    OGC Member Meeting, March 2, 2022
    http://pangeo.io https://discourse.pangeo.io/ https://github.com/pangeo-data/ https://medium.com/pangeo @pangeo_data
  2. Who Am I?

    Physical Oceanographer
    Ph.D. from MIT, 2012
    Associate Prof. at Columbia / LDEO
    https://ocean-transport.github.io/
    Core developer of Xarray
    Core developer of Zarr
    Co-founder of Pangeo
    Open Source Advocate
  3. Pangeo represents a user-centric approach to climate / ocean / weather data analytics.

    We are not data providers. We are data users.
    Pangeo is the user-centric, open-source platform that was missing in 2017 when we started this work.
  4. 80/20 Rule of Data Science

    How do data scientists spend their time? CrowdFlower Data Science Report (2016)
    How a Data Scientist Spends Their Day: Here’s where the popular view of data scientists diverges pretty significantly from reality. Generally, we think of data scientists building algorithms, exploring data, and doing predictive analysis. That’s actually not what they spend most of their time doing, however.
    What data scientists spend the most time doing:
    • Building training sets: 3%
    • Cleaning and organizing data: 60%
    • Collecting data sets: 19%
    • Mining data for patterns: 9%
    • Refining algorithms: 4%
    • Other: 5%
  5. The “Download” Model

    a) file-based approach
    b) database / API approach
    step 1: download
    step 2: clean / organize
    step 3: analyze
    [Diagram labels: query, record, file, files, local disk]
  6. The “Download” Model: MB 😀

    [Same diagram as slide 5: a) file-based approach; b) database / API approach; step 1: download; step 2: clean / organize; step 3: analyze]
  7. The “Download” Model: GB 😐

    [Same diagram as slide 5]
  8. The “Download” Model: TB 😖

    [Same diagram as slide 5]
  9. The “Download” Model: PB 😱

    [Same diagram as slide 5]
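
The jump from MB to PB on the slides above can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a sustained 100 Mbit/s connection and decimal units (1 MB = 10^6 bytes); both are illustrative assumptions, not figures from the talk.

```python
# Rough download times for the dataset sizes named on the slides.
# Assumptions (not from the talk): a sustained 100 Mbit/s link, decimal units.

BANDWIDTH_BITS_PER_S = 100e6  # hypothetical sustained bandwidth

SIZES_BYTES = {
    "MB": 1e6,
    "GB": 1e9,
    "TB": 1e12,
    "PB": 1e15,
}


def download_seconds(size_bytes, bandwidth_bits_per_s=BANDWIDTH_BITS_PER_S):
    """Return the transfer time in seconds, ignoring latency and overhead."""
    return size_bytes * 8 / bandwidth_bits_per_s


def humanize(seconds):
    """Format a duration using the largest unit that keeps the value >= 1."""
    for unit, length in (("years", 365 * 86400), ("days", 86400),
                         ("hours", 3600), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.2f} seconds"


if __name__ == "__main__":
    for label, size in SIZES_BYTES.items():
        print(f"1 {label}: {humanize(download_seconds(size))}")
```

Under these assumptions a terabyte takes most of a day and a petabyte takes roughly two and a half years, before any cleaning or analysis has started.
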
  10. Never Mind …

    How? Let’s “bring the compute to the data”!
  11. The Trouble with “Platforms”

    • Scientists’ creativity often exceeds pre-baked capabilities.
      Desire to go under the hood
    • What if you want to access data that isn’t included?
      Data catalog is determined by provider, not users
    • Platforms are “single instance”:
      Fear of lock-in, possibility platform will disappear
    • Who pays?
  12. OPEN Cloud Architecture

    Data Provider’s $; Data Consumer’s $
  13. OPEN Cloud Architecture

    Data Provider’s $; Data Consumer’s $; Interactive Computing
  14. OPEN Cloud Architecture

    Data Provider’s $; Data Consumer’s $; Interactive Computing; Parallel Computing
  15. OPEN Cloud Architecture

    Data Provider’s $; Data Consumer’s $; Interactive Computing; Parallel Computing; Analysis Ready Data; Cloud Optimized Formats
  16. OPEN Cloud Architecture

    Data Provider’s $; Data Consumer’s $; Interactive Computing; Parallel Computing; Analysis Ready Data; Cloud Optimized Formats
  17. What is Pangeo?

    • Community obsessed with efficient data processing.
      Founded in 2017. Scientists and software developers coming together. http://pangeo.io/
      Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
    • Interoperable Software
      Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
    • Data and Computing Infrastructure
      Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.
  18. The Pangeo Community Process

    Scientific users / use cases
    • Define science questions
    • Use software / infrastructure
    • Identify bugs / bottlenecks
    • Provide feedback to developers
    Open-source software libraries
    • Contribute widely to the open source scientific Python ecosystem
    • Maintain / extend existing libraries, start new ones reluctantly
    • Solve integration challenges
    HPC and cloud infrastructure
    • Deploy interactive analysis environments
    • Curate analysis-ready datasets
    • Platform agnostic
    Agile development 👩💻
  19. Bringing together Academia, Gov. Agencies & Industry
  20. The Pangeo Open-Source Cloud Stack

    Zarr: Cloud-optimized storage for multidimensional arrays.
    Dask: Flexible, general-purpose parallel computing framework.
    Xarray: High-level API for analysis of multidimensional labelled arrays.
    Jupyter: Rich interactive computing environment in the web browser.
    Domain specific packages: xgcm, xrft, xhistogram, gcm-filters, climpred, etc.
    Cloud Services: Kubernetes, Object Storage
  21. Pangeo Cloud Infrastructure

    Compute Services: Dask Gateway
    Node Pools (Autoscaling): preemptible (spot instance), normal
    Data Lakes: Zarr Datasets, each drawn as the objects .zarray, .zattrs, 0.0, 0.1, 1.0, 1.1, 2.0, 2.1
    http://catalog.pangeo.io
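
The `.zarray`, `.zattrs`, `0.0`, `0.1`, … labels on this slide are the object names of a Zarr (version 2) array in object storage: two JSON metadata documents plus one object per chunk, keyed by chunk grid index. A minimal sketch of how those keys arise, in plain Python without the zarr library; the array shape and chunk shape are made-up values chosen to reproduce the six chunk keys shown on the slide.

```python
import itertools
import json
import math


def chunk_keys(shape, chunks, separator="."):
    """List the chunk object names for an array split into a regular grid."""
    # Number of chunks along each dimension (the last chunk may be partial).
    grid = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    return [separator.join(map(str, index))
            for index in itertools.product(*(range(g) for g in grid))]


# Hypothetical array: 3000 x 2000 elements stored in 1000 x 1000 chunks.
shape, chunks = (3000, 2000), (1000, 1000)

# Simplified subset of the fields kept in the .zarray metadata document.
zarray = json.dumps({"zarr_format": 2, "shape": shape, "chunks": chunks,
                     "dtype": "<f4", "order": "C"})

keys = [".zarray", ".zattrs"] + chunk_keys(shape, chunks)
print(keys)
```

Because every chunk is an independent object, a reader can fetch only the chunks a computation touches, and many workers can fetch different chunks at the same time.
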
  22. CMIP6 Cloud Dataset

    • Pangeo partnered with Google Cloud to provide a new public dataset
    • Data stored in Zarr format
    • Google provides free hosting in GCS
    • LDEO does the work of transferring the data from ESGF to GCS
    • Mirrored on AWS
  23. Message 1: The Pangeo Approach Works!

    • Sustainability: We didn’t build very much new stuff; we just helped existing, community-developed tools work together. Open and community-driven from day 1.
    • Simplicity: “Power users” always just want direct, data-proximate access to the raw data.
    • Modularity: The same stack is an effective base layer for apps / dashboards / APIs, etc.
  24. Agencies Using Pangeo

    Recent talks from the Pangeo Showcase Seminar: https://zenodo.org/communities/pangeo/
  25. Message 2: Analysis-Ready, Cloud-Optimized (ARCO) Data is Great

    • Good ARCO data (COG, Zarr, TileDB, Parquet) + S3 obviates the need for some APIs / services.
    • Legacy formats (netCDF / HDF5 / GRIB) don’t always play well with object storage.
    • ARCO data production takes time and skill.
  26. Future Challenge: Data Gravity

    “Data gravity is the ability of a body of data to attract applications, services and other data.” - Dave McCrory
    [Diagram labels, by group]
    NASA (200 PB); NOAA BDP; ASDI (incl. CMIP6); NCAR Datasets; etc…
    Planetary Computer; NOAA BDP
    Earth Engine; NOAA BDP; Descartes; Pangeo
    SentinelHub; Climate Change; Atmosphere; Marine; ECMWF
    DOE; XSEDE; HECC; NCAR
  27. Data Gravity

    What is the stable steady-state solution?
    DOE; XSEDE; HECC; NCAR; ?
  28. We need a global Scientific Data Commons

    Edge storage, decentralized web, web3
    DOE; XSEDE; HECC; NCAR; ?
  29. Summary

    • The Pangeo approach (open source, modular, collaborative) has been embraced by both science “power users” and builders of Earth-System analytics platforms.
    • Standards are extremely important for modular, open-source analytics platforms!
    • Analysis Ready, Cloud Optimized Data in object storage is the foundation of performant and flexible cloud Earth-System analytics platforms.
    • We need a global scientific commons that lives outside the big cloud providers. Otherwise data gravity will suck all of science into AWS.
    Thanks to our funders!
  30. Learn More

    http://pangeo.io https://discourse.pangeo.io/ https://github.com/pangeo-data/ https://medium.com/pangeo @pangeo_data
  31. ARCO Data + HTTP (S3) Is More Performant and Flexible than a Bespoke API

    https://xpublish.readthedocs.io/
    Serve dynamically generated Zarr data over HTTP. Client can’t tell the difference.
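
One way to see why the client cannot tell the difference: a Zarr client asks for keys and receives bytes, whether those bytes sit in a bucket or are computed on request. The sketch below illustrates that idea with a plain Python mapping. It is not the xpublish API, and the class name, chunk grid, and chunk contents are invented for illustration.

```python
import json
import struct
from collections.abc import Mapping


class DynamicChunkStore(Mapping):
    """Toy key-value store whose chunk bytes are computed when requested."""

    def __init__(self, n_chunks, chunk_len):
        self.n_chunks = n_chunks
        self.chunk_len = chunk_len

    def _keys(self):
        return [".zarray"] + [str(i) for i in range(self.n_chunks)]

    def __getitem__(self, key):
        if key == ".zarray":
            # Simplified metadata document describing the 1-D array.
            meta = {"zarr_format": 2,
                    "shape": [self.n_chunks * self.chunk_len],
                    "chunks": [self.chunk_len], "dtype": "<f8"}
            return json.dumps(meta).encode()
        if key not in self._keys():
            raise KeyError(key)
        # Generate chunk i on the fly: consecutive values as little-endian float64.
        start = int(key) * self.chunk_len
        values = [float(v) for v in range(start, start + self.chunk_len)]
        return struct.pack(f"<{self.chunk_len}d", *values)

    def __iter__(self):
        return iter(self._keys())

    def __len__(self):
        return len(self._keys())


store = DynamicChunkStore(n_chunks=3, chunk_len=4)
# The "client" side: read a key, decode the bytes, never see how they were made.
chunk = struct.unpack("<4d", store["1"])
print(chunk)
```

Putting the same mapping behind an HTTP endpoint changes the transport but not the contract: keys in, bytes out.
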
  32. Pangeo Forge: Democratizing ARCO Data Production

    https://pangeo-forge.org/
    An open source platform for creating ARCO datasets. We crowdsource “recipes” for ARCO data from the global science community. Cloud automation builds the datasets in a scalable and reproducible way.
  33. Kerchunk: Make your Legacy data look and feel like Zarr

    • Provides a unified way to represent a variety of chunked, compressed binary data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …)
    • Allows efficient access to data from traditional file systems or cloud object storage.
    • Create virtual datasets from multiple files by extracting the byte ranges, compression information etc. and storing this metadata in a new, separate object.
    • Open spec, Python implementation. https://fsspec.github.io/kerchunk/
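
The third bullet can be illustrated with a toy example. The sketch below does not use kerchunk itself. It writes a small binary file standing in for a legacy file, records where each chunk lives as `[path, offset, length]` entries (loosely modeled on the shape of a byte-range reference), and then reads a chunk back using only that index. The file layout, header size, and chunk names are all invented for illustration.

```python
import json
import os
import tempfile

# Build a stand-in "legacy" file: an opaque header followed by three chunks.
header = b"LEGACYHDR"
chunk_payloads = [b"aaaa", b"bbbbbb", b"cc"]

path = os.path.join(tempfile.mkdtemp(), "legacy.bin")
with open(path, "wb") as f:
    f.write(header + b"".join(chunk_payloads))

# Scan step: record the byte range of every chunk in a separate index.
refs = {}
offset = len(header)
for i, payload in enumerate(chunk_payloads):
    refs[f"data/{i}"] = [path, offset, len(payload)]
    offset += len(payload)

# The index is plain JSON, so it can be stored apart from the original file.
index_json = json.dumps({"version": 1, "refs": refs})


def read_chunk(index, key):
    """Fetch one chunk by seeking to its recorded byte range."""
    file_path, start, length = index["refs"][key]
    with open(file_path, "rb") as f:
        f.seek(start)
        return f.read(length)


index = json.loads(index_json)
print(read_chunk(index, "data/1"))
```

Against object storage the same seek-and-read becomes an HTTP range request, so the original file never has to be rewritten or copied.
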
  34. Two Papers

    https://doi.org/10.1029/2020AV000354
    https://doi.org/10.1109/MCSE.2021.3059437