Pangeo - NASA ESD Workshop

Ryan Abernathey
November 16, 2021

Talk given at NASA Earth Systems Division Analysis and Information Products Working Group

Transcript

  1. Pangeo NASA ESD Workshop

    Nov. 8, 2020. Thanks to our funders!
  2. Two Papers

    https://doi.org/10.1029/2020AV000354
    https://doi.org/10.1109/MCSE.2021.3059437
  3. Sample of existing ESD platforms relevant to Analysis and Information Products, including lessons learned, progress and future plans.
  4. What is Pangeo?

    • Community obsessed with efficient data processing. Founded in 2016. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
    • Interoperable Software: Foundation in Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
    • Data and Computing Infrastructure: Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.
  5. The Pangeo Cloud Stack

    [Diagram: a Cloud Object Store holding Zarr chunks (0.0, 1.0, 2.0) and .zattrs metadata, accessed over HTTP GET by a Cloud Compute Cluster made up of Dask workers and a Jupyter pod.]
    http://pangeo.io/cloud.html
  6. Lesson 1: The Pangeo Approach Works

    • Sustainability: We didn’t build very much new stuff; we just helped existing, community-developed tools work together. Open and community-driven from day 1.
    • Simplicity: “Power users” always just want direct, data-proximate access to the raw data.
    • Modularity: The same stack is an effective base layer for apps / dashboards / APIs, etc.
  7. Agencies Using Pangeo

    Recent talks from the Pangeo Showcase Seminar: https://zenodo.org/communities/pangeo/
  8. Lesson 2: Analysis-Ready, Cloud-Optimized (ARCO) Data is Great

    • Good ARCO data (Zarr, TileDB, Parquet) + S3 obviates the need for some APIs / services.
    • Legacy formats (netCDF / HDF5 / GRIB) don’t always play well with object storage.
    • ARCO data production takes time and skill.
  9. ARCO Data + HTTP (S3) Is More Performant and Flexible than a Bespoke API

    https://xpublish.readthedocs.io/
    Serve dynamically generated Zarr data over HTTP. Client can’t tell the difference.
  10. Pangeo Forge: Democratizing ARCO Data Production

    https://pangeo-forge.org/
    An open source platform for creating ARCO datasets. We crowdsource “recipes” for ARCO data from the global science community. Cloud automation builds the datasets in a scalable and reproducible way.
  11. Kerchunk: Make Your Legacy Data Look and Feel Like Zarr

    • Provides a unified way to represent a variety of chunked, compressed binary data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …)
    • Allows efficient access to data from traditional file systems or cloud object storage.
    • Create virtual datasets from multiple files by extracting the byte ranges, compression information etc. and storing this metadata in a new, separate object.
    • Open spec, Python implementation. https://fsspec.github.io/kerchunk/
  12. Lesson Learned: Data Gravity

    “Data gravity is the ability of a body of data to attract applications, services and other data.” - Dave McCrory
    [Diagram labels: NASA (200 PB), NOAA BDP, ASDI (incl. CMIP6), NCAR Datasets, etc…, Planetary Computer, NOAA BDP, Earth Engine, NOAA BDP, Descartes, Pangeo, SentinelHub, Climate Change, Atmosphere, Marine, ECMWF, DOE, XSEDE, HECC, NCAR]
  13. Data Gravity

    What is the stable steady-state solution?
    [Diagram labels: DOE, XSEDE, HECC, NCAR, ?]
  14. We Need a Global Scientific Data Commons

    Edge storage, decentralized web, web3
    [Diagram labels: DOE, XSEDE, HECC, NCAR, ?]
  15. Summary

    • The Pangeo approach has been embraced by both science “power users” and builders of ESD platforms.
    • Analysis Ready, Cloud Optimized Data is the foundation of performant and flexible cloud ESD platforms. Stretch goal: could all the platforms from this session share the same base data?
    • We need a global scientific commons that lives outside the big cloud providers. Otherwise data gravity will suck all of science into AWS.
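
The Kerchunk slide describes extracting the byte ranges of chunks from legacy files and storing that metadata in a new, separate object. The sketch below illustrates that idea with the Python standard library only; it does not use Kerchunk's API. The file layout, the `temp/0`-style keys, and the helper names (`write_legacy_file`, `build_references`, `read_chunk`) are all hypothetical, and the reference layout (key mapped to path, offset, length) is only loosely modeled on the kind of index the slide describes.

```python
import json
import os
import tempfile
import zlib

HEADER = b"LEGACY-HEADER\x00"  # stand-in for format-specific header bytes


def write_legacy_file(path, chunks):
    """Write compressed chunks back to back after a header.

    Returns the (offset, length) byte range of each compressed chunk.
    """
    ranges = []
    with open(path, "wb") as f:
        f.write(HEADER)
        for raw in chunks:
            blob = zlib.compress(raw)
            ranges.append((f.tell(), len(blob)))
            f.write(blob)
    return ranges


def build_references(path, ranges, var="temp"):
    """Map chunk keys ("temp/0", "temp/1", ...) to [path, offset, length]."""
    return {f"{var}/{i}": [path, off, n] for i, (off, n) in enumerate(ranges)}


def read_chunk(refs, key):
    """Fetch one chunk using only the reference index, not the file's own format."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(length))


# Build a toy "legacy" file, then store the byte-range index as a separate JSON object.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "legacy.bin")
ranges = write_legacy_file(data_path, [b"a" * 1000, b"b" * 1000])

refs_path = os.path.join(workdir, "refs.json")
with open(refs_path, "w") as f:
    json.dump(build_references(data_path, ranges), f)

# A reader needs only the JSON index and range reads on the original file.
with open(refs_path) as f:
    loaded = json.load(f)
second_chunk = read_chunk(loaded, "temp/1")
```

The original file is never rewritten: the index is a separate object, and each chunk is fetched with a single ranged read, which is the access pattern that object stores serve through HTTP range requests.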