Pangeo - NASA ESD Workshop

Pangeo NASA ESD Workshop, Nov. 8, 2020
Thanks to our funders!

Two Papers
https://doi.org/10.1029/2020AV000354
https://doi.org/10.1109/MCSE.2021.3059437

Sample of existing ESD platforms relevant to Analysis and Information Products, including lessons learned, progress and future plans.

What is Pangeo?
• Community obsessed with efficient data processing. Founded in 2016. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse forum. Annual meeting. Workshops at AGU / AMS / etc.
• Interoperable software. Foundation in scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
• Data and computing infrastructure. Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.

The Pangeo Cloud Stack
[Diagram: a Cloud Object Store holding chunks (0.0, 1.0, 2.0) and .zattrs metadata, read over HTTP GET by a Cloud Compute Cluster made up of Dask workers and a Jupyter pod.]
http://pangeo.io/cloud.html
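As a rough illustration of what the diagram implies, the sketch below lists the object keys that a worker would request with HTTP GET for one chunked array. It assumes the Zarr v2 layout, in which chunk keys are dot-separated grid indices and metadata lives in `.zarray` and `.zattrs`. The base URL, the array name `sst`, and the helper names are hypothetical.

```python
from itertools import product
from math import ceil


def chunk_keys(shape, chunks, separator="."):
    """List the chunk keys of one array (assumes Zarr v2 style key naming)."""
    # Number of chunks along each dimension; the last chunk may be partial.
    grid = [ceil(size / chunk) for size, chunk in zip(shape, chunks)]
    # One key per grid position, e.g. "2.0" for the third chunk along axis 0.
    return [
        separator.join(str(i) for i in index)
        for index in product(*[range(n) for n in grid])
    ]


def object_urls(base_url, array_name, shape, chunks):
    """Build the URLs a Dask worker would fetch with plain HTTP GET."""
    prefix = f"{base_url.rstrip('/')}/{array_name}"
    # Metadata objects come first, then one object per chunk.
    urls = [f"{prefix}/.zarray", f"{prefix}/.zattrs"]
    urls += [f"{prefix}/{key}" for key in chunk_keys(shape, chunks)]
    return urls


# Hypothetical dataset location and array name, for illustration only.
urls = object_urls(
    "https://example.com/bucket/dataset.zarr", "sst", shape=(30, 10), chunks=(10, 10)
)
for url in urls:
    print(url)
```

Because every chunk is an independent object, each worker can fetch its own subset in parallel without any coordinating service in between.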

Lesson 1: The Pangeo Approach Works
• Sustainability: We didn’t build very much new stuff; we just helped existing, community-developed tools work together. Open and community-driven from day 1.
• Simplicity: “Power users” always just want direct, data-proximate access to the raw data.
• Modularity: The same stack is an effective base layer for apps / dashboards / APIs, etc.

Agencies Using Pangeo
Recent talks from the Pangeo Showcase Seminar: https://zenodo.org/communities/pangeo/

Lesson 2: Analysis-Ready, Cloud-Optimized (ARCO) Data is Great
• Good ARCO data (Zarr, TileDB, Parquet) + S3 obviates the need for some APIs / services.
• Legacy formats (netCDF / HDF5 / GRIB) don’t always play well with object storage.
• ARCO data production takes time and skill.

ARCO Data + HTTP (S3) Is More Performant and Flexible than a Bespoke API
https://xpublish.readthedocs.io/
Serve dynamically generated Zarr data over HTTP. Client can’t tell the difference.
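The point that the client cannot tell the difference can be sketched with a toy key-to-bytes store. This is not the xpublish API. It is a minimal standard-library illustration, assuming a hypothetical 1-D layout with keys "0", "1", … and float64 chunks: one store holds precomputed bytes, the other computes them on request, and the same client code reads both.

```python
import struct
from collections.abc import Mapping


def encode_chunk(values):
    """Pack a list of floats as little-endian float64 bytes."""
    return struct.pack(f"<{len(values)}d", *values)


class DynamicStore(Mapping):
    """Read-only key -> bytes mapping whose chunks are computed on request."""

    def __init__(self, n_chunks, chunk_len):
        self.n_chunks = n_chunks
        self.chunk_len = chunk_len

    def __getitem__(self, key):
        try:
            index = int(key)  # hypothetical layout: keys are "0", "1", ...
        except ValueError:
            raise KeyError(key) from None
        if not 0 <= index < self.n_chunks:
            raise KeyError(key)
        start = index * self.chunk_len
        # Nothing is stored: the chunk is generated when it is asked for.
        return encode_chunk([float(i) for i in range(start, start + self.chunk_len)])

    def __iter__(self):
        return (str(i) for i in range(self.n_chunks))

    def __len__(self):
        return self.n_chunks


def read_all(store):
    """Client code: works with any mapping of chunk keys to bytes."""
    values = []
    for key in sorted(store, key=int):
        raw = store[key]
        values.extend(struct.unpack(f"<{len(raw) // 8}d", raw))
    return values


# A static store (precomputed bytes) and a dynamic one look identical to the client.
static_store = {"0": encode_chunk([0.0, 1.0]), "1": encode_chunk([2.0, 3.0])}
dynamic_store = DynamicStore(n_chunks=2, chunk_len=2)
same = read_all(static_store) == read_all(dynamic_store)
print(same)
```

Over HTTP the same idea applies with URL paths in place of dictionary keys: whether the bytes behind a path sit in an object store or are produced by a server on the fly is invisible to the reader.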

Pangeo Forge: Democratizing ARCO Data Production
https://pangeo-forge.org/
An open source platform for creating ARCO datasets. We crowdsource “recipes” for ARCO data from the global science community. Cloud automation builds the datasets in a scalable and reproducible way.

Kerchunk: Make Your Legacy Data Look and Feel Like Zarr
• Provides a unified way to represent a variety of chunked, compressed binary data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …)
• Allows efficient access to data from traditional file systems or cloud object storage.
• Create virtual datasets from multiple files by extracting the byte ranges, compression information etc. and storing this metadata in a new, separate object.
• Open spec, Python implementation.
https://fsspec.github.io/kerchunk/
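The byte-range idea in the bullets above can be sketched with the standard library alone. The example below does not call kerchunk. It writes a stand-in "legacy" file, records where each chunk lives in a separate JSON object, and reads chunks back through that object. The reference layout (key mapped to file, offset, length) is loosely modelled on kerchunk's reference JSON. The file contents, the key names, and `read_chunk` are hypothetical, and a real reference set would also carry metadata such as compression information.

```python
import json
import os
import tempfile

# Stand-in for a legacy file: a header followed by two raw chunks.
header = b"LEGACY-HEADER--"
chunk_a, chunk_b = b"AAAAAAAA", b"BBBBBBBB"
path = os.path.join(tempfile.mkdtemp(), "legacy.bin")
with open(path, "wb") as f:
    f.write(header + chunk_a + chunk_b)

# Reference set: Zarr-style key -> [file, byte offset, byte length].
# Offsets are written by hand here; a real tool would find them by scanning the file.
references = {
    "version": 1,
    "refs": {
        "data/0": [path, len(header), len(chunk_a)],
        "data/1": [path, len(header) + len(chunk_a), len(chunk_b)],
    },
}

# The references are stored as a new, separate object; the legacy file is untouched.
reference_json = json.dumps(references)


def read_chunk(reference_json, key):
    """Fetch one chunk by seeking straight to its recorded byte range."""
    file_path, offset, length = json.loads(reference_json)["refs"][key]
    with open(file_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


print(read_chunk(reference_json, "data/1"))
```

On object storage the seek-and-read step would become an HTTP range request, so a reader only transfers the chunks it needs rather than the whole file.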

Lesson Learned: Data Gravity
“Data gravity is the ability of a body of data to attract applications, services and other data.” - Dave McCrory
[Diagram labels: NASA (200 PB), NOAA BDP, ASDI (incl. CMIP6), NCAR Datasets, etc…, Planetary Computer, NOAA BDP, Earth Engine, NOAA BDP, Descartes, Pangeo, SentinelHub, Climate Change, Atmosphere, Marine, ECMWF, DOE, XSEDE, HECC, NCAR]

Data Gravity
What is the stable steady-state solution?
[Diagram labels: DOE, XSEDE, HECC, NCAR, ?]

We Need a Global Scientific Data Commons
Edge storage, decentralized web, web3
[Diagram labels: DOE, XSEDE, HECC, NCAR, ?]

Summary
• The Pangeo approach has been embraced by both science “power users” and builders of ESD platforms.
• Analysis-Ready, Cloud-Optimized data is the foundation of performant and flexible cloud ESD platforms. Stretch goal: could all the platforms from this session share the same base data?
• We need a global scientific commons that lives outside the big cloud providers. Otherwise data gravity will suck all of science into AWS.

Ryan Abernathey
