Pangeo - NASA ESD Workshop

Talk given at NASA Earth Systems Division Analysis and Information Products Working Group

Ryan Abernathey

November 16, 2021

Transcript

  1. Pangeo
    NASA ESD Workshop
    Nov. 8, 2020
    Thanks to our funders!

  2. Two Papers
    https://doi.org/10.1029/2020AV000354
    https://doi.org/10.1109/MCSE.2021.3059437

  3. Sample of existing ESD platforms
    relevant to Analysis and Information
    Products, including lessons learned,
    progress and future plans.

  4. What is Pangeo?
    • Community obsessed with efficient data processing.
    Founded in 2016. Scientists and software developers coming together. http://pangeo.io/
    Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
    • Interoperable Software
    Foundation in Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable
    packages for analysis, visualization, and machine learning.
    • Data and Computing Infrastructure
    Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders
    for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public
    cloud (GCS, AWS) and OpenStorageNetwork.

  5. The Pangeo Cloud Stack
    [Diagram: a Cloud Object Store holding metadata (.zattrs) and chunks (0.0, 1.0, 2.0), read over HTTP GET by a Cloud Compute Cluster made up of a Jupyter pod and three Dask workers]
    http://pangeo.io/cloud.html
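
The chunk labels in the diagram (0.0, 1.0, 2.0) follow the Zarr v2 convention of naming each chunk by its position in the chunk grid, joined with dots. Below is a minimal sketch of that key arithmetic using only the standard library. The array shape and chunk shape are assumptions chosen to reproduce the three keys on the slide, and the helper names are hypothetical, not part of Zarr's API.

```python
from itertools import product


def chunk_key(index, chunks, sep="."):
    """Return the key of the chunk that contains the element at `index`."""
    return sep.join(str(i // c) for i, c in zip(index, chunks))


def all_chunk_keys(shape, chunks, sep="."):
    """List every chunk key for an array of `shape` split into `chunks`."""
    # Ceiling division gives the number of chunks along each axis.
    counts = [-(-s // c) for s, c in zip(shape, chunks)]
    grid = product(*(range(n) for n in counts))
    return [sep.join(str(i) for i in position) for position in grid]


# Assumed layout: a 30 x 10 array stored as three 10 x 10 chunks.
shape, chunks = [30, 10], [10, 10]
keys = all_chunk_keys(shape, chunks)
print(keys)
print(chunk_key((25, 3), chunks))
```

Because each key names one object in the store, a worker that needs element (25, 3) only has to request the single object for that chunk, which is what lets many Dask workers read in parallel with independent HTTP GET requests.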

  6. Lesson 1: The Pangeo Approach Works
    • Sustainability: We didn’t build very much new stuff; we just helped existing, community-developed tools work together. Open and community-driven from day 1.
    • Simplicity: “Power users” always just want direct, data-proximate access to the raw data.
    • Modularity: The same stack is an effective base layer for apps / dashboards / APIs, etc.

  7. Agencies Using Pangeo
    Recent talks from the Pangeo Showcase Seminar: https://zenodo.org/communities/pangeo/

  8. Lesson 2: Analysis-Ready, Cloud-Optimized (ARCO) Data is Great
    • Good ARCO data (Zarr, TileDB, Parquet) + S3 obviates the need for some APIs / services.
    • Legacy formats (netCDF / HDF5 / GRIB) don’t always play well with object storage.
    • ARCO data production takes time and skill.

  9. ARCO Data + HTTP (S3) Is More Performant and Flexible than a Bespoke API
    https://xpublish.readthedocs.io/

    Serve dynamically generated Zarr data over HTTP.
    Client can’t tell the difference.
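
The point that a client cannot distinguish stored data from generated data can be illustrated with a plain key-value interface. This is a sketch only: it does not use xpublish or Zarr, the chunk layout (four float64 values per key) is an assumption, and `make_chunk`, `DynamicStore`, and `client_read` are hypothetical names.

```python
import struct
from collections.abc import Mapping


def make_chunk(i, n=4):
    """Pack n little-endian float64 values for chunk i (made-up data)."""
    return struct.pack(f"<{n}d", *(float(i * n + j) for j in range(n)))


class DynamicStore(Mapping):
    """Read-only store whose chunk bytes are computed when requested."""

    def __init__(self, n_chunks):
        self._keys = [str(i) for i in range(n_chunks)]

    def __getitem__(self, key):
        if key not in self._keys:
            raise KeyError(key)
        return make_chunk(int(key))  # nothing is stored ahead of time

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)


# A "static" store with precomputed bytes, and a dynamic one.
static_store = {str(i): make_chunk(i) for i in range(3)}
dynamic_store = DynamicStore(3)


def client_read(store, key):
    """The client only knows the key-value interface, not how bytes are made."""
    return struct.unpack("<4d", store[key])


print(client_read(static_store, "2") == client_read(dynamic_store, "2"))
```

The client code is identical for both stores, which is the property that lets a server compute chunks on demand behind the same HTTP layout as a bucket of stored objects.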

  10. Pangeo Forge: Democratizing ARCO Data Production
    https://pangeo-forge.org/
    An open source platform for creating ARCO datasets.
    We crowdsource “recipes” for ARCO data from the
    global science community. Cloud automation builds
    the datasets in a scalable and reproducible way.

  11. Kerchunk: Make Your Legacy Data Look and Feel Like Zarr
    • Provides a unified way to represent a variety of
    chunked, compressed binary data formats
    (e.g. NetCDF/HDF5, GRIB2, TIFF, …)
    • Allows efficient access to data from traditional file
    systems or cloud object storage.
    • Create virtual datasets from multiple files by
    extracting the byte ranges, compression information
    etc. and storing this metadata in a new, separate
    object.
    • Open spec, Python implementation.
    https://fsspec.github.io/kerchunk/
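
The byte-range idea in the bullets above can be sketched without any third-party library. This is an illustration of the concept, not kerchunk's API: the in-memory "legacy file", the reference layout (key mapped to location, offset, length), and the names `references`, `open_file`, and `read_chunk` are all assumptions made for the example.

```python
import io

# Stand-in for a legacy binary file: a header followed by two data chunks.
legacy_file = b"HEADER" + b"aaaa" + b"bbbbbb"

# Reference set stored separately from the file itself:
# chunk key -> [file location, byte offset, byte length].
references = {
    "var/0": ["legacy.bin", 6, 4],
    "var/1": ["legacy.bin", 10, 6],
}


def open_file(path):
    """Hypothetical opener; a real one would open a local or remote file."""
    return io.BytesIO(legacy_file)


def read_chunk(key, refs, opener):
    """Fetch one chunk by seeking to its recorded byte range."""
    path, offset, length = refs[key]
    with opener(path) as f:
        f.seek(offset)
        return f.read(length)


print(read_chunk("var/0", references, open_file))
print(read_chunk("var/1", references, open_file))
```

The original file is never rewritten; only the small reference object is new, and a reader that understands it can address the legacy bytes chunk by chunk.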

  12. Lesson Learned: Data Gravity
    “Data gravity is the ability of a body of data to attract applications, services and other data.” - Dave McCrory
    [Diagram labels, in slide order: NASA (200 PB), NOAA BDP, ASDI (incl. CMIP6), NCAR Datasets, etc…; Planetary Computer, NOAA BDP; Earth Engine, NOAA BDP; Descartes; Pangeo; SentinelHub; Climate Change, Atmosphere, Marine; ECMWF; DOE, XSEDE, HECC, NCAR]

  13. Data Gravity
    What is the stable steady-state solution?
    [Diagram labels: DOE, XSEDE, HECC, NCAR, ?]

  14. We Need a Global Scientific Data Commons
    Edge storage, decentralized web, web3
    [Diagram labels: DOE, XSEDE, HECC, NCAR, ?]

  15. Summary
    • The Pangeo approach has been embraced by both science “power users” and builders of ESD platforms.
    • Analysis-Ready, Cloud-Optimized data is the foundation of performant and flexible cloud ESD platforms.
    Stretch goal: could all the platforms from this session share the same base data?
    • We need a global scientific commons that lives outside the big cloud providers. Otherwise data gravity will suck all of science into AWS.
