Pangeo: A Scalable, Flexible Framework for Geoprocessing

Rich Signell
October 16, 2020

Pangeo talk at UNH CCOM/OE Department Seminar


Transcript

  1. Pangeo: A community and a framework for flexible, scalable open-source geoprocessing

    Rich Signell, Research Oceanographer, U.S. Geological Survey
    Ryan Abernathey (Columbia), Joe Hamman (NCAR), Matthew Rocklin (Anaconda->NVIDIA), Jacob Tomlinson (UK Met Office->NVIDIA), Scott Henderson (UW), and the rest of the Pangeo Community!
    UNH CCOM/OE Seminar, 2020-10-16
  2. U.S. Geological Survey Sediment Transport Modeling

    ~200TB of coastal ocean model output data in 4D (T, Z, Y, X) NetCDF files
  3. Model Data Analysis of the Future (available now!)

    © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  4. Zarr format

    • Developed by Genomics community to address problems with NetCDF/HDF on cloud storage
    • Simple format, clear specification
    • Each chunk is stored as a separate binary object
    • Lightweight global and variable metadata stored as JSON
    • Groups, filters, compression using Blosc
    • Free, open-source software
    • Read/write in Python using Xarray
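
The "one object per chunk, metadata as JSON" layout can be sketched in a few lines of standard-library Python. This is an illustration only, not the Zarr library itself: the helper name `zarr_v2_layout`, the array shape, the chunk sizes, and the Blosc settings are all assumptions chosen for the example. It follows the Zarr version 2 convention of a `.zarray` JSON document per array plus chunk keys formed from dot-separated chunk indices.

```python
import itertools
import json
import math


def zarr_v2_layout(shape, chunks, dtype="<f4"):
    """Return (.zarray-style metadata dict, list of chunk keys) for one array.

    Hypothetical helper for illustration; a real store is written by the
    zarr library (for example through xarray), not by hand.
    """
    metadata = {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(chunks),
        "dtype": dtype,
        # Example Blosc settings; the values here are assumptions.
        "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
        "fill_value": None,
        "filters": None,
        "order": "C",
    }
    # Number of chunks along each dimension (the last chunk may be partial).
    grid = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    # Each chunk becomes its own object, keyed by its index in the chunk grid.
    keys = [
        ".".join(str(i) for i in index)
        for index in itertools.product(*(range(g) for g in grid))
    ]
    return metadata, keys


# Hypothetical 4D (T, Z, Y, X) model variable.
meta, keys = zarr_v2_layout(shape=(24, 10, 200, 300), chunks=(12, 10, 100, 100))
print(json.dumps(meta, indent=2))
print(len(keys), "chunk objects, e.g.", keys[:3])
```

Because every chunk is an independent object, a cloud object store can serve many chunks in parallel, and a reader only needs the small JSON document to locate the bytes it wants.
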
  5. NOAA’s Big Data Project

    One month of forcing and output is 15TB
    NWM is part of the Big Data Project, with data being pushed to the Cloud:
    Forecast data: s3://noaa-nwm-pds
    25 year reanalysis: s3://nwm-archive
    $25K research credits from Amazon to explore using Pangeo for National Water Model data
  6. Pangeo is not just for model data...
  7. Pangeo on AWS

    • Kubernetes cluster deployed with Amazon Elastic Container Service for Kubernetes (Amazon EKS)
    • Three classes of k8s node pools
      • Core pool: JupyterHub, web proxy (small)
      • Jupyter pool: autoscaling pool for single-user sessions
      • Dask pool: autoscaling pool for Dask workers on preemptible (e.g., spot) instances
    • Pangeo installed with Helm chart
    • Custom Docker environments built with repo2docker at https://github.com/pangeo-data/pangeo-docker-images
    • Deployed using https://github.com/pangeo-data/pangeo-cloud-federation
    • Full deploy instructions at pangeo.io
  8. Overcoming barriers to adoption

    • Concerns about cost: Changing institutional computing models, research credits, waiving egress charges for research
    • New skills required: AWS workshops, hackathons, institutional road shows
    • Data formats and data standardization: benchmarking, blogging
  9. Interested in learning more or trying Pangeo?

    • Visit pangeo.io, try out the demos in the gallery
    • Read the awesome articles at medium.com/pangeo
    • Chat with the team on gitter.im/pangeo-data
    • Install the Pangeo environment on your local computer or HPC
    • Run a Pangeo JupyterHub on AWS
    • Rechunk your data with rechunker
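
Why rechunking matters can be shown with simple counting. The sketch below does not call rechunker; it only counts how many chunks a read has to touch under two chunk layouts. The function name `chunks_touched`, the (time, y, x) array shape, and both chunk layouts are hypothetical values picked for illustration.

```python
def chunks_touched(selection, chunks):
    """Count the chunks that intersect a selection.

    selection: one (start, stop) pair per dimension, stop exclusive.
    chunks: chunk length per dimension.
    """
    count = 1
    for (start, stop), chunk_len in zip(selection, chunks):
        first = start // chunk_len        # index of first chunk hit
        last = (stop - 1) // chunk_len    # index of last chunk hit
        count *= last - first + 1
    return count


# Hypothetical (time, y, x) array: hourly output for one year on a 1000 x 1000 grid.
point_series = [(0, 8760), (500, 501), (200, 201)]   # full time series at one grid cell
single_map = [(100, 101), (0, 1000), (0, 1000)]      # full map at one time step

map_friendly = (1, 1000, 1000)     # one chunk per time step
series_friendly = (8760, 10, 10)   # all times for a small spatial tile

for name, layout in [("map-friendly", map_friendly), ("series-friendly", series_friendly)]:
    print(
        name,
        "| time series reads", chunks_touched(point_series, layout), "chunks",
        "| single map reads", chunks_touched(single_map, layout), "chunks",
    )
```

With these assumed sizes, a layout that makes maps cheap makes a point time series touch every time-step chunk, and the reverse layout has the opposite trade-off. Matching the chunk layout to the dominant access pattern is the reason to rechunk a dataset.
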