Pangeo: A Scalable, Flexible Framework for Geoprocessing

Pangeo talk at UNH CCOM/OE Department Seminar

Rich Signell

October 16, 2020

Transcript

  1. Pangeo: A community and a framework for flexible, scalable open-source

    geoprocessing. Rich Signell, Research Oceanographer, U.S. Geological Survey. With Ryan Abernathey (Columbia), Joe Hamman (NCAR), Matthew Rocklin (Anaconda -> NVIDIA), Jacob Tomlinson (UK Met Office -> NVIDIA), Scott Henderson (UW), and the rest of the Pangeo community! UNH CCOM/OE Seminar, 2020-10-16
  2. U.S. Geological Survey Sediment Transport Modeling ~200TB of coastal ocean

    model output data in 4D (T, Z, Y, X) NetCDF files
  3. Traditional Model Data Analysis © 2019, Amazon Web Services, Inc.

    or its affiliates. All rights reserved.
  4. Model Data Analysis of the Future (available now!)
  5. From Daniel Arevalo presentation on NAVGEM on the Cloud at

    HPC User Forum 2019
  6. Pangeo is a Community

  7. Pangeo is a Flexible Open-Source Framework

  8. Pangeo Cloud Architecture DATA Cloud-friendly ndarray data cluster = KubeCluster()

    or GatewayCluster()
  9. Pangeo HPC Architecture DATA ndarray data cluster = SLURMCluster() or

    GatewayCluster()
  10. Pangeo Laptop Architecture DATA ndarray data cluster = LocalCluster()
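
    The point of the three architecture slides is that only the line creating the cluster changes between laptop, HPC, and cloud; the analysis code is identical. A minimal sketch, assuming `dask` and `distributed` are installed (the HPC and cloud constructors are shown commented out, since they additionally require dask-jobqueue, dask-kubernetes, or dask-gateway):

    ```python
    import dask.array as da
    from dask.distributed import Client, LocalCluster

    # Laptop: an in-process local cluster.
    cluster = LocalCluster(processes=False)

    # On HPC or cloud, only this one line changes, e.g.:
    #   from dask_jobqueue import SLURMCluster;  cluster = SLURMCluster(...)
    #   from dask_kubernetes import KubeCluster; cluster = KubeCluster(...)
    #   from dask_gateway import Gateway;        cluster = Gateway().new_cluster()

    client = Client(cluster)

    # The analysis code is the same on every backend: a lazy,
    # chunked n-dimensional array, reduced in parallel.
    x = da.random.random((4000, 4000), chunks=(1000, 1000))
    result = x.mean().compute()
    print(result)  # ≈ 0.5

    client.close()
    cluster.close()
    ```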

  11. Matthew Rocklin’s blog post on HDF

  12. Zarr format
    • Developed by the genomics community to address problems with NetCDF/HDF on cloud storage
    • Simple format, clear specification
    • Each chunk is stored as a separate binary object
    • Lightweight global and variable metadata stored as JSON
    • Groups, filters, and compression using Blosc
    • Free, open-source software
    • Read/write in Python using Xarray
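
    The properties listed on this slide can be seen directly in the zarr-python API. A minimal sketch, assuming the `zarr` package is installed (the attribute names are illustrative):

    ```python
    import zarr

    # A chunked 2-D array: each (500, 500) chunk is stored as a
    # separate binary object, compressed with Blosc by default.
    z = zarr.zeros((2000, 2000), chunks=(500, 500), dtype="f4")
    z[:] = 1.0

    # Lightweight metadata travels with the array as JSON-serializable attrs.
    z.attrs["units"] = "m s-1"
    z.attrs["long_name"] = "eastward_sea_water_velocity"

    print(z.shape, z.chunks)  # (2000, 2000) (500, 500)
    ```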
  13. Zarr Format

  14. Zarr Format

  15. Zarr is community-driven

  16. NOAA’s Big Data Project
    • One month of NWM forcing and output is 15TB
    • The NWM is part of the Big Data Project, with data being pushed to the Cloud:
      Forecast data: s3://noaa-nwm-pds
      25-year reanalysis: s3://nwm-archive
    • $25K in research credits from Amazon to explore using Pangeo for National Water Model data
  17. FUSE-mounted NetCDF/HDF is slow

  18. Cloud-friendly Zarr is fast

  19. So do we need to abandon NetCDF/HDF? Well, uh….
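
    The practical answer suggested by slides 17-19: keep producing NetCDF, but write analysis-ready copies to Zarr for cloud storage. A minimal conversion sketch with xarray (assuming `xarray`, `zarr`, `dask`, and `numpy` are installed; the variable name and store path are hypothetical, and in practice the dataset would come from `xr.open_dataset()` on a NetCDF file):

    ```python
    import os
    import tempfile

    import numpy as np
    import xarray as xr

    # A small 4-D (T, Z, Y, X) dataset standing in for model output.
    ds = xr.Dataset(
        {"temp": (("time", "z", "y", "x"),
                  np.random.rand(4, 3, 8, 8).astype("f4"))},
        coords={"time": np.arange(4)},
    )

    # Write a chunked, cloud-friendly Zarr store and read it back lazily.
    store = os.path.join(tempfile.mkdtemp(), "model.zarr")
    ds.chunk({"time": 1}).to_zarr(store)
    roundtrip = xr.open_zarr(store)
    print(roundtrip["temp"].shape)  # (4, 3, 8, 8)
    ```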

  23. Pangeo is not just for model data...
  24. Add a Scott slide, perhaps from blog

  25. Pangeo is not just for geoscience data

  26. Pangeo is not just for big data

  27. Pangeo is award-winning! https://medium.com/informatics-lab/pangeo-the-award-wining-data-platform-ddf0b55185fa

  28. Pangeo on AWS
    • Kubernetes cluster deployed with Amazon Elastic Container Service for Kubernetes (Amazon EKS)
    • Three classes of k8s node pools:
      • Core pool: JupyterHub, web proxy (small)
      • Jupyter pool: autoscaling pool for single-user sessions
      • Dask pool: autoscaling pool for Dask workers on preemptible (e.g., spot) instances
    • Pangeo installed with a Helm chart
    • Custom Docker environments built with repo2docker at https://github.com/pangeo-data/pangeo-docker-images
    • Deployed using https://github.com/pangeo-data/pangeo-cloud-federation
    • Full deployment instructions at pangeo.io
  29. Overcoming barriers to adoption
    • Concerns about cost: changing institutional computing models, research credits, waiving egress charges for research
    • New skills required: AWS workshops, hackathons, institutional road shows
    • Data formats and data standardization: benchmarking, blogging
  30. Interested in learning more or trying Pangeo?
    • Visit pangeo.io and try out the demos in the gallery
    • Read the awesome articles at medium.com/pangeo
    • Chat with the team on gitter.im/pangeo-data
    • Install the Pangeo environment on your local computer or HPC
    • Run a Pangeo JupyterHub on AWS
    • Rechunk your data with rechunker