Slide 1

Slide 1 text

Pangeo: A community and a framework for flexible, scalable open-source geoprocessing
Rich Signell, Research Oceanographer, U.S. Geological Survey
Ryan Abernathey (Columbia), Joe Hamman (NCAR), Matthew Rocklin (Anaconda->NVIDIA), Jacob Tomlinson (UK Met Office->NVIDIA), Scott Henderson (UW), and the rest of the Pangeo Community!
UNH CCOM/OE Seminar, 2020-10-16

Slide 2

Slide 2 text

U.S. Geological Survey Sediment Transport Modeling
~200 TB of coastal ocean model output data in 4D (T, Z, Y, X) NetCDF files

Slide 3

Slide 3 text

Traditional Model Data Analysis © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 4

Slide 4 text

Model Data Analysis of the Future (available now!)

Slide 5

Slide 5 text

From Daniel Arevalo's presentation on NAVGEM on the Cloud at HPC User Forum 2019

Slide 6

Slide 6 text

Pangeo is a Community

Slide 7

Slide 7 text

Pangeo is a Flexible Open-Source Framework

Slide 8

Slide 8 text

Pangeo Cloud Architecture DATA Cloud-friendly ndarray data cluster = KubeCluster() or GatewayCluster()

Slide 9

Slide 9 text

Pangeo HPC Architecture DATA ndarray data cluster = SLURMCluster() or GatewayCluster()

Slide 10

Slide 10 text

Pangeo Laptop Architecture DATA ndarray data cluster = LocalCluster()
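
The three architecture slides differ only in which Dask cluster object is created. Below is a minimal sketch of the laptop case, assuming the dask and distributed packages are installed. The function name, array shape, and chunk sizes are illustrative and not from the talk.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster


def mean_on_local_cluster(shape=(1000, 1000), chunks=(250, 250)):
    """Compute the mean of a chunked array of ones on an in-process Dask cluster."""
    # Laptop case from the slide. On cloud or HPC, only this constructor would
    # change, to one of the cluster classes named on the other architecture slides.
    cluster = LocalCluster(
        n_workers=2,
        threads_per_worker=1,
        processes=False,         # run workers as threads so the sketch works in any script
        dashboard_address=None,  # skip the diagnostics dashboard for this small demo
    )
    client = Client(cluster)     # becomes the default scheduler for compute() calls
    try:
        x = da.ones(shape, chunks=chunks)  # chunked array, built lazily
        return float(x.mean().compute())   # executed by the cluster's workers
    finally:
        client.close()
        cluster.close()
```

The analysis code inside the `try` block stays the same on every platform, which is the point the three slides make.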

Slide 11

Slide 11 text

Matthew Rocklin’s blog post on HDF

Slide 12

Slide 12 text

Zarr format
• Developed by Genomics community to address problems with NetCDF/HDF on cloud storage
• Simple format, clear specification
• Each chunk is stored as a separate binary object
• Lightweight global and variable metadata stored as JSON
• Groups, filters, compression using Blosc
• Free, open-source software
• Read/write in Python using Xarray
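
To make the "one object per chunk, metadata as JSON" layout concrete, here is a toy sketch that uses only the Python standard library. It is not the zarr library and not a complete implementation of the specification. The function names are hypothetical, the metadata keys follow the Zarr version 2 array metadata layout, and zlib stands in for Blosc only because it ships with Python.

```python
import json
import struct
import zlib


def write_toy_store(values, chunks):
    """Split a 2-D list of floats into chunks; return a dict of key -> bytes."""
    rows, cols = len(values), len(values[0])
    crows, ccols = chunks
    if rows % crows or cols % ccols:
        raise ValueError("this toy only handles shapes that divide evenly into chunks")
    # Lightweight array metadata, stored as JSON under its own key.
    store = {
        ".zarray": json.dumps({
            "zarr_format": 2,
            "shape": [rows, cols],
            "chunks": [crows, ccols],
            "dtype": "<f8",
            "compressor": {"id": "zlib", "level": 1},
            "fill_value": 0.0,
            "order": "C",
            "filters": None,
        }).encode("utf-8")
    }
    # Each chunk becomes a separate compressed binary object keyed "i.j".
    for i in range(rows // crows):
        for j in range(cols // ccols):
            block = [
                values[r][c]
                for r in range(i * crows, (i + 1) * crows)
                for c in range(j * ccols, (j + 1) * ccols)
            ]
            raw = struct.pack(f"<{len(block)}d", *block)
            store[f"{i}.{j}"] = zlib.compress(raw, 1)
    return store


def read_chunk(store, i, j):
    """Read one chunk back using only the metadata and that chunk's object."""
    meta = json.loads(store[".zarray"])
    crows, ccols = meta["chunks"]
    flat = struct.unpack(f"<{crows * ccols}d", zlib.decompress(store[f"{i}.{j}"]))
    return [list(flat[r * ccols:(r + 1) * ccols]) for r in range(crows)]


grid = [[r * 10.0 + c for c in range(6)] for r in range(4)]
store = write_toy_store(grid, chunks=(2, 3))
```

Reading a subset touches only the small metadata object and the chunks that overlap the request, so a reader on cloud object storage never has to seek inside one large file.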

Slide 13

Slide 13 text

Zarr Format

Slide 14

Slide 14 text

Zarr Format

Slide 15

Slide 15 text

Zarr is community-driven

Slide 16

Slide 16 text

NOAA’s Big Data Project
One month of National Water Model (NWM) forcing and output is 15 TB
NWM is part of the Big Data Project, with data being pushed to the Cloud:
Forecast data: s3://noaa-nwm-pds
25-year reanalysis: s3://nwm-archive
$25K research credits from Amazon to explore using Pangeo for National Water Model data

Slide 17

Slide 17 text

FUSE-mounted NetCDF/HDF is slow

Slide 18

Slide 18 text

Cloud-friendly Zarr is fast

Slide 19

Slide 19 text

So do we need to abandon NetCDF/HDF? Well, uh….

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Pangeo is not just for model data...

Slide 24

Slide 24 text

Add a scott slide, perhaps from blog

Slide 25

Slide 25 text

Pangeo is not just for geoscience data

Slide 26

Slide 26 text

Pangeo is not just for big data

Slide 27

Slide 27 text

Pangeo is award winning! https://medium.com/informatics-lab/pangeo-the-award-wining-data-platform-ddf0b55185fa

Slide 28

Slide 28 text

Pangeo on AWS
• Kubernetes cluster deployed with Amazon Elastic Container Service for Kubernetes (Amazon EKS)
• Three classes of k8s node pools
  • Core pool: JupyterHub, web proxy (small)
  • Jupyter pool: autoscaling pool for single-user sessions
  • Dask pool: autoscaling pool for Dask workers on preemptible (e.g., spot) instances
• Pangeo installed with Helm chart
• Custom Docker environments built with repo2docker at https://github.com/pangeo-data/pangeo-docker-images
• Deployed using https://github.com/pangeo-data/pangeo-cloud-federation
• Full deploy instructions at pangeo.io

Slide 29

Slide 29 text

Overcoming barriers to adoption
• Concerns about cost: changing institutional computing models, research credits, waiving egress charges for research
• New skills required: AWS workshops, hackathons, institutional road shows
• Data formats and data standardization: benchmarking, blogging

Slide 30

Slide 30 text

Interested in learning more or trying Pangeo?
• Visit pangeo.io, try out the demos in the gallery
• Read the awesome articles at medium.com/pangeo
• Chat with the team on gitter.im/pangeo-data
• Install the Pangeo environment on your local computer or HPC
• Run a Pangeo JupyterHub on AWS
• Rechunk your data with rechunker
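
Why rechunking is worth the effort can be seen with simple arithmetic: the number of chunk objects a read must fetch depends on how the chunk shape lines up with the access pattern. The sketch below counts chunks touched by a selection. The helper name, array shape, and chunk shapes are hypothetical and not from the talk, and it does not use the rechunker library itself.

```python
def chunks_touched(selection, chunks):
    """Count the chunks that overlap a selection.

    selection: one (start, stop) index pair per dimension, stop exclusive.
    chunks: chunk length along each dimension.
    """
    count = 1
    for (start, stop), size in zip(selection, chunks):
        first = start // size        # index of the first chunk overlapped
        last = (stop - 1) // size    # index of the last chunk overlapped
        count *= last - first + 1
    return count


# Hypothetical (time, y, x) array of shape (720, 1000, 1000).
map_chunks = (1, 1000, 1000)      # one chunk per time step
series_chunks = (720, 100, 100)   # all time steps for a small tile

time_series = [(0, 720), (500, 501), (500, 501)]  # every time step at one grid point
single_map = [(10, 11), (0, 1000), (0, 1000)]     # the full grid at one time step

print(chunks_touched(time_series, map_chunks))     # 720
print(chunks_touched(time_series, series_chunks))  # 1
print(chunks_touched(single_map, map_chunks))      # 1
print(chunks_touched(single_map, series_chunks))   # 100
```

A chunk shape that suits map views can make a time-series extraction hundreds of times more expensive, and the reverse holds too, which is the situation a rechunking tool is meant to address.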