
Pangeo: A Scalable, Flexible Framework for Geoprocessing

Rich Signell
October 16, 2020


Pangeo talk at UNH CCOM/OE Department Seminar




  1. Pangeo: A community and a framework for flexible, scalable open-source geoprocessing
    Rich Signell, Research Oceanographer, U.S. Geological Survey
    Ryan Abernathey (Columbia), Joe Hamman (NCAR), Matthew Rocklin (Anaconda->NVIDIA), Jacob Tomlinson (UK Met Office->NVIDIA), Scott Henderson (UW) and the rest of the Pangeo Community!
    UNH CCOM/OE Seminar 2020-10-16
  2. U.S. Geological Survey Sediment Transport Modeling
    ~200TB of coastal ocean model output data in 4D (T, Z, Y, X) NetCDF files
  3. Traditional Model Data Analysis
    © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  4. Model Data Analysis of the Future (available now!)
  5. From Daniel Arevalo presentation on NAVGEM on the Cloud at HPC User Forum 2019
  6. Pangeo is a Community

  7. Pangeo is a Flexible Open-Source Framework

  8. Pangeo Cloud Architecture DATA Cloud-friendly ndarray data cluster = KubeCluster() or GatewayCluster()
  9. Pangeo HPC Architecture DATA ndarray data cluster = SlurmCluster() or

  10. Pangeo Laptop Architecture DATA ndarray data cluster = LocalCluster()

  11. Matthew Rocklin’s blog post on HDF

  12. Zarr format
    • Developed by Genomics community to address problems with NetCDF/HDF on cloud storage
    • Simple format, clear specification
    • Each chunk is stored as a separate binary object
    • Lightweight global and variable metadata stored as JSON
    • Groups, filters, compression using Blosc
    • Free, open-source software
    • Read/write in Python using Xarray
  13. Zarr Format

  14. Zarr Format

  15. Zarr is community-driven

  16. NOAA’s Big Data Project
    One month of forcing and output is 15TB
    NWM is part of the Big Data Project, with data being pushed to the Cloud:
      Forecast data: s3:noaa-nwm-pds
      25 year reanalysis: s3:nwm-archive
    $25K research credits from Amazon to explore using Pangeo for National Water Model data
  17. FUSE-mounted NetCDF/HDF is slow

  18. Cloud-friendly Zarr is fast

  19. So do we need to abandon NetCDF/HDF? Well, uh….

  20. None
  21. None
  22. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  23. Pangeo is not just for model data...
  24. Add a scott slide, perhaps from blog

  25. Pangeo is not just for geoscience data

  26. Pangeo is not just for big data

  27. Pangeo is award winning! https://medium.com/informatics-lab/pangeo-the-award-wining-data-platform-ddf0b55185fa

  28. Pangeo on AWS
    • Kubernetes cluster deployed with Amazon Elastic Container Service for Kubernetes (Amazon EKS)
    • Three classes of k8s node pools
      • Core pool: JupyterHub, web proxy (small)
      • Jupyter pool: autoscaling pool for single-user sessions
      • Dask pool: autoscaling pool for Dask workers on preemptible (e.g., spot) instances
    • Pangeo installed with Helm chart
    • Custom Docker environments built with repo2docker at https://github.com/pangeo-data/pangeo-docker-images
    • Deployed using https://github.com/pangeo-data/pangeo-cloud-federation
    • Full deploy instructions at pangeo.io
  29. Overcoming barriers to adoption
    • Concerns about cost: Changing institutional computing models, research credits, waiving egress charges for research
    • New skills required: AWS workshops, hackathons, institutional road shows
    • Data formats and data standardization: benchmarking, blogging
  30. Interested in learning more or trying Pangeo?
    • Visit pangeo.io, try out the demos in the gallery
    • Read the awesome articles at medium.com/pangeo
    • Chat with the team on gitter.im/pangeo-data
    • Install the Pangeo environment on your local computer or HPC
    • Run a Pangeo JupyterHub on AWS
    • Rechunk your data with rechunker
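
Slide 12 describes Zarr as storing each chunk as a separate binary object, with lightweight metadata kept as JSON. The sketch below is a toy illustration of that layout using only the Python standard library. It is not the zarr package and not the deck's own code: the function names `write_toy_store` and `read_chunk` are hypothetical, a plain dict stands in for an object store such as S3, and zlib stands in for Blosc. The `.zarray` key and the `i.j` chunk keys follow the naming used by the Zarr v2 layout, but the metadata here is deliberately incomplete.

```python
import json
import struct
import zlib
from itertools import product


def write_toy_store(data, chunk_shape):
    """Split a 2D list of floats into chunks, one compressed object per chunk."""
    n_rows, n_cols = len(data), len(data[0])
    c_rows, c_cols = chunk_shape
    store = {}  # stands in for an object store: key -> bytes

    # Array-level metadata lives in one small JSON document.
    store[".zarray"] = json.dumps(
        {"shape": [n_rows, n_cols], "chunks": [c_rows, c_cols], "dtype": "<f8"}
    ).encode()

    # Number of chunks along each axis (ceiling division).
    grid = (-(-n_rows // c_rows), -(-n_cols // c_cols))
    for i, j in product(range(grid[0]), range(grid[1])):
        rows = data[i * c_rows:(i + 1) * c_rows]
        block = [v for row in rows for v in row[j * c_cols:(j + 1) * c_cols]]
        raw = struct.pack(f"<{len(block)}d", *block)
        store[f"{i}.{j}"] = zlib.compress(raw)  # one object per chunk
    return store


def read_chunk(store, i, j):
    """Fetch and decode a single chunk without touching any other key."""
    raw = zlib.decompress(store[f"{i}.{j}"])
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# A 4 x 6 array split into 2 x 3 chunks gives a 2 x 2 grid of chunk objects.
data = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
store = write_toy_store(data, chunk_shape=(2, 3))
print(sorted(store))
print(read_chunk(store, 1, 1))
```

Because every chunk is its own key, a reader that needs only one region of the array fetches only the objects covering it, and many workers can fetch different chunks at the same time. That independence is the property the deck contrasts with reading NetCDF/HDF through a FUSE mount.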