Dask and Pangeo (A community platform for Big Data Geoscience)

Anderson Banihirwe

January 19, 2022


  1. Dask and Pangeo (A community platform for Big Data Geoscience)

    Rich Signell (USGS), Anderson Banihirwe (NCAR), Ryan Abernathy (Columbia), Joe Hamman (NCAR), Matthew Rocklin (Anaconda->NVIDIA->Coiled Computing), Niall Robinson (UK Met Office Informatics Lab), Jacob Tomlinson (UK Met Office->NVIDIA), Scott Henderson (UW), and the rest of the Pangeo Community!
    Dask Developer Workshop - Feb 27, 2020
  2. USGS Sediment Transport Modeling

    Wind and Waves • Water Levels • Sea Bed Stress • Sea Bed Erosion
    COAWST Modeling System (John Warner, USGS)
    ~200TB of coastal ocean model output data in 4D (T, Z, Y, X) NetCDF files
  3. Pangeo Core Architecture

    DATA: cloud-friendly ndarray data
    dask.distributed, dask-jobqueue, dask-mpi, dask-kubernetes, dask-cloudprovider, dask-gateway
    LocalCluster() SLURMCluster() KubeCluster() FargateCluster()
    https://medium.com/pangeo
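
The cluster classes listed on this slide share a common pattern: create a cluster object, then connect a `Client` to it. A minimal sketch of that pattern using `LocalCluster`, assuming `dask` and `distributed` are installed; the helper name `mean_on_local_cluster` and the array sizes are illustrative and not taken from the deck:

```python
import dask.array as da
from dask.distributed import Client, LocalCluster


def mean_on_local_cluster():
    """Start a small in-process cluster, run one computation, shut down."""
    # processes=False keeps the workers as threads in this process, which
    # makes the sketch cheap to run; dashboard_address=None disables the
    # diagnostics dashboard.
    with LocalCluster(
        n_workers=1,
        threads_per_worker=2,
        processes=False,
        dashboard_address=None,
    ) as cluster:
        with Client(cluster):
            # A chunked array: 16 blocks of 250 x 250 elements.
            x = da.ones((1000, 1000), chunks=(250, 250))
            # compute() sends the task graph to the cluster's scheduler.
            return float(x.mean().compute())


result = mean_on_local_cluster()
print(result)
```

On an HPC or cloud system, the `LocalCluster(...)` line is the part that would change to one of the other cluster classes, while the `Client` and array code would stay largely the same.
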
  4. xarray: N-D labeled arrays

    time, longitude, latitude, elevation
    Data variables: used for computation
    Coordinates: describe data
    Indexes: align data
    Attributes: metadata ignored by operations
    + land_cover SPARSE
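
To make the vocabulary on this slide concrete, here is a small sketch of a labeled array, assuming `numpy` and `xarray` are installed. The variable name `temperature`, the coordinate values, and the `units` attribute are made up for illustration:

```python
import numpy as np
import xarray as xr

# Data variable: the values used for computation, with named dimensions.
temperature = xr.DataArray(
    np.arange(24, dtype="float64").reshape(2, 3, 4),
    dims=("time", "latitude", "longitude"),
    # Coordinates describe the data along each dimension.
    coords={
        "time": np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]"),
        "latitude": [10.0, 20.0, 30.0],
        "longitude": [100.0, 110.0, 120.0, 130.0],
    },
    # Attributes carry free-form metadata.
    attrs={"units": "degC"},
    name="temperature",
)

# Label-based selection uses the indexes built from the coordinates.
point = temperature.sel(latitude=20.0, longitude=110.0)

# Reductions are expressed by dimension name rather than axis number.
time_mean = temperature.mean(dim="time")

print(point.values)
print(time_mean.dims)
```
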
  5. Dask & Xarray

    • Xarray objects are dask collections
      ◦ Xarray variables can include dask arrays
      ◦ map_blocks allows xarray objects to be the primary dask collections
    • High-level metadata-aware interfaces to dask:
      ◦ xr.apply_ufunc()
      ◦ xr.map_blocks()
    • File I/O: Dask lets xarray read and write in parallel via open_mfdataset(), to_netcdf(), open_zarr(), and to_zarr().
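
The two interfaces named on this slide can be sketched as follows, assuming `numpy`, `xarray`, and `dask` are installed. The array `temp` and the helper functions `to_kelvin` and `demean` are hypothetical examples, not code from the deck:

```python
import numpy as np
import xarray as xr

# A small DataArray backed by a dask array, with two chunks along "time".
temp = xr.DataArray(
    np.arange(12, dtype="float64").reshape(4, 3),
    dims=("time", "x"),
    coords={"time": np.arange(4), "x": [10, 20, 30]},
    name="temp",
).chunk({"time": 2})


def to_kelvin(values):
    # Receives plain NumPy blocks; xarray re-attaches the labels afterwards.
    return values + 273.15


# apply_ufunc wraps a NumPy-level function and applies it lazily per chunk.
kelvin = xr.apply_ufunc(
    to_kelvin, temp, dask="parallelized", output_dtypes=[temp.dtype]
)


def demean(block):
    # Receives an xarray object per chunk, so dimension names are available.
    return block - block.mean(dim="x")


# map_blocks applies an xarray-level function to each chunk; the template
# tells xarray what the result looks like without computing it.
anomalies = xr.map_blocks(demean, temp, template=temp)

print(kelvin.compute().values[0])
print(anomalies.compute().values[0])
```

Both results stay lazy until `compute()` is called. The rough design split is that `apply_ufunc` suits functions written against NumPy arrays, while `map_blocks` suits functions that need the labels.
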
  6. Issues affecting Pangeo

    • Running out of memory when rechunking with dask (workaround: xr.to_zarr(append=True))
    • Memory backpressure issue (D.E. Shaw’s graph manipulation tools!)
    • Dask-cloudprovider is very attractive to orgs like USGS: FargateCluster “rate exceeded” issue
    • Community understanding of chunking impact on use
    • Dask performance challenges, e.g. pangeo/#194, dask/#3595
      ◦ More work on graph optimization, high-level graphs, task fusion, etc.
    • Dask deployment: more work on enabling heterogeneous worker pools, harmonization among systems, etc.
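
As a rough illustration of why chunking matters, the sketch below compares two chunk layouts of the same array using only `dask.array`. The shapes and chunk sizes are arbitrary assumptions chosen for the example; nothing here is computed, only the task graphs are inspected, and it does not reproduce the `to_zarr` workaround mentioned above:

```python
import math

import dask.array as da

# Layout A: one chunk per time step, typical of model output written
# one time slice at a time (dimensions: time, y, x).
by_time = da.zeros((8, 1000, 1000), chunks=(1, 1000, 1000))

# Layout B: the full time axis in every chunk, small spatial tiles,
# which suits extracting a time series at a point.
by_space = by_time.rechunk((8, 100, 100))


def chunk_megabytes(arr):
    """Size in MB of the largest chunk of a dask array."""
    return math.prod(arr.chunksize) * arr.dtype.itemsize / 1e6


def graph_size(arr):
    """Number of tasks in the array's task graph."""
    return len(arr.__dask_graph__())


print(by_time.numblocks, chunk_megabytes(by_time), graph_size(by_time))
print(by_space.numblocks, chunk_megabytes(by_space), graph_size(by_space))
```

The rechunked array carries a larger task graph than the original because every output chunk has to be assembled from pieces of several input chunks, which is the kind of overhead that can grow into the memory and performance issues listed on this slide.
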