Dask and Pangeo (A community platform for Big Data Geoscience)

Anderson Banihirwe

January 19, 2022

    Rich Signell (USGS), Anderson Banihirwe (NCAR), Ryan Abernathey (Columbia), Joe Hamman (NCAR), Matthew Rocklin (Anaconda->NVIDIA->Coiled Computing), Niall Robinson (UK Met Office Informatics Lab), Jacob Tomlinson (UK Met Office->NVIDIA), Scott Henderson (UW), and the rest of the Pangeo Community!

    Dask Developer Workshop - Feb 27, 2020
  3. USGS Sediment Transport Modeling

    Wind and Waves; Water Levels; Sea Bed Stress; Sea Bed Erosion. COAWST Modeling System (John Warner, USGS): ~200 TB of coastal ocean model output data in 4D (T, Z, Y, X) NetCDF files.
  4. Pangeo is a Community

  5. Pangeo Core Architecture

    DATA: cloud-friendly ndarray data. Dask: dask.distributed, dask-jobqueue, dask-mpi, dask-kubernetes, dask-cloudprovider, dask-gateway. Cluster classes: LocalCluster(), SLURMCluster(), KubeCluster(), FargateCluster(). https://medium.com/pangeo
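The cluster classes named on this slide follow one pattern: create a cluster object, then connect a Client to it. The following is a minimal sketch of that pattern using LocalCluster, assuming `dask` and `distributed` are installed. The array shape, chunk sizes, and the function name `mean_on_local_cluster` are illustrative and not taken from the slides.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster


def mean_on_local_cluster():
    """Compute the mean of a chunked array on an in-process Dask cluster."""
    # processes=False keeps the workers as threads inside this process;
    # dashboard_address=None disables the diagnostics dashboard.
    with LocalCluster(
        n_workers=1,
        threads_per_worker=2,
        processes=False,
        dashboard_address=None,
    ) as cluster:
        with Client(cluster):
            # 16 chunks of 250 x 250; nothing runs until .compute() is called.
            x = da.ones((1000, 1000), chunks=(250, 250))
            return float(x.mean().compute())


result = mean_on_local_cluster()
```

On an HPC system or on Kubernetes, the same Client code would typically be pointed at a different cluster object, whose constructor arguments depend on the site and are not shown here.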
  6. Reading cloud-optimized data with Dask is fast

  7. Holoviz trimesh rendering

  8. Dask trimesh rendering

  9. Dask trimesh rendering

  10. Datashader library support

  11. xarray: N-D labeled arrays

    Figure labels: time, longitude, latitude, elevation, + land_cover, SPARSE
    • Data variables: used for computation
    • Coordinates: describe data
    • Indexes: align data
    • Attributes: metadata ignored by operations
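As a rough illustration of the roles listed on this slide, the sketch below builds a small labeled array and uses its coordinates and indexes. It assumes `numpy` and `xarray` are installed; the variable names, values, and the `degC` unit are made up for the example.

```python
import numpy as np
import xarray as xr

# A data variable with named dimensions, coordinate labels, and attributes.
temperature = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("time", "longitude"),
    coords={"time": [0, 1, 2], "longitude": [10.0, 20.0, 30.0, 40.0]},
    attrs={"units": "degC"},  # metadata only; it does not enter the arithmetic
    name="temperature",
)

# Coordinates describe the data: select by label rather than by position.
subset = temperature.sel(longitude=20.0)

# Indexes align the data: arithmetic keeps only the labels both operands share.
later = temperature.isel(time=slice(1, 3))  # time labels 1 and 2
total = temperature + later                 # result has time labels 1 and 2
```

Here `total` covers only the two time steps present in both operands, which is how indexes keep mismatched arrays from being combined position by position.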
  12. Dask & Xarray: Enabling Geosciences

  13. Dask & Xarray

    • Xarray objects are dask collections
      ◦ Xarray variables can include dask arrays
      ◦ map_blocks allows xarray objects to be the primary dask collections
    • High-level metadata-aware interfaces to dask:
      ◦ xr.apply_ufunc()
      ◦ xr.map_blocks()
    • File I/O: Dask allows xarray to support parallel read and write functionality via open_mfdataset(), to_netcdf(), open_zarr(), and to_zarr().
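The two metadata-aware interfaces named on this slide can be sketched as follows, assuming `numpy`, `xarray`, and `dask` are installed. The helper functions `to_kelvin` and `demean`, the array contents, and the chunk sizes are hypothetical and chosen only for illustration.

```python
import numpy as np
import xarray as xr

data = np.arange(24.0).reshape(4, 6)
# Chunk along "time" only, so each block holds complete rows along "x".
temp = xr.DataArray(data, dims=("time", "x"), name="temp").chunk({"time": 2})


def to_kelvin(celsius):
    """Operates on plain NumPy blocks; xarray handles the labels."""
    return celsius + 273.15


# apply_ufunc wraps a NumPy-level function and applies it chunk by chunk.
kelvin = xr.apply_ufunc(
    to_kelvin, temp, dask="parallelized", output_dtypes=[float]
)


def demean(block):
    """Receives an xarray object per block, so named dimensions are available."""
    return block - block.mean("x")


# map_blocks applies an xarray-level function to each block; the template
# describes the expected structure of the result.
anomaly = xr.map_blocks(demean, temp, template=temp)

# Both results stay lazy until explicitly computed.
kelvin_values = kelvin.compute().values
anomaly_values = anomaly.compute().values
```

The difference in this sketch is the level at which the user function works: `apply_ufunc` hands it bare arrays, while `map_blocks` hands it labeled xarray blocks.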
  14. Issues affecting Pangeo

    • Running out of memory when rechunking with dask (work around by writing incrementally with to_zarr() in append mode)
    • Memory backpressure issue (D. E. Shaw’s graph manipulation tools!)
    • Dask-cloudprovider very attractive to orgs like USGS: FargateCluster “rate exceeded” issue
    • Community understanding of chunking impact on use
    • Dask performance challenges, e.g. pangeo/#194, dask/#3595
      ◦ More work on graph optimization, high-level graphs, task fusion, etc.
    • Dask deployment: more work on enabling heterogeneous worker pools, harmonization among systems, etc.
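To make the chunking point concrete, the sketch below compares two chunkings of the same lazy array and shows a rechunk between them. It assumes `dask` is installed; the shape, the chunk sizes, and the helper `chunk_summary` are illustrative choices, not values from the slides.

```python
import dask.array as da

shape = (4000, 4000)  # float64: 128 MB if fully materialized

# Same data, two chunkings. Nothing is allocated until a compute is requested.
small = da.zeros(shape, chunks=(100, 100))
large = da.zeros(shape, chunks=(2000, 2000))


def chunk_summary(arr):
    """Return (number of chunks, bytes in the largest chunk)."""
    chunk_bytes = arr.dtype.itemsize
    for size in arr.chunksize:
        chunk_bytes *= size
    return arr.npartitions, chunk_bytes


small_summary = chunk_summary(small)  # many tiny chunks: scheduler overhead
large_summary = chunk_summary(large)  # few big chunks: more memory per task

# Rechunking builds a new lazy array; the data movement happens at compute
# time, which is where the memory pressure described above can show up.
rechunked = small.rechunk((2000, 2000))
```

Many small chunks mean more tasks for the scheduler to track, while few large chunks mean each task needs more worker memory, so the choice is a trade-off rather than a fixed rule.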
  15. Pangeo is award winning!

    https://medium.com/informatics-lab/pangeo-the-award-wining-data-platform-ddf0b55185fa