Dask and Pangeo (A community platform for Big Data Geoscience)

Anderson Banihirwe

January 19, 2022

Transcript

  1. Dask and Pangeo (A community platform for Big Data Geoscience)
    Rich Signell (USGS)
    Anderson Banihirwe (NCAR)
    Ryan Abernathey (Columbia)
    Joe Hamman (NCAR)
    Matthew Rocklin (Anaconda->NVIDIA->Coiled Computing)
    Niall Robinson (UK Met Office Informatics Lab)
    Jacob Tomlinson (UK Met Office->NVIDIA)
    Scott Henderson (UW)
    and the rest of the Pangeo Community!
    Dask Developer Workshop - Feb 27, 2020

  3. USGS Sediment Transport Modeling
    Wind and Waves
    Water Levels
    Sea Bed Stress
    Sea Bed Erosion
    COAWST Modeling System (John Warner, USGS)
    ~200 TB of coastal ocean model output data
    in 4D (T, Z, Y, X) NetCDF files

  4. Pangeo is a Community

  5. Pangeo Core Architecture
    DATA
    Cloud-friendly ndarray data
    dask.distributed dask-jobqueue dask-mpi dask-kubernetes dask-cloudprovider dask-gateway
    LocalCluster() SLURMCluster() KubeCluster() FargateCluster()
    https://medium.com/pangeo
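
The cluster classes listed on this slide all play the same role: they start a scheduler and workers that a `Client` connects to. The sketch below uses `LocalCluster` only, with a small random array as a stand-in workload; the array sizes and worker counts are illustrative assumptions, not values from the talk.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# LocalCluster starts the scheduler and workers on this machine.
# processes=False keeps everything in one process (threads only) and
# dashboard_address=None disables the diagnostics dashboard.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
)
client = Client(cluster)

# Illustrative workload (an assumption): a chunked random array whose
# mean is computed chunk by chunk on the cluster's workers.
x = da.random.random((2000, 2000), chunks=(500, 500))
mean = x.mean().compute()
print(mean)

client.close()
cluster.close()
```

On an HPC system or in the cloud, the idea is that only the cluster construction changes, while the `Client` and the array code stay the same.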

  6. Reading cloud-optimized data with Dask is fast

  7. Holoviz trimesh rendering

  8. Dask trimesh rendering

  9. Dask trimesh rendering

  10. Datashader library support

  11. xarray: N-D labeled arrays
    time, longitude, latitude, elevation
    Data variables: used for computation
    Coordinates: describe data
    Indexes: align data
    Attributes: metadata ignored by operations
    + land_cover (SPARSE)
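
The parts of the xarray data model named on this slide can be sketched with a small `Dataset`. The variable name `temperature`, the coordinate values, and the attribute text are hypothetical and chosen only for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical coordinate values for a tiny 3 x 2 x 4 grid.
time = np.array(["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[ns]")
latitude = np.array([10.0, 20.0])
longitude = np.array([100.0, 110.0, 120.0, 130.0])

ds = xr.Dataset(
    # Data variables: the arrays used for computation.
    data_vars={
        "temperature": (
            ("time", "latitude", "longitude"),
            np.random.rand(3, 2, 4),
        ),
    },
    # Coordinates: describe the data; dimension coordinates also get indexes.
    coords={
        "time": time,
        "latitude": latitude,
        "longitude": longitude,
        "elevation": (("latitude", "longitude"), np.zeros((2, 4))),
    },
    # Attributes: free-form metadata that does not enter the arithmetic.
    attrs={"title": "hypothetical example dataset"},
)

# Indexes allow label-based selection and alignment.
subset = ds.sel(latitude=20.0)

# Computation uses dimension names rather than axis numbers.
anomaly = ds["temperature"] - ds["temperature"].mean("time")
print(anomaly.dims)
```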

  12. Dask & Xarray: Enabling Geosciences

  13. Dask & Xarray
    ● Xarray objects are dask collections
    ○ Xarray variables can include dask arrays
    ○ map_blocks allows xarray objects to be
    the primary dask collections
    ● High-level metadata-aware interfaces to
    dask:
    ○ xr.apply_ufunc()
    ○ xr.map_blocks()
    ● File I/O: Dask allows xarray to support
    parallel read and write functionality via its
    open_mfdataset(), to_netcdf(), open_zarr(),
    to_zarr().
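
The two metadata-aware interfaces named above can be sketched on a small dask-backed `DataArray`. The array contents and the helper functions `double` and `add_one` are hypothetical; only `xr.apply_ufunc` and `xr.map_blocks` come from the slide.

```python
import numpy as np
import xarray as xr

# A small DataArray split into two dask chunks along "time".
arr = xr.DataArray(
    np.arange(24.0).reshape(4, 6),
    dims=("time", "x"),
    name="v",
).chunk({"time": 2})

def double(block):
    # Receives plain NumPy blocks; xarray re-attaches dims and coords.
    return block * 2.0

# apply_ufunc wraps a NumPy-level function and runs it per dask chunk.
doubled = xr.apply_ufunc(
    double,
    arr,
    dask="parallelized",
    output_dtypes=[arr.dtype],
)

def add_one(block):
    # Receives an xarray object per chunk, so labels are available inside.
    return block + 1.0

# map_blocks applies an xarray-level function to each chunk.
shifted = xr.map_blocks(add_one, arr)

# Both results stay lazy until computed.
doubled_values = doubled.compute().values
shifted_values = shifted.compute().values
```

Roughly, `apply_ufunc` suits functions written for raw arrays, while `map_blocks` suits functions that need the labels of each chunk.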

  14. Issues affecting Pangeo
    ● Running out of memory when rechunking with dask (workaround: use
    xr.to_zarr(append=True)); Memory Backpressure issue (D. E. Shaw’s graph
    manipulation tools!)
    ● Dask-cloudprovider very attractive to orgs like USGS: FargateCluster “rate
    exceeded” issue
    ● Community understanding of chunking impact on use
    ● Dask Performance challenges, e.g. pangeo/#194, dask/#3595
    ○ More work on graph optimization, high-level graphs, task fusion, etc.
    ● Dask-deployment: More work on enabling heterogeneous worker pools,
    harmonization among systems, etc.

  15. Pangeo is award winning!
    https://medium.com/informatics-lab/pangeo-the-award-wining-data-platform-ddf0b55185fa
