SciPy 2020: Dask Maintainers Update

We give a brief update on the state of the Dask project, and its ties with other neighbouring projects.


Jacob Tomlinson

July 08, 2020


  1. SciPy 2020 Update

  2. 10,000 Documentation Visitors • Unique visitors on a weekly basis

  3. 5% of Python developers (among those who take the Python survey) https://www.jetbrains.com/lp/python-developers-survey-2019/
  4. Participate: dask.org/survey

  5. Events Community

  6. Dask Summit • February 2020 • Hosted by Capital One Labs • 50 People • 3 Days • 50% Open Source Maintainers • 50% Institutional Users https://blog.dask.org/2020/04/28/dask-summit
  7. Weekly Meeting • Every Tuesday • Core maintainers meet • Plan work • Discuss long term goals https://docs.dask.org/en/latest/support.html
  8. Monthly Meeting • First Thursday of each month • Community meets • Maintainers make announcements • Community shares work https://docs.dask.org/en/latest/support.html
  9. Tools Community

  10. Real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc. Xarray doesn’t just keep track of labels on arrays – it uses them to provide a powerful and concise interface.
  11. • Open source GPU Accelerated libraries with familiar APIs • Scaling onto multiple GPUs with Dask • Workload visualisation with the Dask Dashboard and NVDashboard • cuDF, cuML, cuGraph, nvStrings, CLX, cuxfilter, cuspatial, cusignal...
  12. How BlazingSQL uses Dask 1. SQL on Dask cuDF Dataframes 2. Output SQL to Dask cuDF Dataframes 3. Launch multi-GPU workers
  13. Dask and XGBoost can work together to train gradient boosted trees in parallel. 1. Prepare and clean our possibly large data, probably with a lot of Pandas wrangling 2. Set up XGBoost master and workers 3. Hand our cleaned data from a bunch of distributed Pandas dataframes to XGBoost workers across our cluster
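The three steps on the Dask and XGBoost slide follow a common pattern: clean each partition independently, start a set of workers, then hand the cleaned partitions to those workers. The sketch below shows only that pattern, using the Python standard library. It is not Dask or XGBoost code: the thread pool stands in for the cluster, the `clean`, `train_on_partition`, and `run_pipeline` functions and the `RAW_PARTITIONS` data are hypothetical, and the "training" step just computes a mean as a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw records, split into partitions the way a distributed
# dataframe is split into many smaller pandas dataframes.
RAW_PARTITIONS = [
    [{"x": 1.0, "y": 0}, {"x": None, "y": 1}],
    [{"x": 3.0, "y": 1}, {"x": 4.0, "y": 0}],
    [{"x": None, "y": 0}, {"x": 6.0, "y": 1}],
]


def clean(partition):
    """Step 1 stand-in: drop rows with a missing feature value."""
    return [row for row in partition if row["x"] is not None]


def train_on_partition(partition):
    """Step 3 stand-in: summarise the partition this worker was handed.

    A real training worker would compute model statistics here; this
    placeholder only returns (sum of x, number of rows).
    """
    return sum(row["x"] for row in partition), len(partition)


def run_pipeline(partitions, n_workers=2):
    # Step 2 stand-in: a thread pool plays the role of the worker cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Step 1: clean every partition in parallel.
        cleaned = list(pool.map(clean, partitions))
        # Step 3: hand each cleaned partition to a worker.
        partials = list(pool.map(train_on_partition, cleaned))
    # Combine the per-worker results into one global answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


mean_x = run_pipeline(RAW_PARTITIONS)
```

The point of the design is that the cleaned data never has to be gathered in one place: each partition is handed to a worker where it already is, and only small summaries are combined at the end.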
  14. Prefect • Prefect is a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. Users organize Tasks into Flows, and Prefect takes care of the rest.
  15. Iris • Iris implements a data model based on the CF conventions, giving you a powerful, format-agnostic interface for working with your data. It excels when working with multi-dimensional Earth Science data, where tabular representations become unwieldy and inefficient.
  16. And many more ... Dask is used very broadly within the Python ecosystem
  17. User Groups Community

  18. Dask is used wherever Python is used (which, as you know, is everywhere) 1. Life sciences like Harvard Medical School, Chan Zuckerberg, and Novartis 2. Finance like Barclays and Capital One 3. Geophysical Sciences like NASA, LANL, and the UK Met Office 4. Beamline facilities like Brookhaven 5. Retail like Walmart, JDA, and Grubhub https://youtu.be/t_GRK4L-bnw
  19. Napari • Enables lab scientists to view and process large volumes of images. Built with Python and Dask, but aimed at non-Python users. “This makes it easy to quickly evaluate large datasets, which in turn helps us make analysis decisions faster”
  20. For Profit Community

  21. For Profit • Coiled Computing: Coiled allows data scientists to seamlessly move complex workflows, code, and data between the cloud and their local workstation. https://coiled.io/ • Prefect: Prefect is a new workflow management system, designed for modern infrastructure and powered by open-source software. https://www.prefect.io/ • Saturn Cloud: Saturn Cloud enables data scientists to work at scale using the tools they know best: Python, Jupyter, and Dask. https://www.saturncloud.io/
  22. Cloud support

  23. Cloud support

  24. Recent Improvements Development

  25. UCX (Unified Communication X) • Collaboration between industry, laboratories, and academia to create an open-source, production-grade communication framework for data-centric and high-performance applications. https://www.openucx.org/
  26. TPCx-BB • NVIDIA outperformed by nearly 20x the record for running the standard big data analytics benchmark, known as TPCx-BB. Dask scaled the workload onto 16 DGX A100 machines with a total of 128 NVIDIA A100 GPUs. https://github.com/rapidsai/tpcx-bb
  27. Dask Gateway • Multiple active Dask Clusters (potentially more than one per user) • A Proxy for proxying both the connection between the user’s client and their respective scheduler, and the Dask Web UI for each cluster • A central Gateway that manages authentication and cluster startup/shutdown
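To make the three Dask Gateway components concrete, here is a small in-memory model of how they relate: a central gateway that authenticates users and starts or stops clusters, several clusters per user, and a proxy step that resolves a cluster name to its scheduler or dashboard address. This is a toy illustration only, not the dask-gateway API. The `ToyGateway` and `Cluster` classes, the tokens, and the addresses are all hypothetical; the ports 8786 and 8787 are simply Dask's usual scheduler and dashboard defaults.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    owner: str
    scheduler_address: str   # private address, reached only via the proxy
    dashboard_address: str   # Dask Web UI for this cluster


@dataclass
class ToyGateway:
    """Hypothetical in-memory model of the gateway; not a real API."""
    tokens: dict                                   # token -> username
    clusters: dict = field(default_factory=dict)   # name -> Cluster
    _ids: itertools.count = field(default_factory=itertools.count)

    def authenticate(self, token):
        # Central gateway: every request is authenticated first.
        if token not in self.tokens:
            raise PermissionError("unknown token")
        return self.tokens[token]

    def _owned(self, token, name):
        # Look up a cluster and check that the caller owns it.
        user = self.authenticate(token)
        cluster = self.clusters[name]
        if cluster.owner != user:
            raise PermissionError("cluster belongs to another user")
        return cluster

    def new_cluster(self, token):
        # Central gateway: cluster startup. A user may start several.
        user = self.authenticate(token)
        n = next(self._ids)
        cluster = Cluster(
            name=f"{user}-{n}",
            owner=user,
            scheduler_address=f"tcp://10.0.0.{n + 1}:8786",
            dashboard_address=f"http://10.0.0.{n + 1}:8787",
        )
        self.clusters[cluster.name] = cluster
        return cluster.name

    def stop_cluster(self, token, name):
        # Central gateway: cluster shutdown.
        self._owned(token, name)
        del self.clusters[name]

    def route(self, token, name, kind="scheduler"):
        # Proxy: map a public cluster name to a private address.
        cluster = self._owned(token, name)
        if kind == "scheduler":
            return cluster.scheduler_address
        return cluster.dashboard_address


gateway = ToyGateway(tokens={"secret-a": "alice", "secret-b": "bob"})
first = gateway.new_cluster("secret-a")
second = gateway.new_cluster("secret-a")
```

In this model users only ever see cluster names; the addresses behind them are resolved by `route`, which is the role the slide assigns to the proxy.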
  28. Cluster Map Plot (aka Pew Pew Pew)

  29. Next Steps Development

  30. High level graph optimization

  31. Scheduler performance • Rust scheduler implementation https://github.com/spirali/rsds • Dynamic Tasks https://github.com/dask/distributed/pull/3879 • Explore Cython, PyPy or C https://github.com/dask/distributed/issues/854 • Performance Benchmarking https://pandas.pydata.org/speed/distributed/
  32. Chan Zuckerberg Initiative • Dask won a CZI grant. We will be hiring someone to focus on the growth of Dask in the biological sciences field. If that is of interest, keep an eye on our Twitter account for more updates. @dask_dev
  33. Summary: 2020 Overview • 10,000 documentation visitors • Many projects built on Dask • Dask used to beat big data benchmarks • More info on CZI-funded maintainer position coming soon... Learn More • Jacob Tomlinson @_jacobtomlinson • Dask Website dask.org • Dask Twitter @dask_dev • Take the Dask 2020 survey at dask.org/survey