
SciPy 2020: Dask Maintainers Update


We give a brief update on the state of the Dask project, and its ties with other neighbouring projects.

Jacob Tomlinson

July 08, 2020


Transcript

  1. SciPy 2020 Update


  2. 10,000 Documentation Visitors
    Unique visitors on a weekly basis


  3. 5% of Python developers
    (among those who take the Python survey)
    https://www.jetbrains.com/lp/python-developers-survey-2019/


  4. Participate: dask.org/survey


  5. Events
    Community


  6. Dask Summit

    February 2020

    Hosted by Capital One Labs

    50 People

    3 Days

    50% Open Source Maintainers

    50% Institutional Users
    https://blog.dask.org/2020/04/28/dask-summit


  7. Weekly Meeting

    Every Tuesday

    Core maintainers meet

    Plan work

    Discuss long term goals
    https://docs.dask.org/en/latest/support.html


  8. Monthly Meeting

    First Thursday of each month

    Community meets

    Maintainers make announcements

    Community shares work
    https://docs.dask.org/en/latest/support.html


  9. Tools
    Community


  10. Real-world datasets are usually more than just
    raw numbers; they have labels which encode
    information about how the array values map to
    locations in space, time, etc.
    Xarray doesn’t just keep track of labels on arrays –
    it uses them to provide a powerful and concise
    interface
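
The idea of selecting by label rather than by integer position can be pictured with a small standard-library sketch. This is a toy illustration only: `LabeledGrid` and its `sel` method are hypothetical names made up for this example, not Xarray's implementation or API, and the temperature values are invented.

```python
# Toy illustration of label-based selection; NOT Xarray's implementation or API.

class LabeledGrid:
    """A 2-D grid whose axes carry names and coordinate labels (hypothetical helper)."""

    def __init__(self, values, dims, coords):
        self.values = values    # list of rows
        self.dims = dims        # names of the two axes, e.g. ("time", "city")
        self.coords = coords    # axis name -> list of labels along that axis

    def sel(self, **labels):
        """Select along exactly one axis by label instead of by integer position."""
        (dim, label), = labels.items()
        index = self.coords[dim].index(label)
        if dim == self.dims[0]:
            return list(self.values[index])           # one whole row
        return [row[index] for row in self.values]    # one whole column


# Invented example data: two days of temperatures for three cities.
temps = LabeledGrid(
    values=[[10.0, 12.0, 14.0], [11.0, 13.0, 15.0]],
    dims=("time", "city"),
    coords={
        "time": ["2020-07-01", "2020-07-02"],
        "city": ["London", "Paris", "Austin"],
    },
)

paris = temps.sel(city="Paris")            # all times for one city
first_day = temps.sel(time="2020-07-01")   # all cities for one day
```

The caller never has to remember that "Paris" is column 1; the labels carry that mapping.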


  11. ● Open source GPU accelerated libraries with familiar APIs
    ● Scaling onto multiple GPUs with Dask
    ● Workload visualisation with the Dask Dashboard and NVDashboard
    cuDF, cuML, cuGraph, nvStrings, CLX, cuxfilter, cuspatial, cusignal...


  12. How BlazingSQL uses Dask
    1. SQL on Dask cuDF Dataframes
    2. Output SQL to Dask cuDF Dataframes
    3. Launch multi-GPU workers


  13. Dask and XGBoost can work together to train
    gradient boosted trees in parallel.
    1. Prepare and clean our possibly large data,
    probably with a lot of Pandas wrangling
    2. Set up XGBoost master and workers
    3. Hand our cleaned data from a bunch
    of distributed Pandas dataframes to
    XGBoost workers across our cluster
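
The shape of these three steps can be sketched with the standard library alone. Neither Dask nor XGBoost is used here: the thread pool stands in for the cluster workers, `clean_partition` and `train_on_partitions` are hypothetical helpers, the cleaning rule and data are invented, and the "model" is just a mean label rather than gradient boosted trees.

```python
# Standard-library sketch of the three steps; neither Dask nor XGBoost is used here.
from concurrent.futures import ThreadPoolExecutor


def clean_partition(rows):
    """Step 1 (hypothetical cleaning rule): drop rows with a missing feature or label."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def train_on_partitions(partitions):
    """Step 3 stand-in: a trivial 'model' (the mean label) instead of boosted trees."""
    labels = [y for part in partitions for _, y in part]
    return sum(labels) / len(labels)


# Invented data: each inner list plays the role of one chunk of a larger dataset.
raw_partitions = [
    [(1.0, 2.0), (None, 3.0)],
    [(2.0, 4.0), (3.0, None)],
    [(4.0, 6.0)],
]

# Step 2 stand-in: start a pool of workers, then clean every partition in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    cleaned = list(pool.map(clean_partition, raw_partitions))

# Hand the cleaned partitions to the training step without merging them first.
model = train_on_partitions(cleaned)
```

The point of the pattern is that the data stays partitioned from cleaning through to training, so no single machine ever needs to hold all of it.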


  14. Prefect
    Prefect is a new workflow management system,
    designed for modern infrastructure and powered by the
    open-source Prefect Core workflow engine.
    Users organize Tasks into Flows, and Prefect takes care
    of the rest.
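
As a rough picture of what "organizing tasks into a flow" means, here is a toy dependency-ordered runner built on the standard library's `graphlib`. It is not Prefect's API: the `flow` dictionary layout, `run_flow`, and the three example tasks are all assumptions made up for this sketch.

```python
# Toy sketch of "tasks organized into a flow"; NOT Prefect's API.
from graphlib import TopologicalSorter  # Python 3.9+


def extract():
    return [1, 2, 3]


def transform(numbers):
    return [n * 10 for n in numbers]


def load(numbers):
    return sum(numbers)


# Hypothetical flow description: task name -> (function, names of upstream tasks).
flow = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_flow(flow):
    """Run every task after its upstream tasks, passing their results along."""
    graph = {name: deps for name, (_, deps) in flow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = flow[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results


results = run_flow(flow)
```

The user only declares the tasks and their dependencies; the ordering is worked out by the runner.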


  15. Iris
    Iris implements a data model based on the CF conventions
    giving you a powerful, format-agnostic interface for
    working with your data.
    It excels when working with multi-dimensional Earth
    Science data, where tabular representations become
    unwieldy and inefficient.


  16. And many more ...
    Dask is used very broadly within the Python ecosystem


  17. User Groups
    Community


  18. Dask is used wherever Python is used
    (which, as you know, is everywhere)
    1. Life sciences like Harvard Medical School, Chan Zuckerberg, and Novartis
    2. Finance like Barclays and Capital One
    3. Geophysical sciences like NASA, LANL, and the UK Met Office
    4. Beamline facilities like Brookhaven
    5. Retail like Walmart, JDA, and Grubhub
    https://youtu.be/t_GRK4L-bnw


  19. Napari
    Enables lab scientists to view and process
    large volumes of images.
    Built with Python and Dask, but aimed at
    non-Python users.
    “This makes it easy to quickly evaluate
    large datasets, which in turn helps us
    make analysis decisions faster”


  20. For Profit
    Community


  21. For Profit
    Coiled Computing
    Coiled allows data scientists to seamlessly move complex
    workflows, code, and data between the cloud and their local
    workstation
    https://coiled.io/
    Prefect
    Prefect is a new workflow management system, designed for
    modern infrastructure and powered by open-source software.
    https://www.prefect.io/
    Saturn Cloud
    Saturn Cloud enables data scientists to work at scale using the
    tools they know best: Python, Jupyter, and Dask
    https://www.saturncloud.io/


  22. Cloud support


  23. Cloud support


  24. Recent Improvements
    Development


  25. UCX
    Unified Communication X
    Collaboration between industry, laboratories, and
    academia to create an open-source,
    production-grade communication framework for
    data-centric and high-performance applications.
    https://www.openucx.org/


  26. TPCx-BB
    NVIDIA beat the previous record on TPCx-BB, the
    standard big data analytics benchmark, by
    nearly 20x.
    Dask scaled the workload onto 16 DGX A100
    machines with a total of 128 NVIDIA A100
    GPUs.
    https://github.com/rapidsai/tpcx-bb


  27. Dask Gateway
    ● Multiple active Dask Clusters (potentially
    more than one per user)
    ● A Proxy for proxying both the connection
    between the user’s client and their
    respective scheduler, and the Dask Web
    UI for each cluster
    ● A central Gateway that manages
    authentication and cluster
    startup/shutdown


  28. Cluster Map Plot (aka Pew Pew Pew)


  29. Next Steps
    Development


  30. High level graph optimization


  31. Scheduler performance
    Rust scheduler implementation
    https://github.com/spirali/rsds
    Dynamic Tasks
    https://github.com/dask/distributed/pull/3879
    Explore Cython, PyPy or C
    https://github.com/dask/distributed/issues/854
    Performance Benchmarking
    https://pandas.pydata.org/speed/distributed/


  32. Chan Zuckerberg Initiative
    Dask won a CZI grant
    We will be hiring someone to focus on growth
    of Dask in the biological sciences field.
    If that is of interest, keep an eye on our Twitter
    account for more updates.
    @dask_dev


  33. Summary: 2020 Overview
    ● 10,000 documentation visitors
    ● Many projects built on Dask
    ● Dask used to beat big data benchmarks
    ● More info on CZI funded maintainer position coming soon...
    Learn More
    Jacob Tomlinson @_jacobtomlinson
    Dask Website dask.org
    Dask Twitter @dask_dev
    Take the Dask 2020 survey at dask.org/survey
