Dask Summit ● February 2020 ● Hosted by Capital One Labs ● 50 People ● 3 Days ● 50% Open Source Maintainers ● 50% Institutional Users https://blog.dask.org/2020/04/28/dask-summit
Monthly Meeting ● First Thursday of each month ● Community meets ● Maintainers make announcements ● Community shares work https://docs.dask.org/en/latest/support.html
Real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc. Xarray doesn’t just keep track of labels on arrays – it uses them to provide a powerful and concise interface.
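A minimal sketch of what label-based access looks like in Xarray. The array contents, dimension names, and coordinate values below are made up for illustration and do not come from the talk.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 3 days x 2 latitudes x 2 longitudes
temps = xr.DataArray(
    np.arange(12, dtype=float).reshape(3, 2, 2),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2020-01-01", periods=3),
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0],
    },
    name="temperature",
)

# Select by coordinate label instead of by integer position
one_point = temps.sel(lat=20.0, lon=100.0)

# Reduce over a dimension by name instead of by axis number
time_mean = temps.mean(dim="time")
```

Here `one_point` is a one-dimensional series over `time`, and `time_mean` keeps the `lat` and `lon` labels, so neither step needs the reader to remember which axis is which.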
Dask and XGBoost can work together to train gradient boosted trees in parallel. 1. Prepare and clean our possibly large data, probably with a lot of Pandas wrangling 2. Set up XGBoost master and workers 3. Hand our cleaned data from a bunch of distributed Pandas dataframes to XGBoost workers across our cluster
Prefect Prefect is a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. Users organize Tasks into Flows, and Prefect takes care of the rest.
Iris Iris implements a data model based on the CF conventions giving you a powerful, format-agnostic interface for working with your data. It excels when working with multi-dimensional Earth Science data, where tabular representations become unwieldy and inefficient.
Dask is used wherever Python is used (which, as you know, is everywhere) 1. Life sciences like Harvard Medical School, Chan Zuckerberg, and Novartis 2. Finance like Barclays and Capital One 3. Geophysical Sciences like NASA, LANL, and the UK Met Office 4. Beamline facilities like Brookhaven 5. Retail like Walmart, JDA, and Grubhub https://youtu.be/t_GRK4L-bnw
Napari Enables lab scientists to view and process large volumes of images. Built with Python and Dask, but aimed at non-Python users. “This makes it easy to quickly evaluate large datasets, which in turn helps us make analysis decisions faster”
For Profit Coiled Computing Coiled allows data scientists to seamlessly move complex workflows, code, and data between the cloud and their local workstation https://coiled.io/ Prefect Prefect is a new workflow management system, designed for modern infrastructure and powered by open-source software. https://www.prefect.io/ Saturn Cloud Saturn Cloud enables data scientists to work at scale using the tools they know best: Python, Jupyter, and Dask https://www.saturncloud.io/
UCX Unified Communication X Collaboration between industry, laboratories, and academia to create an open-source, production-grade communication framework for data-centric and high-performance applications. https://www.openucx.org/
TPCx-BB NVIDIA outperformed by nearly 20x the record for running the standard big data analytics benchmark, known as TPCx-BB. Dask scaled the workload onto 16 DGX A100 machines with a total of 128 NVIDIA A100 GPUs. https://github.com/rapidsai/tpcx-bb
Dask Gateway ● Multiple active Dask Clusters (potentially more than one per user) ● A Proxy for proxying both the connection between the user’s client and their respective scheduler, and the Dask Web UI for each cluster ● A central Gateway that manages authentication and cluster startup/shutdown
Chan Zuckerberg Initiative Dask won a CZI grant. We will be hiring someone to focus on the growth of Dask in the biological sciences. If that is of interest, keep an eye on our Twitter account for more updates. @dask_dev
Summary: 2020 Overview ● 10,000 documentation visitors ● Many projects built on Dask ● Dask used to beat big data benchmarks ● More info on CZI funded maintainer position coming soon... Learn More Jacob Tomlinson @_jacobtomlinson Dask Website dask.org Dask Twitter @dask_dev Take the Dask 2020 survey at dask.org/survey