Dask Overview - Stanford Legion all-hands

An overview Jacob Tomlinson Dask developer Senior Software Engineer at
NVIDIA

Jacob Tomlinson Senior Software Engineer NVIDIA

General purpose Python library for parallelism Scales existing libraries, like
Numpy, Pandas, and Scikit-Learn Flexible enough to build complex and custom systems Accessible for beginners, secure and trusted for institutions

The early years A brief history and some context

Powerful: Leading platform today for analytics Limited: Fails for big
data or scalable computing Frustration: Alternatives fracture development Problem: Python is powerful, but doesn’t scale well Python is great for medium data >>> import pandas as pd >>> df = pd.read_parquet(“accounts”) MemoryError

2014: Blaze and the birth of Dask “Blaze was an
ambitious project that tried to redefine computation, storage, compression, and data science APIs for Python, led originally by Travis Oliphant and Peter Wang, the co-founders of Anaconda. However, Blaze’s approach of being an ecosystem-in-a-package meant that it was harder for new users to easily adopt. As a result, we started to intentionally develop new components of Blaze outside the project … [and dask was designed to be] the simplest way to do parallel NumPy operations” Matthew Rocklin Dask Creator Source https://coiled.io/blog/history-dask/

PyData Community adoption “Once Dask was working properly with NumPy,
it became clear that there was huge demand for a lightweight parallelism solution for Pandas DataFrames and machine learning tools, such as Scikit-Learn. Dask then evolved quickly to support these other projects where appropriate.” Matthew Rocklin Dask Creator Source https://coiled.io/blog/history-dask/ Image from Jake VanderPlas’ keynote, PyCon 2017

Dask’s distributed scheduler “For the first year of Dask’s life
it was focused on single-machine parallelism. But inevitably, Dask was used on problems that didn’t fit on a single machine. This led us to develop a distributed-memory scheduler for Dask that supported the same API as the existing single-machine scheduler. For Dask users this was like magic. Suddenly their existing workloads on 50GB datasets could run comfortably on 5TB (and then 50TB a bit later).” Matthew Rocklin Dask Creator Source https://coiled.io/blog/history-dask/

Dask’s Features Overview

Out-of-core computation Dask’s data structures are chunked or partitioned allowing
them to be swapped in and out of memory. Operations run on chunks independently and only communicate intermediate results when necessary

Distributed Task Graphs Constructing tasks in a DAG allows tasks
to executed by a selection of schedulers. The distributed scheduler allows a DAG to be shared by many workers running over many machines to spread out work.

Dask accelerates the existing Python ecosystem Built alongside with the
current community import numpy as np x = np.ones((1000, 1000)) x + x.T - x.mean(axis=0 import pandas as pd df = pd.read_csv(“file.csv”) df.groupby(“x”).y.mean() from scikit_learn.linear_model \ import LogisticRegression lr = LogisticRegression() lr.fit(data, labels) Numpy Pandas Scikit-Learn

Custom code with delayed and futures import dask @dask.delayed def
inc(x): return x + 1 @dask.delayed def double(x): return x * 2 @dask.delayed def add(x, y): return x + y data = [1, 2, 3, 4, 5] output = [] for x in data: a = inc(x) b = double(x) c = add(a, b) output.append(c) total = dask.delayed(sum)(output) Dask also allows users to construct custom graphs with the delayed and futures APIs.

cluster = KubeCluster() cluster = ECSCluster() df = dd.read_parquet(...) cluster
= PBSCluster() cluster = LSFCluster() cluster = SLURMCluster() … df = dd.read_parquet(...) cluster = YarnCluster() df = dd.read_parquet(...) Dask deploys on all major resource managers Cloud HPC Hadoop/Spark Cloud, HPC, or Yarn, it’s all the same to Dask

Create Dask Clusters within workflows # Install dask-kubernetes $ pip
install dask-kubernetes # Launch a cluster >>> from dask_kubernetes.experimental \ import KubeCluster >>> cluster = KubeCluster(name="demo") # List the DaskCluster custom resource that was created for us under the hood $ kubectl get daskclusters NAME AGE demo-cluster 6m3s

Monitoring our work # Connect a Dask client >>> from
dask.distributed import Client >>> client = Client(cluster) # Do come computation >>> import dask.array as da >>> arr = da.random.random((10_000, 1_000, 1_000), chunks=(1000, 1000, 100)) >>> result = arr.mean().compute()

Why folks use Dask Overview

cluster = LocalCluster() df = dd.read_parquet(...) Dask is already deployed
on your laptop Laptops Easy to start locally… … and then burst out to arbitrary hardware Conda or pip installable, included by default in Anaconda

Low barrier to distributed computing https://docs.dask.org/en/stable/10-minutes-to-dask.html Dask allows Data Scientists
and other users of the PyData ecosystem to get up and running with distributed computing. Taking your first steps from Pandas to Dask Dataframe just requires the introduction of a few distributed computing concepts. No additional hardware or platforms necessary.

Dashboard Dask’s dashboard gives you key insights into how your
cluster is performing. You can view it in a browser or directly within Jupyter Lab to see how your graphs are executing. You can also use the built in profiler to understand where the slow parts of your code are.

Security and zero-trust Dask allows communication between components to be
TLS encrypted. While you may pay a small performance overhead this is a must have for some folks running on shared infrastructure like multi-tenant Kubernetes clusters.

Resilience If a task fails to execute Dask workers will
retry multiple times before giving up. This can allow Dask to complete tasks even on unreliable infrastructure. Also the centralised scheduler allows whole DAG sections to be recomputed if intermediate results are lost due to hardware failure. Folks commonly run Dask on ephemeral cloud infrastructure which is cheaper but can disappear suddenly. Thanks to Dask’s auto-healing interrupted DAGs will still complete after the lost computations have been repeated.

Elastic scaling Dask’s adaptive scaling allows a Dask scheduler to
request additional workers via whatever resource manager you are using (Kubernetes, Cloud, etc). This allows computations to burst out onto more machines and complete the overall graph in less time. This is particularly effective when you have multiple people running interactive and embarrassingly parallel workloads on shared resources.

Customisable Exposing Dask’s users to distributed computing concepts is a
double-edged sword. It increases the learning curve but our community has shown this is not a deal breaking problem. Instead allowing folks access to internals means they can create custom graphs, modify autoscaling or scheduler heuristics and make Dask work in their existing workflows. While it’s totally possible to tie yourself in knots or produce poorly performing graphs it also allows our users to push Dask beyond what is possible out of the box. Before/after cluster utilization of a bespoke radio astronomy workload using the AutoRestrictor scheduler plugin (whitespace and red blocks is bad). https://github.com/dask/distributed/pull/4864

Pleasant to use and adopt Beautiful interactive dashboard Builds intuition
on parallel performance Familiar APIs and data models Dask looks and feels like well known libraries in the PyData ecosystem Co-developed with the ecosystem Built by NumPy, Pandas, and Scikit-Learn devs Dask complements the existing ecosystem Dask is designed for experts and novices alike

The Dask Ecosystem Community

Software Community Developed: 300 contributors, 20 active maintainers From: Numpy,
Pandas, Scikit-Learn, Jupyter, and more Run by people you know, built by people you trust Safe: BSD-3 Licensed, fiscally sponsored by NumFOCUS, community governed Discussed: Dask is the most common parallel framework at PyData/SciPy/PyCon conferences today. Used: 10k weekly visitors to documentation 28

Dask’s users

Use Case: Banking with Capital One Capital One engineers analyze
financial and credit data to build cloud-based machine learning models • Datasets range in scale from 1 GB - 100 TB • CSV and Parquet datasets 100s - 1000s of columns • Data Cleaning, Feature Selection, Engineering, Training, Validation, and Governance • Dask DataFrames used to train with dask-ml and dask-xgboost • 10x speed up in computational performance with Dask • Faster development and improved accuracy for credit risk models • Deployments on AWS can be optimized to reduce overall computing costs or faster development iterations Data cleaning, Feature engineering, and Machine learning

Use Case: Analyze Sea Levels Pangeo researchers analyze simulated and
observed climate and imaging data 1 GB - 100 TB HPC and Cloud HDF5/NetCDF/Zarr storage Interactive computing with notebooks Includes collaborators NASA, NOAA, USGS, UK-Met, CSIRO, and various industries Learn more about Pangeo in this talk Sea level altitude variability over 30 years Columbia/NCAR leverage Dask Array to understand our planet

What’s next for Dask? Direction

Performance While adding more more workers generally results in performance
moving up and to the left Dask makes many compromises in terms of user experience and platform support that mean it does not scale as we would like. Engineers from organisations continue to work to reduce the impact of these compromises as much as is possible. For many users Dask has been about making the impossible possible, but are now starting to look towards making the impossible fast.

Improving deployment Dask can be deployed in many places but
requires users to have knowledge of how to deploy it and to manage their clusters. There are many tools in Dask that automate this and allow less experienced users to run large clusters on the cloud and beyond. A key next step is making these deployment mechanisms more robust, flexible and usable. Whether that is improving the tools themselves or creating new tools that give more context and information to users about their clusters.

Read Documentation: docs.dask.org See Examples: examples.dask.org Engage Community: github.com/dask

Dask Overview - Stanford Legion all-hands

Dask Overview - Stanford Legion all-hands

More Decks by Jacob Tomlinson

Other Decks in Technology

Featured

Transcript