Dask Overview - Stanford Legion all-hands

Jacob Tomlinson

September 29, 2022

Transcript

  1. General purpose Python library for parallelism. Scales existing libraries, like NumPy, Pandas, and Scikit-Learn. Flexible enough to build complex and custom systems. Accessible for beginners, secure and trusted for institutions.
  2. Powerful: the leading platform today for analytics. Limited: fails for big data or scalable computing. Frustration: alternatives fracture development. Problem: Python is powerful, but doesn’t scale well. Python is great for medium data:

        >>> import pandas as pd
        >>> df = pd.read_parquet("accounts")
        MemoryError
  3. 2014: Blaze and the birth of Dask. “Blaze was an ambitious project that tried to redefine computation, storage, compression, and data science APIs for Python, led originally by Travis Oliphant and Peter Wang, the co-founders of Anaconda. However, Blaze’s approach of being an ecosystem-in-a-package meant that it was harder for new users to easily adopt. As a result, we started to intentionally develop new components of Blaze outside the project … [and dask was designed to be] the simplest way to do parallel NumPy operations” Matthew Rocklin, Dask Creator. Source: https://coiled.io/blog/history-dask/
  4. PyData community adoption. “Once Dask was working properly with NumPy, it became clear that there was huge demand for a lightweight parallelism solution for Pandas DataFrames and machine learning tools, such as Scikit-Learn. Dask then evolved quickly to support these other projects where appropriate.” Matthew Rocklin, Dask Creator. Source: https://coiled.io/blog/history-dask/ Image from Jake VanderPlas’ keynote, PyCon 2017.
  5. Dask’s distributed scheduler. “For the first year of Dask’s life it was focused on single-machine parallelism. But inevitably, Dask was used on problems that didn’t fit on a single machine. This led us to develop a distributed-memory scheduler for Dask that supported the same API as the existing single-machine scheduler. For Dask users this was like magic. Suddenly their existing workloads on 50GB datasets could run comfortably on 5TB (and then 50TB a bit later).” Matthew Rocklin, Dask Creator. Source: https://coiled.io/blog/history-dask/
  6. Out-of-core computation. Dask’s data structures are chunked or partitioned, allowing them to be swapped in and out of memory. Operations run on chunks independently and only communicate intermediate results when necessary.
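
    For illustration, a minimal dask.array sketch of this chunking model (the shapes and chunk sizes here are arbitrary):

        import dask.array as da

        # A 10,000 x 10,000 array split into 1,000 x 1,000 chunks; each chunk
        # is a small NumPy array that can be loaded, processed, and released
        # independently of the others.
        x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))

        # Reductions run per chunk and then combine the intermediate results.
        total = x.sum().compute()
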
  7. Distributed task graphs. Constructing tasks in a DAG allows them to be executed by a selection of schedulers. The distributed scheduler allows a DAG to be shared by many workers running over many machines to spread out work.
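
    As an illustrative sketch, the same graph can be handed to different schedulers; the choices below are examples, not the only options:

        import dask.array as da

        x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
        y = (x + x.T).mean(axis=0)

        # The same task graph can be executed by different schedulers.
        y.compute(scheduler="threads")        # local thread pool
        # y.compute(scheduler="processes")    # local process pool
        # With a dask.distributed Client connected, compute() runs on the cluster.
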
  8. Dask accelerates the existing Python ecosystem. Built alongside the current community.

        NumPy:
        import numpy as np
        x = np.ones((1000, 1000))
        x + x.T - x.mean(axis=0)

        Pandas:
        import pandas as pd
        df = pd.read_csv("file.csv")
        df.groupby("x").y.mean()

        Scikit-Learn:
        from sklearn.linear_model import LogisticRegression
        lr = LogisticRegression()
        lr.fit(data, labels)
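
    For comparison, a sketch of roughly equivalent Dask code; the chunk sizes are arbitrary, and dask_ml is an assumption about which package provides the scikit-learn style estimator:

        import dask.array as da
        x = da.ones((1000, 1000), chunks=(100, 100))
        (x + x.T - x.mean(axis=0)).compute()

        import dask.dataframe as dd
        df = dd.read_csv("file.csv")
        df.groupby("x").y.mean().compute()

        from dask_ml.linear_model import LogisticRegression
        lr = LogisticRegression()
        lr.fit(data, labels)   # data and labels assumed to be Dask collections
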
  9. Custom code with delayed and futures. Dask also allows users to construct custom graphs with the delayed and futures APIs.

        import dask

        @dask.delayed
        def inc(x):
            return x + 1

        @dask.delayed
        def double(x):
            return x * 2

        @dask.delayed
        def add(x, y):
            return x + y

        data = [1, 2, 3, 4, 5]
        output = []
        for x in data:
            a = inc(x)
            b = double(x)
            c = add(a, b)
            output.append(c)
        total = dask.delayed(sum)(output)
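
    The futures side is not shown on the slide; a minimal sketch of the same pattern with the futures API (assuming a Client is available) might be:

        from dask.distributed import Client

        client = Client()  # starts a local cluster if no address is given

        def inc(x):
            return x + 1

        futures = client.map(inc, [1, 2, 3, 4, 5])   # tasks start running immediately
        total = client.submit(sum, futures)          # a task that depends on the futures above
        print(total.result())
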
  10. Dask deploys on all major resource managers. Cloud, HPC, or Yarn, it’s all the same to Dask.

        Cloud:
        cluster = KubeCluster()
        cluster = ECSCluster()
        df = dd.read_parquet(...)

        HPC:
        cluster = PBSCluster()
        cluster = LSFCluster()
        cluster = SLURMCluster()
        …
        df = dd.read_parquet(...)

        Hadoop/Spark:
        cluster = YarnCluster()
        df = dd.read_parquet(...)
  11. Create Dask Clusters within workflows.

        # Install dask-kubernetes
        $ pip install dask-kubernetes

        # Launch a cluster
        >>> from dask_kubernetes.experimental import KubeCluster
        >>> cluster = KubeCluster(name="demo")

        # List the DaskCluster custom resource that was created for us under the hood
        $ kubectl get daskclusters
        NAME           AGE
        demo-cluster   6m3s
  12. Monitoring our work.

        # Connect a Dask client
        >>> from dask.distributed import Client
        >>> client = Client(cluster)

        # Do some computation
        >>> import dask.array as da
        >>> arr = da.random.random((10_000, 1_000, 1_000), chunks=(1000, 1000, 100))
        >>> result = arr.mean().compute()
  13. Dask is already deployed on your laptop. Easy to start locally… and then burst out to arbitrary hardware. Conda or pip installable, included by default in Anaconda.

        Laptops:
        cluster = LocalCluster()
        df = dd.read_parquet(...)
  14. Low barrier to distributed computing. https://docs.dask.org/en/stable/10-minutes-to-dask.html Dask allows data scientists and other users of the PyData ecosystem to get up and running with distributed computing. Taking your first steps from Pandas to Dask DataFrame just requires the introduction of a few distributed computing concepts. No additional hardware or platforms necessary.
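
    As an illustrative sketch of those few new concepts (the file pattern is a placeholder):

        import dask.dataframe as dd

        ddf = dd.read_csv("accounts-*.csv")           # data is split into pandas partitions
        ddf.npartitions                               # how many partitions we got
        result = ddf.groupby("name").amount.mean()    # lazy: this only builds a task graph
        result.compute()                              # triggers execution, returns a pandas object
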
  15. Dashboard. Dask’s dashboard gives you key insights into how your cluster is performing. You can view it in a browser or directly within JupyterLab to see how your graphs are executing. You can also use the built-in profiler to understand where the slow parts of your code are.
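
    A minimal sketch of how you typically find the dashboard (the address in the comment is illustrative):

        from dask.distributed import Client

        client = Client()              # or Client(cluster)
        print(client.dashboard_link)   # e.g. http://127.0.0.1:8787/status
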
  16. Security and zero-trust. Dask allows communication between components to be TLS encrypted. While you may pay a small performance overhead, this is a must-have for some folks running on shared infrastructure like multi-tenant Kubernetes clusters.
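
    A rough sketch of what enabling TLS looks like with distributed's Security object; the certificate paths and scheduler address are placeholders:

        from dask.distributed import Client, Security

        security = Security(
            tls_ca_file="ca.pem",
            tls_client_cert="client-cert.pem",
            tls_client_key="client-key.pem",
            require_encryption=True,
        )
        client = Client("tls://scheduler:8786", security=security)
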
  17. Resilience. If a task fails to execute, Dask workers will retry it multiple times before giving up. This can allow Dask to complete tasks even on unreliable infrastructure. Also, the centralised scheduler allows whole DAG sections to be recomputed if intermediate results are lost due to hardware failure. Folks commonly run Dask on ephemeral cloud infrastructure, which is cheaper but can disappear suddenly. Thanks to Dask’s auto-healing, interrupted DAGs will still complete after the lost computations have been repeated.
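
    For illustration, retries can be requested per task; a small sketch with an arbitrary retry count and a placeholder function:

        from dask.distributed import Client

        client = Client()  # or Client(cluster)

        def flaky(x):
            # placeholder for work that occasionally fails on unreliable hardware
            return x + 1

        # Retry a failing task up to three times before giving up.
        future = client.submit(flaky, 42, retries=3)

        # Dask collections accept the same keyword at compute time,
        # e.g. result.compute(retries=3)
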
  18. Elastic scaling. Dask’s adaptive scaling allows a Dask scheduler to request additional workers via whatever resource manager you are using (Kubernetes, Cloud, etc.). This allows computations to burst out onto more machines and complete the overall graph in less time. This is particularly effective when you have multiple people running interactive and embarrassingly parallel workloads on shared resources.
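
    A minimal sketch of turning adaptive scaling on; the cluster class and the bounds here are just examples:

        from dask.distributed import Client
        from dask_kubernetes.experimental import KubeCluster  # any cluster manager with adaptive support

        cluster = KubeCluster(name="demo")
        cluster.adapt(minimum=2, maximum=100)   # scheduler asks for 2-100 workers as load changes
        client = Client(cluster)
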
  19. Customisable. Exposing Dask’s users to distributed computing concepts is a double-edged sword. It increases the learning curve, but our community has shown this is not a deal-breaking problem. Instead, allowing folks access to internals means they can create custom graphs, modify autoscaling or scheduler heuristics, and make Dask work in their existing workflows. While it’s entirely possible to tie yourself in knots or produce poorly performing graphs, it also allows our users to push Dask beyond what is possible out of the box. Before/after cluster utilization of a bespoke radio astronomy workload using the AutoRestrictor scheduler plugin (whitespace and red blocks are bad). https://github.com/dask/distributed/pull/4864
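
    As a rough sketch of what poking at the internals looks like, a tiny scheduler plugin that just observes task transitions; this is illustrative only and is unrelated to the AutoRestrictor plugin itself:

        from distributed.diagnostics.plugin import SchedulerPlugin

        class LogTransitions(SchedulerPlugin):
            """Print every task state transition seen by the scheduler."""

            def transition(self, key, start, finish, *args, **kwargs):
                print(f"{key}: {start} -> {finish}")

        # With a connected client:
        # client.register_scheduler_plugin(LogTransitions())
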
  20. Pleasant to use and adopt. Beautiful interactive dashboard: builds intuition on parallel performance. Familiar APIs and data models: Dask looks and feels like well-known libraries in the PyData ecosystem. Co-developed with the ecosystem: built by NumPy, Pandas, and Scikit-Learn devs; Dask complements the existing ecosystem. Dask is designed for experts and novices alike.
  21. Software community. Developed: 300 contributors, 20 active maintainers. From: NumPy, Pandas, Scikit-Learn, Jupyter, and more. Run by people you know, built by people you trust. Safe: BSD-3 licensed, fiscally sponsored by NumFOCUS, community governed. Discussed: Dask is the most common parallel framework at PyData/SciPy/PyCon conferences today. Used: 10k weekly visitors to documentation.
  22. Use Case: Banking with Capital One. Capital One engineers analyze financial and credit data to build cloud-based machine learning models: data cleaning, feature engineering, and machine learning.
        • Datasets range in scale from 1 GB - 100 TB
        • CSV and Parquet datasets with 100s - 1000s of columns
        • Data cleaning, feature selection, engineering, training, validation, and governance
        • Dask DataFrames used to train with dask-ml and dask-xgboost
        • 10x speed-up in computational performance with Dask
        • Faster development and improved accuracy for credit risk models
        • Deployments on AWS can be optimized to reduce overall computing costs or for faster development iterations
  23. Use Case: Analyze Sea Levels. Pangeo researchers analyze simulated and observed climate and imaging data. Columbia/NCAR leverage Dask Array to understand our planet.
        • 1 GB - 100 TB
        • HPC and Cloud
        • HDF5/NetCDF/Zarr storage
        • Interactive computing with notebooks
        • Collaborators include NASA, NOAA, USGS, UK-Met, CSIRO, and various industries
        Learn more about Pangeo in this talk. Figure: sea level altitude variability over 30 years.
  24. Performance. While adding more workers generally improves performance, Dask makes many compromises in terms of user experience and platform support that mean it does not scale as well as we would like. Engineers from many organisations continue to work to reduce the impact of these compromises as much as possible. For many users Dask has been about making the impossible possible, but we are now starting to look towards making the impossible fast.
  25. Improving deployment. Dask can be deployed in many places, but it requires users to know how to deploy it and how to manage their clusters. There are many tools in Dask that automate this and allow less experienced users to run large clusters on the cloud and beyond. A key next step is making these deployment mechanisms more robust, flexible and usable, whether that means improving the tools themselves or creating new tools that give users more context and information about their clusters.