Dask Overview - Stanford Legion all-hands

Jacob Tomlinson

September 29, 2022

Transcript

  1. An overview
    Jacob Tomlinson
    Dask developer
    Senior Software Engineer at NVIDIA

  2. Jacob Tomlinson
    Senior Software Engineer
    NVIDIA

  3. General purpose Python library for parallelism
    Scales existing libraries, like Numpy, Pandas, and Scikit-Learn
    Flexible enough to build complex and custom systems
    Accessible for beginners, secure and trusted for institutions

  4. The early years
    A brief history and some context

  5. Problem: Python is powerful, but doesn’t scale well
    Powerful: Leading platform today for analytics
    Limited: Fails for big data or scalable computing
    Frustration: Alternatives fracture development
    Python is great for medium data
    >>> import pandas as pd
    >>> df = pd.read_parquet("accounts")
    MemoryError

  6. 2014: Blaze and the birth of Dask
    “Blaze was an ambitious project that tried
    to redefine computation, storage,
    compression, and data science APIs for
    Python, led originally by Travis Oliphant
    and Peter Wang, the co-founders of
    Anaconda.
    However, Blaze’s approach of being an
    ecosystem-in-a-package meant that it was
    harder for new users to easily adopt.
    As a result, we started to intentionally
    develop new components of Blaze outside
    the project … [and dask was designed to
    be] the simplest way to do parallel NumPy
    operations”
    Matthew Rocklin
    Dask Creator
    Source https://coiled.io/blog/history-dask/

  7. PyData Community adoption
    “Once Dask was working properly with
    NumPy, it became clear that there was
    huge demand for a lightweight parallelism
    solution for Pandas DataFrames and
    machine learning tools, such as
    Scikit-Learn.
    Dask then evolved quickly to support
    these other projects where appropriate.”
    Matthew Rocklin
    Dask Creator
    Source https://coiled.io/blog/history-dask/
    Image from Jake VanderPlas’ keynote, PyCon 2017

  8. Dask’s distributed scheduler
    “For the first year of Dask’s life it was
    focused on single-machine parallelism.
    But inevitably, Dask was used on
    problems that didn’t fit on a single
    machine. This led us to develop a
    distributed-memory scheduler for Dask
    that supported the same API as the
    existing single-machine scheduler.
    For Dask users this was like magic.
    Suddenly their existing workloads on
    50GB datasets could run comfortably on
    5TB (and then 50TB a bit later).”
    Matthew Rocklin
    Dask Creator
    Source https://coiled.io/blog/history-dask/

  9. Dask’s Features
    Overview

  10. Out-of-core computation
    Dask’s data structures are chunked or partitioned, allowing them
    to be swapped in and out of memory.
    Operations run on chunks independently and only communicate
    intermediate results when necessary.
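    The chunked model can be seen in a small sketch (the array sizes here are illustrative, not from the slides):

    ```python
    import dask.array as da

    # A 10,000 x 10,000 array of ones, split into 1,000 x 1,000 chunks;
    # each chunk is a small NumPy array that fits comfortably in memory.
    x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))
    print(x.numblocks)  # → (10, 10): 100 chunks in a 10 x 10 grid

    # Each chunk is summed independently; only the small per-chunk
    # partial sums are communicated and combined at the end.
    total = x.sum().compute()
    ```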

  11. Distributed Task Graphs
    Constructing tasks in a DAG allows tasks to be executed by a
    selection of schedulers.
    The distributed scheduler allows a
    DAG to be shared by many workers
    running over many machines to
    spread out work.
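    The same graph really can be handed to different schedulers; a minimal sketch using the built-in single-machine schedulers:

    ```python
    import dask.array as da

    # Build a task graph; nothing executes yet
    x = da.arange(1_000_000, chunks=100_000)
    total = x.sum()

    # The identical graph runs under different schedulers
    threaded = total.compute(scheduler="threads")      # local thread pool
    serial = total.compute(scheduler="synchronous")    # single thread, handy for debugging
    assert threaded == serial
    ```

    The distributed scheduler is selected the same way, by creating a `dask.distributed.Client`.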

  12. Dask accelerates the existing Python ecosystem
    Built alongside the current community
    import numpy as np
    x = np.ones((1000, 1000))
    x + x.T - x.mean(axis=0)
    import pandas as pd
    df = pd.read_csv("file.csv")
    df.groupby("x").y.mean()
    from sklearn.linear_model \
        import LogisticRegression
    lr = LogisticRegression()
    lr.fit(data, labels)
    Numpy Pandas Scikit-Learn
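    The Dask versions of the snippets above keep the same expressions; essentially only the import changes. A sketch of the NumPy case, checked against eager NumPy:

    ```python
    import numpy as np
    import dask.array as da

    # The NumPy expression from the slide, unchanged except for chunking
    x = da.ones((1000, 1000), chunks=(500, 500))
    result = (x + x.T - x.mean(axis=0)).compute()

    # Matches the eager NumPy result
    expected = np.ones((1000, 1000))
    assert np.allclose(result, expected + expected.T - expected.mean(axis=0))

    # The Pandas case works the same way ("file.csv" is a placeholder path):
    # import dask.dataframe as dd
    # df = dd.read_csv("file.csv")
    # df.groupby("x").y.mean().compute()
    ```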

  13. Custom code with delayed and futures
    import dask

    @dask.delayed
    def inc(x):
        return x + 1

    @dask.delayed
    def double(x):
        return x * 2

    @dask.delayed
    def add(x, y):
        return x + y

    data = [1, 2, 3, 4, 5]
    output = []
    for x in data:
        a = inc(x)
        b = double(x)
        c = add(a, b)
        output.append(c)
    total = dask.delayed(sum)(output)

    Dask also allows users to construct custom graphs with the
    delayed and futures APIs.
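    The loop on the slide only builds the graph; nothing runs until `.compute()` is called. A runnable version:

    ```python
    import dask

    @dask.delayed
    def inc(x):
        return x + 1

    @dask.delayed
    def double(x):
        return x * 2

    @dask.delayed
    def add(x, y):
        return x + y

    data = [1, 2, 3, 4, 5]
    output = [add(inc(x), double(x)) for x in data]
    total = dask.delayed(sum)(output)

    # inc(x) + double(x) == 3x + 1, so the total is 4 + 7 + 10 + 13 + 16
    print(total.compute())  # → 50
    ```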

  14. Dask deploys on all major resource managers
    Cloud:
    cluster = KubeCluster()
    cluster = ECSCluster()
    df = dd.read_parquet(...)
    HPC:
    cluster = PBSCluster()
    cluster = LSFCluster()
    cluster = SLURMCluster()
    df = dd.read_parquet(...)
    Hadoop/Spark:
    cluster = YarnCluster()
    df = dd.read_parquet(...)
    Cloud, HPC, or Yarn, it’s all the same to Dask

  15. Create Dask Clusters within workflows
    # Install dask-kubernetes
    $ pip install dask-kubernetes
    # Launch a cluster
    >>> from dask_kubernetes.experimental \
    ...     import KubeCluster
    >>> cluster = KubeCluster(name="demo")
    # List the DaskCluster custom resource that was created for us under the hood
    $ kubectl get daskclusters
    NAME           AGE
    demo-cluster   6m3s

  16. Monitoring our work
    # Connect a Dask client
    >>> from dask.distributed import Client
    >>> client = Client(cluster)
    # Do some computation
    >>> import dask.array as da
    >>> arr = da.random.random((10_000, 1_000, 1_000),
    ...                        chunks=(1000, 1000, 100))
    >>> result = arr.mean().compute()

  17. Why folks use Dask
    Overview

  18. Dask is already deployed on your laptop
    cluster = LocalCluster()
    df = dd.read_parquet(...)
    Laptops
    Easy to start locally…
    … and then burst out to arbitrary hardware
    Conda or pip installable, included by default in Anaconda
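    A minimal local start, using `from_pandas` with a toy frame in place of the slide's elided `read_parquet(...)` path:

    ```python
    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster()   # all workers run on this machine
    client = Client(cluster)

    # Small in-memory frame, partitioned like a larger on-disk dataset would be
    pdf = pd.DataFrame({"x": [1, 1, 2], "y": [10.0, 20.0, 30.0]})
    df = dd.from_pandas(pdf, npartitions=2)
    print(df.groupby("x").y.mean().compute())

    client.close()
    cluster.close()
    ```

    Swapping `LocalCluster` for `KubeCluster` or `SLURMCluster` is the "burst out" step; the computation code does not change.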

  19. Low barrier to distributed computing
    https://docs.dask.org/en/stable/10-minutes-to-dask.html
    Dask allows Data Scientists and other
    users of the PyData ecosystem to get up
    and running with distributed computing.
    Taking your first steps from Pandas to
    Dask Dataframe just requires the
    introduction of a few distributed computing
    concepts. No additional hardware or
    platforms necessary.

  20. Dashboard
    Dask’s dashboard gives you key
    insights into how your cluster is
    performing.
    You can view it in a browser or
    directly within Jupyter Lab to see
    how your graphs are executing.
    You can also use the built-in profiler to understand where the
    slow parts of your code are.
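    The dashboard address is exposed on the client; a quick way to find it (a local cluster is assumed here for illustration):

    ```python
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=2)
    client = Client(cluster)

    # URL of the live dashboard served by the scheduler; open it in a
    # browser, or point the Jupyter Lab extension at it
    print(client.dashboard_link)

    client.close()
    cluster.close()
    ```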

  21. Security and zero-trust
    Dask allows communication between
    components to be TLS encrypted.
    While you may pay a small performance overhead, this is a
    must-have for some folks running on shared infrastructure, such
    as multi-tenant Kubernetes clusters.
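    TLS is configured through distributed's `Security` object; a sketch, where the certificate paths are hypothetical and would come from your own CA:

    ```python
    from dask.distributed import Security

    # Certificate paths here are placeholders; generate real ones with your CA
    security = Security(
        tls_ca_file="ca.pem",
        tls_client_cert="client-cert.pem",
        tls_client_key="client-key.pem",
        require_encryption=True,   # refuse any unencrypted connection
    )

    # The same object is then passed to the client, e.g.:
    # client = Client("tls://scheduler-address:8786", security=security)
    ```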

  22. Resilience
    If a task fails to execute Dask workers will
    retry multiple times before giving up. This
    can allow Dask to complete tasks even on
    unreliable infrastructure.
    Also the centralised scheduler allows
    whole DAG sections to be recomputed if
    intermediate results are lost due to
    hardware failure.
    Folks commonly run Dask on ephemeral
    cloud infrastructure which is cheaper but
    can disappear suddenly. Thanks to Dask’s auto-healing,
    interrupted DAGs will still complete after the lost computations
    have been repeated.
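    Retries can also be requested explicitly when submitting work; a sketch with a local cluster standing in for unreliable infrastructure:

    ```python
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=2, processes=False)
    client = Client(cluster)

    # retries=3 asks the scheduler to re-run the task up to three more
    # times on failure before marking the future as errored
    future = client.submit(lambda x: x * 2, 21, retries=3)
    print(future.result())  # → 42

    client.close()
    cluster.close()
    ```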

  23. Elastic scaling
    Dask’s adaptive scaling allows a Dask
    scheduler to request additional workers
    via whatever resource manager you are
    using (Kubernetes, Cloud, etc).
    This allows computations to burst out onto
    more machines and complete the overall
    graph in less time.
    This is particularly effective when you
    have multiple people running interactive
    and embarrassingly parallel workloads on
    shared resources.
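    Adaptive scaling is switched on with a single call; the bounds below are illustrative:

    ```python
    from dask.distributed import LocalCluster

    cluster = LocalCluster(n_workers=1)
    # Let the scheduler scale between 1 and 10 workers based on load;
    # the same call works on KubeCluster, SLURMCluster, and other
    # cluster managers
    cluster.adapt(minimum=1, maximum=10)
    cluster.close()
    ```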

  24. Customisable
    Exposing Dask’s users to distributed computing
    concepts is a double-edged sword.
    It increases the learning curve, but our community has shown
    this is not a deal-breaking problem.
    Instead allowing folks access to internals means
    they can create custom graphs, modify autoscaling
    or scheduler heuristics and make Dask work in their
    existing workflows.
    While it’s totally possible to tie yourself in knots or
    produce poorly performing graphs it also allows our
    users to push Dask beyond what is possible out of
    the box.
    Before/after cluster utilization of a bespoke radio astronomy workload using the
    AutoRestrictor scheduler plugin (whitespace and red blocks is bad).
    https://github.com/dask/distributed/pull/4864

  25. Pleasant to use and adopt
    Beautiful interactive dashboard
    Builds intuition on parallel performance
    Familiar APIs and data models
    Dask looks and feels like well known libraries in the
    PyData ecosystem
    Co-developed with the ecosystem
    Built by NumPy, Pandas, and Scikit-Learn devs
    Dask complements the existing ecosystem
    Dask is designed for experts and novices alike

  26. The Dask Ecosystem
    Community

  27. Software Community
    Run by people you know, built by people you trust
    Developed: 300 contributors, 20 active maintainers
    From: Numpy, Pandas, Scikit-Learn, Jupyter, and more
    Safe: BSD-3 licensed, fiscally sponsored by NumFOCUS, community governed
    Discussed: Dask is the most common parallel framework at PyData/SciPy/PyCon conferences today
    Used: 10k weekly visitors to documentation

  28. Dask’s users

  29. Use Case: Banking with Capital One
    Capital One engineers analyze financial and credit data to build cloud-based machine learning
    models
    ● Datasets range in scale from 1 GB - 100 TB
    ● CSV and Parquet datasets with 100s - 1000s of columns
    ● Data Cleaning, Feature Selection, Engineering, Training, Validation, and Governance
    ● Dask DataFrames used to train with dask-ml and dask-xgboost
    ● 10x speed up in computational performance with Dask
    ● Faster development and improved accuracy for credit risk models
    ● Deployments on AWS can be optimized to reduce overall computing costs or to
    speed up development iterations
    Data cleaning, Feature engineering, and Machine learning

  30. Use Case: Analyze Sea Levels
    Pangeo researchers analyze simulated and
    observed climate and imaging data
    1 GB - 100 TB
    HPC and Cloud
    HDF5/NetCDF/Zarr storage
    Interactive computing with notebooks
    Includes collaborators NASA, NOAA, USGS,
    UK-Met, CSIRO, and various industries
    Learn more about Pangeo in this talk
    Sea level altitude variability over 30 years
    Columbia/NCAR leverage Dask Array to understand our planet

  31. What’s next for Dask?
    Direction

  32. Performance
    While adding more workers generally results in performance
    moving up and to the left, Dask makes many compromises in terms
    of user experience and platform support that mean it does not
    scale as well as we would like.
    Engineers from many organisations continue to work to reduce
    the impact of these compromises as much as possible.
    For many users Dask has been about making the impossible
    possible, but many are now starting to look towards making the
    impossible fast.

  33. Improving deployment
    Dask can be deployed in many places, but it requires users to
    know how to deploy it and how to manage their clusters.
    There are many tools in Dask that automate this and allow less
    experienced users to run large clusters on the cloud and beyond.
    A key next step is making these deployment mechanisms more
    robust, flexible and usable, whether that means improving the
    tools themselves or creating new tools that give users more
    context and information about their clusters.

  34. Read Documentation: docs.dask.org
    See Examples: examples.dask.org
    Engage Community: github.com/dask
