Parallelizing Your ETL with Dask on Kubeflow

Kubeflow is a popular MLOps platform built on Kubernetes for designing and running Machine Learning pipelines that train models and provide inference services. Kubeflow has a notebook service that lets you launch interactive Jupyter servers (and more) on your Kubernetes cluster. Kubeflow also has a pipelines service with a DSL library written in Python for designing and building repeatable workflows that can be executed on your cluster, either ad hoc or on a schedule. It also has tools for hyperparameter tuning and running model inference servers: everything you need to build a robust ML service.
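To make that concrete, a pipeline written with the Python DSL looks roughly like the sketch below (kfp v1-style; the step names and image are invented for illustration, and the exact API differs between kfp versions):

from kfp import dsl

@dsl.pipeline(name="example-etl", description="A toy two-step workflow")
def example_pipeline():
    # Each step runs as its own container on the Kubernetes cluster
    extract = dsl.ContainerOp(
        name="extract",
        image="python:3.9",
        command=["python", "-c", "print('extract')"],
    )
    transform = dsl.ContainerOp(
        name="transform",
        image="python:3.9",
        command=["python", "-c", "print('transform')"],
    )
    transform.after(extract)  # declare step ordering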

Dask provides advanced parallelism for Python by breaking functions into a task graph that can be evaluated by a task scheduler with many workers. This lets you utilize many processors on a single machine, or many machines in a cluster. Dask's high-level collections, including dask.dataframe and dask.array, provide familiar APIs that match Pandas, NumPy, and more, enabling folks to parallelize their existing workloads and work with larger-than-memory datasets.
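For example, a dask.dataframe computation reads almost exactly like the Pandas code it parallelizes (a minimal sketch; the "accounts" dataset and the column names are made up for illustration):

import dask.dataframe as dd

# Looks like Pandas, but the work is split into tasks and run by Dask's scheduler
df = dd.read_parquet("accounts")          # lazily reads a (possibly larger-than-memory) dataset
result = df.groupby("id").balance.mean()  # builds a task graph, nothing computed yet
print(result.compute())                   # executes the graph on the available workers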

The Kubeflow Pipelines DSL provides the ability to parallelize your workload and run many steps concurrently. But what about parallelism in your interactive sessions? Or leveraging existing parallelism capabilities from Dask at the Python level? Can Dask help users leverage all of the hardware resources in their Kubeflow cluster?

These questions led the maintainers of Dask's Kubernetes tooling to build a new cluster manager that empowers folks to get the best out of Dask on their Kubeflow clusters, both interactively and within pipelines.

With the new Dask Operator installed on your Kubeflow cluster, users can conveniently launch Dask clusters from within their interactive Jupyter sessions and burst beyond the resources of the Jupyter container. Dask clusters can also be launched as part of a pipeline workflow where each step of the pipeline can utilize the resources provided by Dask, even persisting data in memory between steps for powerful performance gains.

In this talk, we will cover Dask’s new Kubernetes Operator, installing it on your Kubeflow cluster, and show examples of leveraging it in interactive sessions and scheduled workflows.

What You Will Learn:

Data Scientists commonly use Python tools like Pandas on their laptops with CPU compute. Production systems are usually distributed multi-node GPU setups. Dask is an open source Python library that takes the pain out of scaling up from laptop to production.

Technical Level: 5

Jacob Tomlinson

June 08, 2022

Transcript

  1. Parallelizing Your ETL with Dask
    on Kubeflow
    Jacob Tomlinson
    Dask core maintainer
    Senior Software Engineer at NVIDIA

  2. Jacob Tomlinson
    Senior Software Engineer
    NVIDIA

  3. Session outline
    Introduction: Dask on KubeFlow
    What is Dask?
    Enhancing KubeFlow with Dask
    Dask Kubernetes Operator
    Break
    Deep-Dive: Dask Fundamentals
    DataFrames
    Dashboard
    Break
    Arrays
    Machine Learning
    Break
    Bags and Futures
    Distributed and deployment
    Wrap up
    Timings: four 25 min sessions, each followed by a 5 min break

  4. Dask on KubeFlow
    Introduction

  5. What is Dask?

  6. Problem: Python is powerful, but doesn’t scale well
    Powerful: Leading platform today for analytics
    Limited: Fails for big data or scalable computing
    Frustration: Alternatives fracture development
    Python is great for medium data
    >>> import pandas as pd
    >>> df = pd.read_parquet("accounts")
    MemoryError

  7. General purpose Python library for parallelism
    Scales existing libraries, like Numpy, Pandas, and Scikit-Learn
    Flexible enough to build complex and custom systems
    Accessible for beginners, secure and trusted for institutions

  8. Dask accelerates the existing Python ecosystem
    Built alongside the current community
    import numpy as np
    x = np.ones((1000, 1000))
    x + x.T - x.mean(axis=0)
    import pandas as pd
    df = pd.read_csv("file.csv")
    df.groupby("x").y.mean()
    from sklearn.linear_model \
    import LogisticRegression
    lr = LogisticRegression()
    lr.fit(data, labels)
    Numpy Pandas Scikit-Learn
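    The slide shows the familiar NumPy, Pandas, and Scikit-Learn code; for comparison, a rough sketch of the Dask equivalents looks nearly identical (assuming dask and dask-ml are installed; data and labels are placeholders as above):

    import dask.array as da
    x = da.ones((1000, 1000), chunks=(100, 100))
    (x + x.T - x.mean(axis=0)).compute()

    import dask.dataframe as dd
    df = dd.read_csv("file.csv")
    df.groupby("x").y.mean().compute()

    from dask_ml.linear_model import LogisticRegression
    lr = LogisticRegression()
    lr.fit(data, labels)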

  9. Use Case: Banking with Capital One
    Capital One engineers analyze financial and credit data to build cloud-based machine learning
    models
    ● Datasets range in scale from 1 GB - 100 TB
    ● CSV and Parquet datasets with 100s - 1000s of columns
    ● Data Cleaning, Feature Selection, Engineering, Training, Validation, and Governance
    ● Dask DataFrames used to train with dask-ml and dask-xgboost
    ● 10x speed up in computational performance with Dask
    ● Faster development and improved accuracy for credit risk models
    ● Deployments on AWS can be optimized to reduce overall computing costs or enable faster development iterations
    Data cleaning, Feature engineering, and Machine learning

  10. Use Case: Analyze Sea Levels
    Pangeo researchers analyze simulated and
    observed climate and imaging data
    1 GB - 100 TB
    HPC and Cloud
    HDF5/NetCDF/Zarr storage
    Interactive computing with notebooks
    Includes collaborators NASA, NOAA, USGS,
    UK-Met, CSIRO, and various industries
    Learn more about Pangeo in this talk
    Sea level altitude variability over 30 years
    Columbia/NCAR leverage Dask Array to understand our planet

  11. Dask deploys on all major resource managers
    Cloud, HPC, or Yarn, it’s all the same to Dask

    Cloud:
    cluster = KubeCluster()
    cluster = ECSCluster()
    df = dd.read_parquet(...)

    HPC:
    cluster = PBSCluster()
    cluster = LSFCluster()
    cluster = SLURMCluster()
    df = dd.read_parquet(...)

    Hadoop/Spark:
    cluster = YarnCluster()
    df = dd.read_parquet(...)
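    These cluster managers live in Dask's deployment sub-projects; a sketch of the imports (the package names are real, but check each project's docs for current module paths):

    # Kubernetes and AWS ECS (dask-kubernetes, dask-cloudprovider)
    from dask_kubernetes import KubeCluster
    from dask_cloudprovider.aws import ECSCluster

    # HPC job schedulers (dask-jobqueue)
    from dask_jobqueue import PBSCluster, LSFCluster, SLURMCluster

    # Hadoop / YARN (dask-yarn)
    from dask_yarn import YarnCluster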

  12. cluster = LocalCluster()
    df = dd.read_parquet(...)
    Dask is already deployed on your laptop
    Laptops
    Easy to start locally…
    … and then scale out to arbitrary hardware
    Conda or pip installable, included by default in Anaconda
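    A quick aside on that install step (either command should work; the extras and channel are up to you):

    # with pip
    $ python -m pip install "dask[complete]"
    # or with conda
    $ conda install dask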

  13. Pleasant to use and adopt
    Beautiful interactive dashboard
    Builds intuition on parallel performance
    Familiar APIs and data models
    Dask looks and feels like well known libraries
    Co-developed with the ecosystem
    Built by NumPy, Pandas, and Scikit-Learn devs
    Dask complements the existing ecosystem
    Dask is designed for experts and novices alike

  14. Software Community
    Run by people you know, built by people you trust
    Developed: 300 contributors, 20 active maintainers
    From: Numpy, Pandas, Scikit-Learn, Jupyter, and more
    Safe: BSD-3 Licensed, fiscally sponsored by NumFOCUS, community governed
    Discussed: Dask is the most common parallel framework at PyData/SciPy/PyCon conferences today.
    Used: 10k weekly visitors to documentation

  15. Enhancing KubeFlow
    with Dask

  16. Typical ML workflows

  17. ETL is commonly done in notebooks
    This means your ETL is
    confined to a single pod.
    But what if it wasn’t?

  18. Dask Kubernetes
    Operator

  19. The Dask Operator runs on your
    Kubernetes cluster and allows you
    to create and manage your Dask
    clusters as Kubernetes resources.

  20. # Install the Custom Resource Definitions and Operator Deployment
    $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskcluster.yaml
    $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskworkergroup.yaml
    $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskjob.yaml
    $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/operator.yaml
    # Patch KubeFlow permissions to allow users to create Dask clusters
    $ kubectl patch clusterrole kubeflow-kubernetes-edit --patch '{"rules": [{"apiGroups": ["kubernetes.dask.org"],"resources": ["*"],"verbs": ["*"]}, …]}'
    # Check that we can list daskcluster resources
    $ kubectl get daskclusters
    No resources found in default namespace.
    # Check that the operator pod is running
    $ kubectl get pods -A -l application=dask-kubernetes-operator
    NAMESPACE NAME READY STATUS RESTARTS AGE
    dask-operator dask-kubernetes-operator-775b8bbbd5-zdrf7 1/1 Running 0 74s
    # 🚀 done!
    Installing the operator

  21. Creating Dask Clusters within notebooks
    # Install dask-kubernetes
    $ pip install dask-kubernetes
    # Launch a cluster
    >>> from dask_kubernetes.experimental \
    import KubeCluster
    >>> cluster = KubeCluster(name="demo")
    # List the DaskCluster custom resource that was created for us under the hood
    $ kubectl get daskclusters
    NAME AGE
    demo-cluster 6m3s

  22. Scaling our cluster
    # Scale up our cluster
    >>> cluster.scale(5)
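    Under the hood the operator adds worker pods to match; one way to watch them appear (a sketch, since exact pod names depend on the operator version):

    # Each worker is just a pod, so standard kubectl tooling shows them coming up
    $ kubectl get pods | grep demo
    # (one scheduler pod plus five worker pods)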

  23. Doing some work
    # Connect a Dask client
    >>> from dask.distributed import Client
    >>> client = Client(cluster)
    # Do some computation
    >>> import dask.array as da
    >>> arr = da.random.random((10_000, 1_000, 1_000),
    ...                        chunks=(1000, 1000, 100))
    >>> result = arr.mean().compute()

  24. YAML resources
    # cluster.yaml
    apiVersion: kubernetes.dask.org/v1
    kind: DaskCluster
    metadata:
      name: simple-cluster
    spec:
      worker:
        replicas: 3
        spec:
          containers:
            - name: worker
              image: "ghcr.io/dask/dask:latest"
              imagePullPolicy: "IfNotPresent"
              args:
                - dask-worker
                - --name
                - $(DASK_WORKER_NAME)
      scheduler:
        spec:
          containers:
            - name: scheduler
              image: "ghcr.io/dask/dask:latest"
              imagePullPolicy: "IfNotPresent"
              args:
                - dask-scheduler
              ports:
                - name: tcp-comm
                  containerPort: 8786
                  protocol: TCP
                - name: http-dashboard
                  containerPort: 8787
                  protocol: TCP
              readinessProbe:
                httpGet:
                  port: http-dashboard
                  path: /health
                initialDelaySeconds: 5

    The Dask Operator has three custom resource types that you can create via kubectl:
    ● DaskCluster to create whole clusters.
    ● DaskWorkerGroup to create additional groups of workers with various configurations (high memory, GPUs, etc.); see the sketch below.
    ● DaskJob to run end-to-end tasks like a Kubernetes Job, but with an adjacent Dask cluster.
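    For example, a DaskWorkerGroup manifest might look roughly like this (a hedged sketch: the high-memory group name and memory request are invented, and the field layout should be checked against the dask-kubernetes docs):

    # highmem-workers.yaml
    apiVersion: kubernetes.dask.org/v1
    kind: DaskWorkerGroup
    metadata:
      name: simple-cluster-highmem
    spec:
      cluster: simple-cluster        # attach to the DaskCluster defined above
      worker:
        replicas: 2
        spec:
          containers:
            - name: worker
              image: "ghcr.io/dask/dask:latest"
              args:
                - dask-worker
                - --name
                - $(DASK_WORKER_NAME)
              resources:
                requests:
                  memory: "16Gi"     # the point of this group: bigger-memory workers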

  25. ● Create Dask clusters in Python or with YAML
    ● Create multiple worker groups with different shape Pods
    ● Run batch style jobs with DaskJob resources
    ● Scale workers up and down (autoscaling coming very soon)
    ● See your Dask clusters at a glance with kubectl
    ● Quickly and easily clean up unused resources (see the sketch below)
    Dask Operator Features
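    On that last point, cleanup is quick either way (a sketch; demo-cluster is the example resource created earlier):

    # From the shell: deleting the custom resource removes the scheduler and worker pods
    $ kubectl delete daskcluster demo-cluster

    # Or from Python, close the cluster object you created
    >>> cluster.close()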

  26. Quick break
    Back soon

  27. Dask Fundamentals
    Deep Dive

  28. Resources
    github.com/jacobtomlinson/dask-video-tutorial

  29. Quick break
    Back soon

  30. Read Documentation: docs.dask.org
    See Examples: examples.dask.org
    Engage Community: github.com/dask
