Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Parallelizing Your ETL with Dask on Kubeflow

Parallelizing Your ETL with Dask on Kubeflow

Kubeflow is a popular MLOps platform built on Kubernetes for designing and running Machine Learning pipelines for training models and providing inference services. Kubeflow has a notebook service that lets you launch interactive Jupyter servers (and more) on your Kubernetes cluster. Kubeflow also has a pipelines service with a DSL library written in Python for designing and building repeatable workflows that can be executed on your cluster, either ad-hoc or on a schedule. It also has tools for hyperparameter tuning and running model inference servers, everything you need to build a robust ML service.

Dask provides advanced parallelism for Python by breaking functions into a task graph that can be evaluated by a task scheduler that has many workers. This allows you to utilize many processors on a single machine, or many machines in a cluster. Dask’s many high-level collections APIs including dask.dataframe and dask.array provide familiar APIs that match Pandas, NumPy and more to enable folks to parallelize their existing workloads and work with larger than memory datasets.

The Kubeflow Pipelines DSL provides the ability to parallelize your workload and run many steps concurrently. But what about parallelism in your interactive sessions? Or leveraging existing parallelism capabilities from Dask at the Python level? Can Dask help users leverage all of the hardware resources in their Kubeflow cluster?

These questions lead the maintainers of Dask’s Kubernetes tooling to build a new cluster manager to empower folks to get the best out of Dask on their Kubeflow clusters, both interactively and within pipelines.

With the new Dask Operator installed on your Kubeflow cluster, users can conveniently launch Dask clusters from within their interactive Jupyter sessions and burst beyond the resources of the Jupyter container. Dask clusters can also be launched as part of a pipeline workflow where each step of the pipeline can utilize the resources provided by Dask, even persisting data in memory between steps for powerful performance gains.

In this talk, we will cover Dask’s new Kubernetes Operator, installing it on your Kubeflow cluster, and show examples of leveraging it in interactive sessions and scheduled workflows.

What You Will Learn:

Data Scientists commonly use Python tools like Pandas on their laptops with CPU compute. Production systems are usually distributed multi-node GPU setups. Dask is an open source Python library that takes the pain out of scaling up from laptop to production.

Technical Level: 5

Jacob Tomlinson

June 08, 2022

More Decks by Jacob Tomlinson

Other Decks in Technology


  1. Parallelizing Your ETL with Dask on Kubeflow Jacob Tomlinson Dask

    core maintainer Senior Software Engineer at NVIDIA
  2. Session outline Introduction: Dask on KubeFlow What is Dask? Enhancing

    KubeFlow with Dask Dask Kubernetes Operator Break Deep-Dive: Dask Fundamentals DataFrames Dashboard Break Arrays Machine Learning Break Bags and Futures Distributed and deployment Wrap up 25 mins 5 mins 25 mins 5 mins 25 mins 5 mins 25 mins 5 mins
  3. Powerful: Leading platform today for analytics Limited: Fails for big

    data or scalable computing Frustration: Alternatives fracture development Problem: Python is powerful, but doesn’t scale well Python is great for medium data >>> import pandas as pd >>> df = pd.read_parquet(“accounts”) MemoryError
  4. General purpose Python library for parallelism Scales existing libraries, like

    Numpy, Pandas, and Scikit-Learn Flexible enough to build complex and custom systems Accessible for beginners, secure and trusted for institutions
  5. Dask accelerates the existing Python ecosystem Built alongside with the

    current community import numpy as np x = np.ones((1000, 1000)) x + x.T - x.mean(axis=0 import pandas as pd df = pd.read_csv(“file.csv”) df.groupby(“x”).y.mean() from scikit_learn.linear_model \ import LogisticRegression lr = LogisticRegression() lr.fit(data, labels) Numpy Pandas Scikit-Learn
  6. Use Case: Banking with Capital One Capital One engineers analyze

    financial and credit data to build cloud-based machine learning models • Datasets range in scale from 1 GB - 100 TB • CSV and Parquet datasets 100s - 1000s of columns • Data Cleaning, Feature Selection, Engineering, Training, Validation, and Governance • Dask DataFrames used to train with dask-ml and dask-xgboost • 10x speed up in computational performance with Dask • Faster development and improved accuracy for credit risk models • Deployments on AWS can be optimized to reduce overall computing costs or faster development iterations Data cleaning, Feature engineering, and Machine learning
  7. Use Case: Analyze Sea Levels Pangeo researchers analyze simulated and

    observed climate and imaging data 1 GB - 100 TB HPC and Cloud HDF5/NetCDF/Zarr storage Interactive computing with notebooks Includes collaborators NASA, NOAA, USGS, UK-Met, CSIRO, and various industries Learn more about Pangeo in this talk Sea level altitude variability over 30 years Columbia/NCAR leverage Dask Array to understand our planet
  8. cluster = KubeCluster() cluster = ECSCluster() df = dd.read_parquet(...) cluster

    = PBSCluster() cluster = LSFCluster() cluster = SLURMCluster() … df = dd.read_parquet(...) cluster = YarnCluster() df = dd.read_parquet(...) Dask deploys on all major resource managers Cloud HPC Hadoop/Spark Cloud, HPC, or Yarn, it’s all the same to Dask
  9. cluster = LocalCluster() df = dd.read_parquet(...) Dask is already deployed

    on your laptop Laptops Easy to start locally… … and then scale out to arbitrary hardware Conda or pip installable, included by default in Anaconda
  10. Pleasant to use and adopt Beautiful interactive dashboard Builds intuition

    on parallel performance Familiar APIs and data models Dask looks and feels like well known libraries Co-developed with the ecosystem Built by NumPy, Pandas, and Scikit-Learn devs Dask complements the existing ecosystem Dask is designed for experts and novices alike
  11. Software Community Developed: 300 contributors, 20 active maintainers From: Numpy,

    Pandas, Scikit-Learn, Jupyter, and more Run by people you know, built by people you trust Safe: BSD-3 Licensed, fiscally sponsored by NumFOCUS, community governed Discussed: Dask is the most common parallel framework at PyData/SciPy/PyCon conferences today. Used: 10k weekly visitors to documentation 14
  12. ETL is commonly done in notebooks This means your ETL

    is confined to a single pod. But what if it wasn’t?
  13. The Dask Operator runs on your Kubernetes cluster and allows

    you to create and manage your Dask clusters as Kubernetes resources.
  14. # Install the Custom Resource Definitions and Operator Deployment $

    kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskcluster.yaml $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskworkergroup.yaml $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskjob.yaml $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/operator.yaml # Patch KubeFlow permissions to allow users to create Dask clusters $ kubectl patch clusterrole kubeflow-kubernetes-edit --patch '{"rules": [{"apiGroups": ["kubernetes.dask.org"],"resources": ["*"],"verbs": ["*"]}, …]}' # Check that we can list daskcluster resources $ kubectl get daskclusters No resources found in default namespace. # Check that the operator pod is running $ kubectl get pods -A -l application=dask-kubernetes-operator NAMESPACE NAME READY STATUS RESTARTS AGE dask-operator dask-kubernetes-operator-775b8bbbd5-zdrf7 1/1 Running 0 74s # 🚀 done! Installing the operator
  15. Creating Dask Clusters within notebooks # Install dask-kubernetes $ pip

    install dask-kubernetes # Launch a cluster >>> from dask_kubernetes.experimental \ import KubeCluster >>> cluster = KubeCluster(name="demo") # List the DaskCluster custom resource that was created for us under the hood $ kubectl get daskclusters NAME AGE demo-cluster 6m3s
  16. Doing some work # Connect a Dask client >>> from

    dask.distributed import Client >>> client = Client(cluster) # Do come computation >>> import dask.array as da >>> arr = da.random.random((10_000, 1_000, 1_000), chunks=(1000, 1000, 100)) >>> result = arr.mean().compute()
  17. YAML resources # cluster.yaml apiVersion: kubernetes.dask.org/v1 kind: DaskCluster metadata: name:

    simple-cluster spec: worker: replicas: 3 spec: containers: - name: worker image: "ghcr.io/dask/dask:latest" imagePullPolicy: "IfNotPresent" args: - dask-worker - --name - $(DASK_WORKER_NAME) scheduler: spec: containers: - name: scheduler image: "ghcr.io/dask/dask:latest" imagePullPolicy: "IfNotPresent" args: - dask-scheduler ports: - name: tcp-comm containerPort: 8786 protocol: TCP - name: http-dashboard containerPort: 8787 protocol: TCP readinessProbe: httpGet: port: http-dashboard path: /health initialDelaySeconds: 5 … The Dask Operator has three custom resource types that you can create via kubectl. • DaskCluster to create whole clusters. • DaskWorkerGroup to create additional groups of workers with various configurations (high memory, GPUs, etc). • DaskJob to run end-to-end tasks like a Kubernetes Job but with an adjacent Dask Cluster.
  18. • Create Dask clusters in Python or with YAML •

    Create multiple worker groups with different shape Pods • Run batch style jobs with DaskJob resources • Scale workers up and down (autoscaling coming very soon) • See your Dask clusters at a glance with kubectl • Quickly and easily clean up unused resources Dask Operator Features