Parallelizing Your ETL with Dask on Kubeflow

Kubeflow is a popular MLOps platform built on Kubernetes for designing and running machine learning pipelines for training models and providing inference services. Kubeflow has a notebook service that lets you launch interactive Jupyter servers (and more) on your Kubernetes cluster, and a pipelines service with a DSL library written in Python for designing and building repeatable workflows that can be executed on your cluster, either ad hoc or on a schedule. It also has tools for hyperparameter tuning and running model inference servers: everything you need to build a robust ML service.

Dask provides advanced parallelism for Python by breaking work into a task graph that a scheduler evaluates across many workers. This lets you use all the processors on a single machine, or many machines in a cluster. Dask’s high-level collections, including dask.dataframe and dask.array, provide familiar APIs that match Pandas, NumPy, and more, enabling folks to parallelize their existing workloads and work with larger-than-memory datasets.
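
To make that concrete, here is a minimal sketch (the file and column names are hypothetical):

   import dask.dataframe as dd
   import dask.array as da

   # dask.dataframe mirrors the pandas API, but splits the data into
   # partitions that many workers can process in parallel.
   df = dd.read_csv("transactions-*.csv")        # hypothetical file glob
   means = df.groupby("account").amount.mean()

   # dask.array mirrors NumPy, chunking one large array into many small ones.
   x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

   # Nothing runs until you ask for results; compute() evaluates the task graph.
   print(means.compute())
   print(x.mean().compute())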

The Kubeflow Pipelines DSL provides the ability to parallelize your workload and run many steps concurrently. But what about parallelism in your interactive sessions? Or leveraging existing parallelism capabilities from Dask at the Python level? Can Dask help users leverage all of the hardware resources in their Kubeflow cluster?
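
For contrast, here is a rough sketch of step-level parallelism with the Kubeflow Pipelines v1 SDK (the pipeline name, image, and script are hypothetical):

   from kfp import dsl

   @dsl.pipeline(name="parallel-etl")
   def parallel_etl():
       # Fan one step out over several inputs; each iteration runs in its own pod.
       with dsl.ParallelFor(["2020", "2021", "2022"]) as year:
           dsl.ContainerOp(
               name="clean",
               image="python:3.9",                        # hypothetical image
               command=["python", "clean.py", "--year"],  # hypothetical script
               arguments=[year],
           )

This parallelism happens at the pod level; the questions above are about parallelism inside each step’s Python code, which is where Dask comes in.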

These questions led the maintainers of Dask’s Kubernetes tooling to build a new cluster manager that helps folks get the best out of Dask on their Kubeflow clusters, both interactively and within pipelines.

With the new Dask Operator installed on your Kubeflow cluster, users can conveniently launch Dask clusters from within their interactive Jupyter sessions and burst beyond the resources of the Jupyter container. Dask clusters can also be launched as part of a pipeline workflow where each step of the pipeline can utilize the resources provided by Dask, even persisting data in memory between steps for powerful performance gains.
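
As a rough sketch of that last point (the scheduler address, storage path, column, and dataset name below are all hypothetical), one pipeline step can persist a dataframe on a long-lived operator-managed cluster and a later step can pick it up without re-reading it from storage:

   from dask.distributed import Client
   import dask.dataframe as dd

   # Step 1: load a dataset, keep it in worker memory, and give it a name
   client = Client("tcp://demo-cluster-scheduler:8786")  # hypothetical address
   df = dd.read_parquet("s3://bucket/accounts")          # hypothetical path
   df = df.persist()
   client.publish_dataset(accounts=df)

   # Step 2 (a separate process, later in the pipeline): reuse the same data
   client = Client("tcp://demo-cluster-scheduler:8786")
   df = client.get_dataset("accounts")
   print(df.balance.mean().compute())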

In this talk, we will cover Dask’s new Kubernetes Operator, show how to install it on your Kubeflow cluster, and demonstrate how to leverage it in interactive sessions and scheduled workflows.

What You Will Learn:

Data Scientists commonly use Python tools like Pandas on their laptops with CPU compute. Production systems are usually distributed multi-node GPU setups. Dask is an open source Python library that takes the pain out of scaling up from laptop to production.

Technical Level: 5

Jacob Tomlinson

June 08, 2022

Transcript

1. Parallelizing Your ETL with Dask on Kubeflow
   Jacob Tomlinson, Dask core maintainer, Senior Software Engineer at NVIDIA

2. Session outline
   Introduction: Dask on KubeFlow (What is Dask?, Enhancing KubeFlow with Dask, Dask Kubernetes Operator) - 25 mins
   Break - 5 mins
   Deep-Dive: Dask Fundamentals (DataFrames, Dashboard) - 25 mins
   Break - 5 mins
   Arrays, Machine Learning - 25 mins
   Break - 5 mins
   Bags and Futures, Distributed and deployment - 25 mins
   Wrap up - 5 mins

3. Problem: Python is powerful, but doesn’t scale well
   Powerful: the leading platform today for analytics.
   Limited: fails for big data or scalable computing.
   Frustration: alternatives fracture development.
   Python is great for medium data:

      >>> import pandas as pd
      >>> df = pd.read_parquet("accounts")
      MemoryError

4. General purpose Python library for parallelism
   Scales existing libraries, like NumPy, Pandas, and Scikit-Learn.
   Flexible enough to build complex and custom systems.
   Accessible for beginners, secure and trusted for institutions.

5. Dask accelerates the existing Python ecosystem
   Built alongside the current community.

   NumPy:

      import numpy as np
      x = np.ones((1000, 1000))
      x + x.T - x.mean(axis=0)

   Pandas:

      import pandas as pd
      df = pd.read_csv("file.csv")
      df.groupby("x").y.mean()

   Scikit-Learn:

      from sklearn.linear_model import LogisticRegression
      lr = LogisticRegression()
      lr.fit(data, labels)
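
   For comparison, here is a sketch of the Dask equivalents: little more than the import changes. The file here matches the hypothetical one above, and data and labels are stand-ins generated so the snippet runs end to end.

      import dask.array as da
      x = da.ones((1000, 1000))
      (x + x.T - x.mean(axis=0)).compute()

      import dask.dataframe as dd
      df = dd.read_csv("file.csv")           # same hypothetical file as above
      df.groupby("x").y.mean().compute()

      # dask-ml mirrors the scikit-learn estimator API
      from dask_ml.linear_model import LogisticRegression
      data = da.random.random((1000, 10), chunks=(100, 10))            # stand-in features
      labels = (da.random.random(1000, chunks=100) > 0.5).astype(int)  # stand-in labels
      lr = LogisticRegression()
      lr.fit(data, labels)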

6. Use Case: Banking with Capital One
   Capital One engineers analyze financial and credit data to build cloud-based machine learning models, covering data cleaning, feature engineering, and machine learning.
   • Datasets range in scale from 1 GB to 100 TB
   • CSV and Parquet datasets with 100s to 1000s of columns
   • Data cleaning, feature selection, engineering, training, validation, and governance
   • Dask DataFrames used to train with dask-ml and dask-xgboost
   • 10x speed-up in computational performance with Dask
   • Faster development and improved accuracy for credit risk models
   • Deployments on AWS can be optimized to reduce overall computing costs or speed up development iterations

7. Use Case: Analyze Sea Levels
   Columbia/NCAR leverage Dask Array to understand our planet: Pangeo researchers analyze simulated and observed climate and imaging data, such as sea level altitude variability over 30 years.
   • 1 GB to 100 TB datasets
   • HPC and cloud
   • HDF5/NetCDF/Zarr storage
   • Interactive computing with notebooks
   • Collaborators include NASA, NOAA, USGS, UK-Met, CSIRO, and various industries
   Learn more about Pangeo in this talk.

8. Dask deploys on all major resource managers
   Cloud, HPC, or Yarn: it’s all the same to Dask.

   Cloud:

      cluster = KubeCluster()
      cluster = ECSCluster()
      df = dd.read_parquet(...)

   HPC:

      cluster = PBSCluster()
      cluster = LSFCluster()
      cluster = SLURMCluster()
      …
      df = dd.read_parquet(...)

   Hadoop/Spark:

      cluster = YarnCluster()
      df = dd.read_parquet(...)

9. Dask is already deployed on your laptop

      cluster = LocalCluster()
      df = dd.read_parquet(...)

   Easy to start locally… and then scale out to arbitrary hardware. Conda or pip installable, included by default in Anaconda.
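
   A complete local example is only a few lines (a sketch; the parquet path is hypothetical):

      from dask.distributed import Client, LocalCluster
      import dask.dataframe as dd

      cluster = LocalCluster()       # scheduler plus workers on this machine
      client = Client(cluster)
      print(client.dashboard_link)   # link to the interactive dashboard

      df = dd.read_parquet("accounts.parquet")  # hypothetical path
      print(df.head())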

10. Dask is designed for experts and novices alike
    Pleasant to use and adopt: a beautiful interactive dashboard builds intuition on parallel performance.
    Familiar APIs and data models: Dask looks and feels like well-known libraries.
    Co-developed with the ecosystem: built by NumPy, Pandas, and Scikit-Learn devs, Dask complements the existing ecosystem.

11. Software Community
    Run by people you know, built by people you trust.
    Developed: 300 contributors, 20 active maintainers, from NumPy, Pandas, Scikit-Learn, Jupyter, and more.
    Safe: BSD-3 licensed, fiscally sponsored by NumFOCUS, community governed.
    Discussed: Dask is the most common parallel framework at PyData/SciPy/PyCon conferences today.
    Used: 10k weekly visitors to documentation.

12. ETL is commonly done in notebooks. This means your ETL is confined to a single pod. But what if it wasn’t?

13. The Dask Operator runs on your Kubernetes cluster and allows you to create and manage your Dask clusters as Kubernetes resources.

14. Installing the operator

      # Install the Custom Resource Definitions and Operator Deployment
      $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskcluster.yaml
      $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskworkergroup.yaml
      $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskjob.yaml
      $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/operator.yaml

      # Patch KubeFlow permissions to allow users to create Dask clusters
      $ kubectl patch clusterrole kubeflow-kubernetes-edit --patch '{"rules": [{"apiGroups": ["kubernetes.dask.org"],"resources": ["*"],"verbs": ["*"]}, …]}'

      # Check that we can list daskcluster resources
      $ kubectl get daskclusters
      No resources found in default namespace.

      # Check that the operator pod is running
      $ kubectl get pods -A -l application=dask-kubernetes-operator
      NAMESPACE       NAME                                        READY   STATUS    RESTARTS   AGE
      dask-operator   dask-kubernetes-operator-775b8bbbd5-zdrf7   1/1     Running   0          74s

      # 🚀 done!

15. Creating Dask Clusters within notebooks

      # Install dask-kubernetes
      $ pip install dask-kubernetes

      # Launch a cluster
      >>> from dask_kubernetes.experimental import KubeCluster
      >>> cluster = KubeCluster(name="demo")

      # List the DaskCluster custom resource that was created for us under the hood
      $ kubectl get daskclusters
      NAME           AGE
      demo-cluster   6m3s
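
   Once the cluster exists, it can be resized and torn down from the same notebook (a sketch using the KubeCluster object created above):

      # Add or remove worker pods to match the requested size
      cluster.scale(10)

      # Shut the cluster down and delete its Kubernetes resources when done
      cluster.close()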

16. Doing some work

      # Connect a Dask client
      >>> from dask.distributed import Client
      >>> client = Client(cluster)

      # Do some computation
      >>> import dask.array as da
      >>> arr = da.random.random((10_000, 1_000, 1_000), chunks=(1000, 1000, 100))
      >>> result = arr.mean().compute()

17. YAML resources
    The Dask Operator has three custom resource types that you can create via kubectl:
    • DaskCluster to create whole clusters.
    • DaskWorkerGroup to create additional groups of workers with various configurations (high memory, GPUs, etc.).
    • DaskJob to run end-to-end tasks like a Kubernetes Job, but with an adjacent Dask cluster.

      # cluster.yaml
      apiVersion: kubernetes.dask.org/v1
      kind: DaskCluster
      metadata:
        name: simple-cluster
      spec:
        worker:
          replicas: 3
          spec:
            containers:
              - name: worker
                image: "ghcr.io/dask/dask:latest"
                imagePullPolicy: "IfNotPresent"
                args:
                  - dask-worker
                  - --name
                  - $(DASK_WORKER_NAME)
        scheduler:
          spec:
            containers:
              - name: scheduler
                image: "ghcr.io/dask/dask:latest"
                imagePullPolicy: "IfNotPresent"
                args:
                  - dask-scheduler
                ports:
                  - name: tcp-comm
                    containerPort: 8786
                    protocol: TCP
                  - name: http-dashboard
                    containerPort: 8787
                    protocol: TCP
                readinessProbe:
                  httpGet:
                    port: http-dashboard
                    path: /health
                  initialDelaySeconds: 5
      …
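
   Creating a cluster from this manifest is the usual kubectl workflow:

      # Create the cluster from the manifest, then inspect the resource
      $ kubectl apply -f cluster.yaml
      $ kubectl get daskclusters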

18. Dask Operator features
    • Create Dask clusters in Python or with YAML
    • Create multiple worker groups with different shape Pods
    • Run batch-style jobs with DaskJob resources
    • Scale workers up and down (autoscaling coming very soon)
    • See your Dask clusters at a glance with kubectl
    • Quickly and easily clean up unused resources
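
   That last point follows from clusters being ordinary Kubernetes resources: deleting the DaskCluster resource removes its scheduler and worker pods with it. For example, for the "demo" cluster created earlier:

      $ kubectl delete daskcluster demo-cluster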