
Running Batch Workload on K8s at Scale


We live in the age of data. We generate and consume more data than ever before, but data is useless without context. We run ETL jobs to process it, run HPC workloads to analyze it, and build machine learning models to automate processes and make better decisions. Batch processing enables all of this; it is the backbone of data science. But we need to rethink how we approach batch processing in order to take advantage of modern hardware, containers, and cloud infrastructure.

In Kubernetes we can run Jobs and CronJobs quite easily, but that alone is not enough to run batch workloads at scale. We need to consider scalability, cost optimization, and performance, and we need to think about the day-2 operations of managing and upgrading the platform.

In this talk we will discuss how to run batch workloads on k8s at scale. We will cover what types of workloads are good candidates for running on k8s, how to design them to be easy to manage and scale, and common pitfalls to avoid.

Mofizur Rahman

March 11, 2023

Transcript

  1. Who Needs Batch • Data-intensive tasks • Scientific research • Parallelizable workloads (ML model training, data processing)
  2. Types of Batch Workload • ETL pipelines • ML model training • HPC workloads • Data analytics
  3. Why do we use Batch • Cost • Speed • Reliability • Repeatability
  4. Components (diagram): Scheduler, Monitoring System, Job Queue, Worker nodes, Job, INput shared storage, OUTput shared storage
  5. Components (diagram, now with autoscaling): Scheduler, Autoscaling, Monitoring System, Job Queue, Worker nodes, Job, INput shared storage, OUTput shared storage
  6. Why use Kubernetes • Resource management • Scalability • Fault tolerance • Monitoring and logging • Portability
  7. Why use Kubernetes • Resource management • Scalability • Fault tolerance • Monitoring and logging • Portability • Other workloads are already on k8s and you are trying to normalize your platform to one thing
  8. Job

  9. What is a job? • Computations that run to completion • A group of pods that run independently or collaboratively to process a task
  10. What is a job? • Computations that run to completion • A group of pods that run independently or collaboratively to process a task • Often flexible on time, location and/or types of resources
  11. Job / Batch APIs. A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate.

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: pi
      spec:
        template:
          spec:
            containers:
            - name: pi
              image: perl:5.34.0
              command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
            restartPolicy: Never
        backoffLimit: 4
  12. CronJob / Batch APIs. A CronJob creates one or more Pods and will continue to attempt execution of the Pods on a specified schedule until the CronJob object is removed.

      apiVersion: batch/v1
      kind: CronJob
      metadata:
        name: hello
      spec:
        schedule: "* * * * *"
        jobTemplate:
          spec:
            template:
              spec:
                containers:
                - name: hello
                  image: busybox:1.28
                  imagePullPolicy: IfNotPresent
                  command:
                  - /bin/sh
                  - -c
                  - date; echo Hello from the Kubernetes cluster
                restartPolicy: OnFailure
  13. Cron Schedule

      ┌────────── minute (0 - 59)
      │ ┌───────── hour (0 - 23)
      │ │ ┌──────── day of the month (1 - 31)
      │ │ │ ┌─────── month (1 - 12)
      │ │ │ │ ┌────── day of the week (0 - 6) (Sunday to Saturday;
      │ │ │ │ │        7 is also Sunday on some systems)
      │ │ │ │ │        OR sun, mon, tue, wed, thu, fri, sat
      │ │ │ │ │
      * * * * *
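      For a concrete reading of the five fields, a few illustrative schedule values (these examples are mine, not from the deck):

      schedule: "*/15 * * * *"   # every 15 minutes
      schedule: "0 2 * * 1"      # at 02:00 every Monday
      schedule: "30 4 1 * *"     # at 04:30 on the 1st of every month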
  14. Job / Batch APIs. Non-parallel Jobs: • normally, only one Pod is started, unless the Pod fails • the Job is complete as soon as its Pod terminates successfully. Parallel Jobs with a fixed completion count: • .spec.completions is > 0 • the Job represents the overall task, and is complete when there are .spec.completions successful Pods
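      A minimal sketch of a fixed-completion-count Job (the name, image, and counts here are illustrative, not from the deck):

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: fixed-count            # hypothetical name
      spec:
        completions: 5               # the Job succeeds after 5 Pods finish successfully
        parallelism: 2               # at most 2 Pods run at any one time
        template:
          spec:
            containers:
            - name: worker
              image: busybox:1.28
              command: ["sh", "-c", "echo processing one unit of work"]
            restartPolicy: Never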
  15. Job / Batch APIs. Parallel Jobs with a work queue: • do not specify .spec.completions; it defaults to .spec.parallelism • the Pods must coordinate among themselves or with an external service to determine what each should work on; for example, a Pod might fetch a batch of up to N items from the work queue • each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done • once at least one Pod has terminated with success and all Pods are terminated, the Job is completed with success
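      A minimal sketch of the work-queue pattern (the worker image and queue endpoint are hypothetical; any queue such as Redis or Pub/Sub would do):

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: queue-workers                          # hypothetical name
      spec:
        parallelism: 5                               # 5 Pods pull items; completions is left unset
        template:
          spec:
            containers:
            - name: worker
              image: example.com/queue-worker:latest # hypothetical image that pops items
              env:                                   # until the queue is empty, then exits 0
              - name: QUEUE_URL
                value: "redis://work-queue:6379"     # hypothetical queue endpoint
            restartPolicy: OnFailure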
  16. batch/v1 missing features: 1. Quota and budgeting to control who can use what and up to what limit
  17. batch/v1 missing features: 1. Quota and budgeting to control who can use what and up to what limit 2. Fair sharing of resources between tenants
  18. batch/v1 missing features: 1. Quota and budgeting to control who can use what and up to what limit 2. Fair sharing of resources between tenants 3. Flexible placement of jobs across different resource types based on availability
  19. batch/v1 missing features: 1. Quota and budgeting to control who can use what and up to what limit 2. Fair sharing of resources between tenants 3. Flexible placement of jobs across different resource types based on availability 4. Support for autoscaled environments where resources can be provisioned on demand
  20. Kueue: Kubernetes-native Job Queueing. An easy way to fairly and efficiently share resources. Kueue is a set of APIs and a controller for job queueing. It is a job-level manager that decides when a job should be admitted to start (as in, pods can be created) and when it should stop (as in, active pods should be deleted).
  21. Kueue: Kubernetes-native Job Queueing. A ResourceFlavor is an object that represents the variations in the nodes available in your cluster by associating them with node labels and taints. For example, you can use ResourceFlavors to represent VMs with different provisioning guarantees (spot versus on-demand), architectures (x86 versus ARM CPUs), or brands and models (NVIDIA A100 versus T4 GPUs). • Quotas and policies for fair sharing among tenants • Resource fungibility: if a resource flavor is fully utilized, Kueue can admit the job using a different flavor

      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: ResourceFlavor
      metadata:
        name: default   # this ResourceFlavor will be used for all the resources
  22. Kueue: Kubernetes-native Job Queueing. A ClusterQueue is a cluster-scoped object that manages a pool of resources such as CPU, memory, and GPU. It manages the ResourceFlavors, limits the usage, and dictates the order in which workloads are admitted.

      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: ClusterQueue
      metadata:
        name: cq
      spec:
        namespaceSelector: {}
        queueingStrategy: BestEffortFIFO   # default queueing strategy
        resources:
        - name: "cpu"
          flavors:
          - name: default
            quota:
              min: 10
        - name: "memory"
          flavors:
          - name: default
            quota:
              min: 10Gi
        - name: "nvidia.com/gpu"
          flavors:
          - name: default
            quota:
              min: 10
        - name: "ephemeral-storage"
          flavors:
          - name: default
            quota:
              min: 10Gi
  23. Kueue: Kubernetes-native Job Queueing. Queueing strategies: BestEffortFIFO, the default, admits workloads in first-in-first-out (FIFO) order, but if there is not enough quota to admit the workload at the head of the queue, the next one in line is tried. StrictFIFO guarantees FIFO semantics: the workload at the head of the queue can block queueing until it can be admitted.

      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: ClusterQueue
      metadata:
        name: cq
      spec:
        namespaceSelector: {}
        queueingStrategy: BestEffortFIFO
        resources:
        - name: "cpu"
          flavors:
          - name: default
            quota:
              min: 10
        - name: "memory"
          flavors:
          - name: default
            quota:
              min: 10Gi
        - name: "nvidia.com/gpu"
          flavors:
          - name: default
            quota:
              min: 10
  24. Kueue: Kubernetes-native Job Queueing. Each team sends its workloads to the LocalQueue in its own namespace; these are then allocated resources by the ClusterQueue.

      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: LocalQueue
      metadata:
        namespace: team-a      # LocalQueue under team-a namespace
        name: team-a-lq
      spec:
        clusterQueue: cq       # point to the ClusterQueue cq
      ---
      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: LocalQueue
      metadata:
        namespace: team-b      # LocalQueue under team-b namespace
        name: team-b-lq
      spec:
        clusterQueue: cq       # point to the ClusterQueue cq
  25. Kueue: Kubernetes-native Job Queueing. Jobs are created under the namespace team-a. This Job points to the LocalQueue team-a-lq. To request GPU resources, nodeSelector is set to nvidia-tesla-t4. The Job is composed of three Pods that sleep for 10 seconds in parallel. Jobs are cleaned up after 60 seconds according to ttlSecondsAfterFinished.

      apiVersion: batch/v1
      kind: Job
      metadata:
        namespace: team-a                        # Job under team-a namespace
        generateName: sample-job-
        annotations:
          kueue.x-k8s.io/queue-name: team-a-lq   # point to the LocalQueue
      spec:
        ttlSecondsAfterFinished: 60              # Job will be deleted after 60 seconds
        parallelism: 3                           # 3 replicas run at the same time
        completions: 3                           # this Job requires 3 completions
        suspend: true                            # set to true so Kueue controls the Job
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: "nvidia-tesla-t4"   # specify the GPU hardware
            containers:
            - name: dummy-job
              image: gcr.io/k8s-staging-perf-tests/sleep:latest
              args: ["10s"]                      # sleep for 10 seconds
              resources:
                requests:
                  cpu: "500m"
                  memory: "512Mi"
                  ephemeral-storage: "512Mi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "500m"
                  memory: "512Mi"
                  ephemeral-storage: "512Mi"
                  nvidia.com/gpu: "1"
            restartPolicy: Never
  26. Prometheus is a popular monitoring tool backed by the CNCF, widely considered the de facto standard monitoring solution for Kubernetes workloads.
  27. Prometheus Challenges for Batch • Prometheus is pull-based • Batch jobs can die/end at any time • Chance of missing important metrics
  28. Prometheus architecture • PromQL is the query language for Prometheus deployments and is increasing in popularity • Grafana is a popular open source dashboard that is closely affiliated with Prometheus • Standardized metrics scraping (i.e. /metrics) in OpenMetrics format. (Diagram: short-lived jobs push their metrics to a Push Gateway; the Prometheus server discovers targets via Kubernetes or file_sd service discovery, pulls metrics from jobs/exporters over HTTP, stores them in its local TSDB, and serves queries to the Prometheus UI, Grafana, and API clients for data visualization.)
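      Since Prometheus is pull-based and a batch Pod can exit before the next scrape, the usual workaround is to push final metrics to the Pushgateway before exiting. A minimal sketch (the Pushgateway address and metric name are assumptions):

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: push-metrics-demo            # hypothetical name
      spec:
        template:
          spec:
            containers:
            - name: worker
              image: curlimages/curl:8.5.0
              command: ["sh", "-c"]
              args:
              - |
                # ... do the batch work here, then push a final metric in the
                # Prometheus text format before the Pod exits
                echo 'batch_records_processed 12345' | \
                  curl --data-binary @- \
                  http://pushgateway:9091/metrics/job/push-metrics-demo
            restartPolicy: Never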
  29. Batch Storage • Cannot do all computation in memory • Write state as checkpoints for fault tolerance • Write intermediate state as a way to pass data • Write output for further use
  30. Batch Storage • Local Storage (HDD, SSD): quick access, not shared; e.g. storage on the Kubernetes node
  31. Batch Storage • Local Storage (HDD, SSD) • File Storage: stores data as a single piece of information in a folder, to help organize it among other data; e.g. Google Filestore, Amazon EFS, Azure Files
  32. Batch Storage • Local Storage (HDD, SSD) • File Storage • Object Storage: takes each piece of data and designates it as an object; e.g. Google Cloud Storage, Amazon S3
  33. Batch Storage • Local Storage (HDD, SSD) • File Storage • Object Storage • Block Storage: splits a file into individual blocks of data and stores those blocks as separate pieces; e.g. Google Cloud Persistent Disk, Amazon Elastic Block Store
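      In a Job, shared file storage is typically requested through a PersistentVolumeClaim that the worker Pods mount for input/output. A minimal sketch (the claim name and storage class are hypothetical; on GKE this could be a Filestore-backed class):

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: batch-output               # hypothetical claim for shared job output
      spec:
        accessModes: ["ReadWriteMany"]   # many worker Pods read/write concurrently
        storageClassName: filestore-rwx  # hypothetical RWX-capable storage class
        resources:
          requests:
            storage: 1Ti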
  34. Components (diagram): Scheduler, Autoscaling, Monitoring System, Job Queue, Worker nodes, Job, INput shared storage, OUTput shared storage
  35. Components mapped to concrete pieces (diagram): K8s Scheduler, K8s Autoscaling with node pools, Prometheus, Kueue, Worker nodes, Job, GCS/Filestore/etc. for input and output storage
  36. Batch Job Types • Ad hoc runs (ML model training, drug discovery, genomics, data analytics)
  37. Batch Job Types • Ad hoc runs • React to events (file processing, user email sending, CI/CD pipelines)
  38. Batch Job Types • Ad hoc runs • React to events • Run on schedule (report generation, analyzing periodic data)
  39. Batch Job Types • Ad hoc runs • React to events • Run on schedule • Fan out / fan in (work pipelines)
  40. Batch Job Types • Ad hoc runs • React to events • Run on schedule • Fan out / fan in • Message passing (distributed work: structural simulation, weather modeling, fluid dynamics)
  41. (Diagram: Kubernetes fans out many Jobs, each writing its own output.)
  42. Possible Solutions • Put all the numbers in a database and query them sorted • Push the data through a message queue and put them in buckets by range; sort the individual ranges and concatenate the files
  43. Possible Solutions • Put all the numbers in a database and query them sorted • Push the data through a message queue and put them in buckets by range; sort the individual ranges and concatenate the files • Use some sort of external sort
  44. Possible Solutions • Put all the numbers in a database and query them sorted • Push the data through a message queue and put them in buckets by range; sort the individual ranges and concatenate the files • Use some sort of external sort • Map-reduce sort
  45. Database solution • If we have a database that can handle such a big dataset, the actual work is quite simple • Overkill • Insertion would be slow and expensive • Getting the data out would be slow and expensive
  46. Message Queue -> Bucket • Combining the sorted files would be fairly quick • Pushing ~15 TB of data through a message queue would be challenging • We would surely overwhelm our network bandwidth • It would be slow
  47. (Diagram: input flows through a message queue to the workers, each writing its own output; the message queue is our bottleneck.)
  48. External Sort • Room for parallelization • Need to synchronize work at some point, which will be the bottleneck for peak performance
  49. Map Reduce • Works really well for parallelizable workloads • Map a bunch of data and send it to the reducers • Sort and write to disk • Conceptually similar to the message queue system
  50. (Diagram: input is split across mappers, shuffled to reducers, and each reducer writes its own output.)
  51. (Diagram: the same map-reduce pipeline; the mapper-to-reducer shuffle is our bottleneck.)
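      One way to express the mapper fan-out on Kubernetes is an Indexed Job, where each Pod receives a completion index and processes its own shard of the input. A minimal sketch (the shard count, image, and paths are illustrative, not from the deck):

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: map-shards                             # hypothetical name
      spec:
        completions: 10                              # 10 input shards
        parallelism: 10
        completionMode: Indexed                      # each Pod gets JOB_COMPLETION_INDEX
        template:
          spec:
            containers:
            - name: mapper
              image: example.com/range-sorter:latest # hypothetical mapper image
              command: ["sh", "-c"]
              args:
              - |
                # read shard $JOB_COMPLETION_INDEX from shared input storage,
                # bucket/sort it, and write the result to shared output storage
                echo "processing shard $JOB_COMPLETION_INDEX"
            restartPolicy: Never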
  52. Design • Design loosely coupled workloads if possible • The network is not unlimited • Parallelization is your friend • Limited communication between workers scales best
  53. Optimize • Optimization is money • You only need resources when the work is running • Try to run jobs colocated for faster network throughput
  54. Scale K8s • Scale down when not needed • Scale up to the limit (5K nodes for OSS K8s, 15K for GKE) • Use spot/preemptible VMs for cost savings (see the sketch below)
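      A sketch of steering a Job's Pods onto GKE spot capacity, assuming the spot node pool carries the standard cloud.google.com/gke-spot=true label and was created with a matching taint:

      # Pod template fragment inside a Job spec
      spec:
        nodeSelector:
          cloud.google.com/gke-spot: "true"
        tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule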
  55. Monitor • Monitor cluster resource usage • Collecting too many metrics might slow down the work, so only collect the most important data
  56. Limits • Understand the individual limits of your Kubernetes cluster • It's a multidimensional envelope problem: ◦ #Pods ◦ #Services ◦ #Nodes ◦ #Secrets • etcd has a set limit on total storage and on the storage of a single type of object
  57. Fault Tolerance • Assume any and every job can fail • Assume nodes will be preempted • Create checkpoints so your job can restart without losing work (see the sketch below)
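      A minimal sketch of the checkpoint-and-resume pattern: progress is recorded on shared storage so a restarted Pod continues where the last one stopped. The claim name, paths, and loop are illustrative:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: checkpointed-work            # hypothetical name
      spec:
        backoffLimit: 10                   # tolerate several preemptions/restarts
        template:
          spec:
            volumes:
            - name: ckpt
              persistentVolumeClaim:
                claimName: batch-output    # hypothetical shared claim (see storage sketch)
            containers:
            - name: worker
              image: busybox:1.28
              volumeMounts:
              - name: ckpt
                mountPath: /ckpt
              command: ["sh", "-c"]
              args:
              - |
                # resume from the last checkpoint if one exists
                i=$(cat /ckpt/progress 2>/dev/null || echo 0)
                while [ "$i" -lt 100 ]; do
                  # ... process unit $i, then record progress atomically
                  echo "$((i + 1))" > /ckpt/progress.tmp && mv /ckpt/progress.tmp /ckpt/progress
                  i=$((i + 1))
                done
            restartPolicy: OnFailure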
  58. GCP and GKE • Node limit of 15,000 • Managed Prometheus for zero-hassle metrics collection, backed by Google's planet-scale Monarch time series database • Compact placement ("Define compact placement for GKE nodes" in the GKE docs) • Time-sharing GPUs ("Time-sharing GPUs on GKE" in the GKE docs) • GKE Image streaming ("Use Image streaming to pull container images" in the GKE docs) • GKE cost allocation ("View detailed breakdown of cluster costs" in the GKE docs)