
Running Batch Workload on K8s at Scale


We live in the age of data. We generate and consume more data than ever before, but data is useless without context. We run ETL jobs to process it, run HPC workloads to analyze it, and build machine learning models to automate processes and make better decisions. Batch processing enables all of this; it is the backbone of data science. But we need to rethink how we approach batch processing in order to take advantage of modern hardware, containers, and cloud infrastructure.

In Kubernetes we can run Jobs and CronJobs quite easily, but that alone is not enough to run batch workloads at scale. We need to consider scalability, cost optimization, and performance, and we need to think about the day-2 operations of managing and upgrading the platform.

In this talk we will discuss how to run batch workloads on k8s at scale. We will cover what types of workloads are good candidates for running on k8s, how to design them to be easy to manage and scale, and common pitfalls to avoid.

Mofizur Rahman

March 11, 2023

Transcript

  1. Who Needs Batch • Data-intensive tasks • Scientific research • Parallelizable workloads (ML model training, data processing)
  2. Types of Batch Workload • ETL pipelines • ML model training • HPC workloads • Data analytics
  3. Why do we use Batch • Cost • Speed • Reliability • Repeatability
  4. Components (diagram): Scheduler, Monitoring System, Job Queue, Worker nodes, Job, INput shared storage, OUTput shared storage
  5. Components (diagram, now with autoscaling): Scheduler, Autoscaling, Monitoring System, Job Queue, Worker nodes, Job, INput shared storage, OUTput shared storage
  6. Why use Kubernetes • Resource management • Scalability • Fault tolerance • Monitoring and logging • Portability
  7. Why use Kubernetes • Resource management • Scalability • Fault tolerance • Monitoring and logging • Portability • Other workloads are already on k8s and you are trying to normalize your platform to one thing
  8. Job

  9. What is a job? • Computations that run to completion • A group of pods that run independently or collaboratively to process a task
  10. What is a job? • Computations that run to completion • A group of pods that run independently or collaboratively to process a task • Often flexible on time, location and/or types of resources
  11. Job / Batch APIs. A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate.

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: pi
      spec:
        template:
          spec:
            containers:
            - name: pi
              image: perl:5.34.0
              command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
            restartPolicy: Never
        backoffLimit: 4
  12. CronJob / Batch APIs. A CronJob creates one or more Pods and will continue to attempt execution of the Pods on a specified schedule until the CronJob object is removed.

      apiVersion: batch/v1
      kind: CronJob
      metadata:
        name: hello
      spec:
        schedule: "* * * * *"
        jobTemplate:
          spec:
            template:
              spec:
                containers:
                - name: hello
                  image: busybox:1.28
                  imagePullPolicy: IfNotPresent
                  command:
                  - /bin/sh
                  - -c
                  - date; echo Hello from the Kubernetes cluster
                restartPolicy: OnFailure
  13. Cron Schedule

      ┌────────── minute (0 - 59)
      │ ┌───────── hour (0 - 23)
      │ │ ┌──────── day of the month (1 - 31)
      │ │ │ ┌─────── month (1 - 12)
      │ │ │ │ ┌────── day of the week (0 - 6) (Sunday to Saturday;
      │ │ │ │ │        7 is also Sunday on some systems)
      │ │ │ │ │        OR sun, mon, tue, wed, thu, fri, sat
      │ │ │ │ │
      * * * * *
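      For a concrete reading of the five fields, a few illustrative schedule values (these examples are mine, not from the deck):

      schedule: "*/15 * * * *"   # every 15 minutes
      schedule: "0 2 * * 1"      # at 02:00 every Monday
      schedule: "30 4 1 * *"     # at 04:30 on the 1st of every month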
  14. Job / Batch APIs. Non-parallel Jobs: • normally, only one Pod is started, unless the Pod fails • the Job is complete as soon as its Pod terminates successfully. Parallel Jobs with a fixed completion count: • .spec.completions is > 0 • the Job represents the overall task, and is complete when there are .spec.completions successful Pods
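      A minimal sketch of a fixed-completion-count Job (the name, image, and counts here are illustrative, not from the deck):

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: fixed-count            # hypothetical name
      spec:
        completions: 5               # the Job succeeds after 5 Pods finish successfully
        parallelism: 2               # at most 2 Pods run at any one time
        template:
          spec:
            containers:
            - name: worker
              image: busybox:1.28
              command: ["sh", "-c", "echo processing one unit of work"]
            restartPolicy: Never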
  15. Job / Batch APIs. Parallel Jobs with a work queue: • do not specify .spec.completions; it defaults to .spec.parallelism • the Pods must coordinate among themselves or with an external service to determine what each should work on; for example, a Pod might fetch a batch of up to N items from the work queue • each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done • once at least one Pod has terminated with success and all Pods are terminated, the Job is completed with success
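      A minimal sketch of the work-queue pattern (the worker image and queue endpoint are hypothetical; any queue such as Redis or Pub/Sub would do):

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: queue-workers                          # hypothetical name
      spec:
        parallelism: 5                               # 5 Pods pull items; completions is left unset
        template:
          spec:
            containers:
            - name: worker
              image: example.com/queue-worker:latest # hypothetical image that pops items
              env:                                   # until the queue is empty, then exits 0
              - name: QUEUE_URL
                value: "redis://work-queue:6379"     # hypothetical queue endpoint
            restartPolicy: OnFailure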
  16. batch/v1 missing features: 1. Quota and budgeting to control who can use what and up to what limit
  17. batch/v1 missing features: 1. Quota and budgeting to control who can use what and up to what limit 2. Fair sharing of resources between tenants
  18. batch/v1 missing features: 1. Quota and budgeting to control who can use what and up to what limit 2. Fair sharing of resources between tenants 3. Flexible placement of jobs across different resource types based on availability
  19. batch/v1 missing features: 1. Quota and budgeting to control who can use what and up to what limit 2. Fair sharing of resources between tenants 3. Flexible placement of jobs across different resource types based on availability 4. Support for autoscaled environments where resources can be provisioned on demand
  20. Kueue: Kubernetes-native Job Queueing. An easy way to fairly and efficiently share resources. Kueue is a set of APIs and a controller for job queueing. It is a job-level manager that decides when a job should be admitted to start (as in, pods can be created) and when it should stop (as in, active pods should be deleted).
  21. Kueue: Kubernetes-native Job Queueing. A ResourceFlavor is an object that represents the variations in the nodes available in your cluster by associating them with node labels and taints. For example, you can use ResourceFlavors to represent VMs with different provisioning guarantees (spot versus on-demand), architectures (x86 versus ARM CPUs), or brands and models (NVIDIA A100 versus T4 GPUs). • Quotas and policies for fair sharing among tenants • Resource fungibility: if a resource flavor is fully utilized, Kueue can admit the job using a different flavor

      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: ResourceFlavor
      metadata:
        name: default   # this ResourceFlavor will be used for all the resources
  22. Kueue: Kubernetes-native Job Queueing. A ClusterQueue is a cluster-scoped object that manages a pool of resources such as CPU, memory, and GPU. It manages the ResourceFlavors, limits the usage, and dictates the order in which workloads are admitted.

      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: ClusterQueue
      metadata:
        name: cq
      spec:
        namespaceSelector: {}
        queueingStrategy: BestEffortFIFO   # default queueing strategy
        resources:
        - name: "cpu"
          flavors:
          - name: default
            quota:
              min: 10
        - name: "memory"
          flavors:
          - name: default
            quota:
              min: 10Gi
        - name: "nvidia.com/gpu"
          flavors:
          - name: default
            quota:
              min: 10
        - name: "ephemeral-storage"
          flavors:
          - name: default
            quota:
              min: 10Gi
  23. Kueue: Kubernetes-native Job Queueing. Queueing strategies: BestEffortFIFO, the default, admits workloads in first-in-first-out (FIFO) order, but if there is not enough quota to admit the workload at the head of the queue, the next one in line is tried. StrictFIFO guarantees FIFO semantics: the workload at the head of the queue can block queueing until it can be admitted.

      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: ClusterQueue
      metadata:
        name: cq
      spec:
        namespaceSelector: {}
        queueingStrategy: BestEffortFIFO
        resources:
        - name: "cpu"
          flavors:
          - name: default
            quota:
              min: 10
        - name: "memory"
          flavors:
          - name: default
            quota:
              min: 10Gi
        - name: "nvidia.com/gpu"
          flavors:
          - name: default
            quota:
              min: 10
  24. Kueue: Kubernetes-native Job Queueing. Each team sends its workloads to the LocalQueue in its own namespace; these are then allocated resources by the ClusterQueue.

      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: LocalQueue
      metadata:
        namespace: team-a      # LocalQueue under team-a namespace
        name: team-a-lq
      spec:
        clusterQueue: cq       # point to the ClusterQueue cq
      ---
      apiVersion: kueue.x-k8s.io/v1alpha2
      kind: LocalQueue
      metadata:
        namespace: team-b      # LocalQueue under team-b namespace
        name: team-b-lq
      spec:
        clusterQueue: cq       # point to the ClusterQueue cq
  25. Kueue: Kubernetes-native Job Queueing. Jobs are created under the namespace team-a. This Job points to the LocalQueue team-a-lq. To request GPU resources, nodeSelector is set to nvidia-tesla-t4. The Job is composed of three Pods that sleep for 10 seconds in parallel. Jobs are cleaned up after 60 seconds according to ttlSecondsAfterFinished.

      apiVersion: batch/v1
      kind: Job
      metadata:
        namespace: team-a                        # Job under team-a namespace
        generateName: sample-job-
        annotations:
          kueue.x-k8s.io/queue-name: team-a-lq   # point to the LocalQueue
      spec:
        ttlSecondsAfterFinished: 60              # Job will be deleted after 60 seconds
        parallelism: 3                           # 3 replicas run at the same time
        completions: 3                           # this Job requires 3 completions
        suspend: true                            # set to true so Kueue controls the Job
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: "nvidia-tesla-t4"   # specify the GPU hardware
            containers:
            - name: dummy-job
              image: gcr.io/k8s-staging-perf-tests/sleep:latest
              args: ["10s"]                      # sleep for 10 seconds
              resources:
                requests:
                  cpu: "500m"
                  memory: "512Mi"
                  ephemeral-storage: "512Mi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "500m"
                  memory: "512Mi"
                  ephemeral-storage: "512Mi"
                  nvidia.com/gpu: "1"
            restartPolicy: Never
  26. Prometheus is a popular monitoring tool backed by the CNCF, widely considered the de facto standard monitoring solution for Kubernetes workloads.
  27. Prometheus Challenges for Batch • Prometheus is pull-based • Batch jobs can die/end at any time • Chance of missing important metrics
  28. Prometheus architecture • PromQL is the query language for Prometheus deployments and is increasing in popularity • Grafana is a popular open source dashboard that is closely affiliated with Prometheus • Standardized metrics scraping (i.e. /metrics) in OpenMetrics format. (Diagram: short-lived jobs push their metrics to a Push Gateway; the Prometheus server discovers targets via Kubernetes or file_sd service discovery, pulls metrics from jobs/exporters over HTTP, stores them in its local TSDB, and serves queries to the Prometheus UI, Grafana, and API clients for data visualization.)
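      Since Prometheus is pull-based and a batch Pod can exit before the next scrape, the usual workaround is to push final metrics to the Pushgateway before exiting. A minimal sketch (the Pushgateway address and metric name are assumptions):

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: push-metrics-demo            # hypothetical name
      spec:
        template:
          spec:
            containers:
            - name: worker
              image: curlimages/curl:8.5.0
              command: ["sh", "-c"]
              args:
              - |
                # ... do the batch work here, then push a final metric in the
                # Prometheus text format before the Pod exits
                echo 'batch_records_processed 12345' | \
                  curl --data-binary @- \
                  http://pushgateway:9091/metrics/job/push-metrics-demo
            restartPolicy: Never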
  29. Batch Storage • Cannot do all computation in memory • Write state as checkpoints for fault tolerance • Write intermediate state as a way to pass data • Write output for further use
  30. Batch Storage • Local Storage (HDD, SSD): quick access, not shared; e.g. storage on the Kubernetes node
  31. Batch Storage • Local Storage (HDD, SSD) • File Storage: stores data as a single piece of information in a folder, to help organize it among other data; e.g. Google Filestore, Amazon EFS, Azure Files
  32. Batch Storage • Local Storage (HDD, SSD) • File Storage • Object Storage: takes each piece of data and designates it as an object; e.g. Google Cloud Storage, Amazon S3
  33. Batch Storage • Local Storage (HDD, SSD) • File Storage • Object Storage • Block Storage: splits a file into individual blocks of data and stores those blocks as separate pieces; e.g. Google Cloud Persistent Disk, Amazon Elastic Block Store
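      In a Job, shared file storage is typically requested through a PersistentVolumeClaim that the worker Pods mount for input/output. A minimal sketch (the claim name and storage class are hypothetical; on GKE this could be a Filestore-backed class):

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: batch-output               # hypothetical claim for shared job output
      spec:
        accessModes: ["ReadWriteMany"]   # many worker Pods read/write concurrently
        storageClassName: filestore-rwx  # hypothetical RWX-capable storage class
        resources:
          requests:
            storage: 1Ti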
  34. Components (diagram): Scheduler, Autoscaling, Monitoring System, Job Queue, Worker nodes, Job, INput shared storage, OUTput shared storage
  35. Components mapped to concrete pieces (diagram): K8s Scheduler, K8s Autoscaling with node pools, Prometheus, Kueue, Worker nodes, Job, GCS/Filestore/etc. for input and output storage
  36. Batch Job Types • Ad hoc runs (ML model training, drug discovery, genomics, data analytics)
  37. Batch Job Types • Ad hoc runs • React to events (file processing, user email sending, CI/CD pipelines)
  38. Batch Job Types • Ad hoc runs • React to events • Run on schedule (report generation, analyzing periodic data)
  39. Batch Job Types • Ad hoc runs • React to events • Run on schedule • Fan out / fan in (work pipelines)
  40. Batch Job Types • Ad hoc runs • React to events • Run on schedule • Fan out / fan in • Message passing (distributed work: structural simulation, weather modeling, fluid dynamics)
  41. (Diagram: Kubernetes fans out many Jobs, each writing its own output.)
  42. Possible Solutions • Put all the numbers in a database and query them sorted • Push the data through a message queue and put them in buckets by range; sort the individual ranges and concatenate the files
  43. Possible Solutions • Put all the numbers in a database and query them sorted • Push the data through a message queue and put them in buckets by range; sort the individual ranges and concatenate the files • Use some sort of external sort
  44. Possible Solutions • Put all the numbers in a database and query them sorted • Push the data through a message queue and put them in buckets by range; sort the individual ranges and concatenate the files • Use some sort of external sort • Map-reduce sort
  45. Database solution • If we have a database that can handle such a big dataset, the actual work is quite simple • Overkill • Insertion would be slow and expensive • Getting the data out would be slow and expensive
  46. Message Queue -> Bucket • Combining the sorted files would be fairly quick • Pushing ~15 TB of data through a message queue would be challenging • We would surely overwhelm our network bandwidth • It would be slow
  47. (Diagram: input flows through a message queue to the workers, each writing its own output; the message queue is our bottleneck.)
  48. External Sort • Room for parallelization • Need to synchronize work at some point, which will be the bottleneck for peak performance
  49. Map Reduce • Works really well for parallelizable workloads • Map a bunch of data and send it to the reducers • Sort and write to disk • Conceptually similar to the message queue system
  50. (Diagram: input is split across mappers, shuffled to reducers, and each reducer writes its own output.)
  51. (Diagram: the same map-reduce pipeline; the mapper-to-reducer shuffle is our bottleneck.)
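      One way to express the mapper fan-out on Kubernetes is an Indexed Job, where each Pod receives a completion index and processes its own shard of the input. A minimal sketch (the shard count, image, and paths are illustrative, not from the deck):

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: map-shards                             # hypothetical name
      spec:
        completions: 10                              # 10 input shards
        parallelism: 10
        completionMode: Indexed                      # each Pod gets JOB_COMPLETION_INDEX
        template:
          spec:
            containers:
            - name: mapper
              image: example.com/range-sorter:latest # hypothetical mapper image
              command: ["sh", "-c"]
              args:
              - |
                # read shard $JOB_COMPLETION_INDEX from shared input storage,
                # bucket/sort it, and write the result to shared output storage
                echo "processing shard $JOB_COMPLETION_INDEX"
            restartPolicy: Never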
  52. Design • Design loosely coupled workloads if possible • The network is not unlimited • Parallelization is your friend • Limited communication between workers scales best
  53. Optimize • Optimization is money • You only need resources when the work is running • Try to run jobs colocated for faster network throughput
  54. Scale K8s • Scale down when not needed • Scale up to the limit (5K nodes for OSS K8s, 15K for GKE) • Use spot/preemptible VMs for cost savings (see the sketch below)
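      A sketch of steering a Job's Pods onto GKE spot capacity, assuming the spot node pool carries the standard cloud.google.com/gke-spot=true label and was created with a matching taint:

      # Pod template fragment inside a Job spec
      spec:
        nodeSelector:
          cloud.google.com/gke-spot: "true"
        tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule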
  55. Monitor • Monitor cluster resource usage • Collecting too many metrics might slow down the work, so only collect the most important data
  56. Limits • Understand the individual limits of your Kubernetes cluster • It's a multidimensional envelope problem: ◦ #Pods ◦ #Services ◦ #Nodes ◦ #Secrets • etcd has a set limit on total storage and on the storage of a single type of object
  57. Fault Tolerance • Assume any and every job can fail • Assume nodes will be preempted • Create checkpoints so your job can restart without losing work (see the sketch below)
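      A minimal sketch of the checkpoint-and-resume pattern: progress is recorded on shared storage so a restarted Pod continues where the last one stopped. The claim name, paths, and loop are illustrative:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: checkpointed-work            # hypothetical name
      spec:
        backoffLimit: 10                   # tolerate several preemptions/restarts
        template:
          spec:
            volumes:
            - name: ckpt
              persistentVolumeClaim:
                claimName: batch-output    # hypothetical shared claim (see storage sketch)
            containers:
            - name: worker
              image: busybox:1.28
              volumeMounts:
              - name: ckpt
                mountPath: /ckpt
              command: ["sh", "-c"]
              args:
              - |
                # resume from the last checkpoint if one exists
                i=$(cat /ckpt/progress 2>/dev/null || echo 0)
                while [ "$i" -lt 100 ]; do
                  # ... process unit $i, then record progress atomically
                  echo "$((i + 1))" > /ckpt/progress.tmp && mv /ckpt/progress.tmp /ckpt/progress
                  i=$((i + 1))
                done
            restartPolicy: OnFailure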
  58. GCP and GKE • Node limit of 15,000 • Managed Prometheus for zero-hassle metrics collection, backed by Google's planet-scale Monarch time series database • Compact placement ("Define compact placement for GKE nodes" in the GKE docs) • Time-sharing GPUs ("Time-sharing GPUs on GKE" in the GKE docs) • GKE Image streaming ("Use Image streaming to pull container images" in the GKE docs) • GKE cost allocation ("View detailed breakdown of cluster costs" in the GKE docs)