GPUs on Kubernetes, OpenShift & OKD, Zvonko Kaiser, Red Hat

A Pattern to Enable Hardware Accelerators

Zvonko Kaiser, Red Hat

Red Hat Livestreaming

August 07, 2020

Transcript

  1. CONFIDENTIAL designator V0000000 AGENDA: What we'll be discussing today

    Container Engines, Bootstrap, Autoscaling, Metrics, Accelerator, Platform, Workload, Building Blocks
  2. BARE METAL: Spawning and running containers

    RUNTIME: OCI specification - runc - containerd. HOOKS: Prestart, Poststart, Poststop. Hooks can be used to enhance the functionality of a container runtime, for example to mount files or configure cgroups.
  3. BARE METAL: Spawning and running containers

    RUNTIME: OCI specification - runc - containerd. HOOKS: NVIDIA prestart hook. The NV prestart hook configures the container to use GPUs: bind mount of devices, binaries and libraries, mount files, configure cgroups.
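
As a minimal sketch of how a prestart hook is registered, the following builds the `hooks` section of an OCI runtime `config.json`. The `hooks.prestart` list and its `path`, `args` and `env` fields come from the OCI runtime specification; the hook path and arguments here are hypothetical placeholders, not the actual NVIDIA hook invocation.

```python
import json


def add_prestart_hook(config, path, args=None, env=None):
    """Return a copy of an OCI runtime config with one more prestart hook."""
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    hook = {"path": path}  # the spec requires an absolute path
    if args:
        hook["args"] = list(args)
    if env:
        hook["env"] = list(env)
    updated.setdefault("hooks", {}).setdefault("prestart", []).append(hook)
    return updated


# Hypothetical container config and hook binary, for illustration only.
base = {"ociVersion": "1.0.2", "process": {"args": ["nvidia-smi"]}}
with_hook = add_prestart_hook(
    base,
    "/usr/local/bin/example-gpu-hook",
    args=["example-gpu-hook", "prestart"],
)
print(json.dumps(with_hook["hooks"], indent=2))
```

The runtime runs each registered prestart hook before the container process starts, which is the point where a GPU hook can add devices and libraries to the container.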
  4. BARE METAL: Summary & Resources

    Bare Metal Enablement
    https://github.com/NVIDIA/dgx-selinux RHEL SELinux Policy for NVIDIA
    https://www.redhat.com/en/blog/how-use-gpus-containers-bare-metal-rhel-8 How to enable NVIDIA GPUs in containers on bare metal in RHEL 8
    https://github.com/zvonkok/oci-decorator Simple Prestart Hook Implementation
  5. OPENSHIFT: Use a MachineSet to scale the cluster with a GPU node

    CONTROL PLANE: 3 Master Nodes. CPU WORKERS: 3 CPU Nodes.
  6. OPENSHIFT: Heterogeneous cluster with different compute units

    CONTROL PLANE: 3 Master Nodes. CPU WORKERS: 3 CPU Nodes. GPU WORKERS: 1 GPU Node.
  7. OPENSHIFT: Every node could have features that are interesting to different pods

    NODE, ACCELERATOR; hooks: Prestart, Poststart, Poststop. Features: GPU, FPGA, NIC, AVX512, 4.18.0-80.1.2, RHEL7, 8, RHCOS.
  8. OPENSHIFT: Node Feature Discovery exposes node features as labels

    NODE, ACCELERATOR; hooks: Prestart, Poststart, Poststop. Labels: pci-10de=present, pci-8086=present, pci-1924=present, cpuid-AVX512=true, kernel-ver.major=4, os_release.ID=rhel.
  9. OPENSHIFT: Optimized workloads can be placed on the right node

    NODE, ACCELERATOR; hooks: Prestart, Poststart, Poststop. Labels: pci-10de=present, pci-8086=present, pci-1924=present, cpuid-AVX512=true, kernel-ver.major=4, os_release.ID=rhel. Six PODs.
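
Label-based placement can be sketched as a simple selector match. The labels below are the shortened forms shown on the slides (10de is NVIDIA's PCI vendor ID); the node names and the `eligible_nodes` helper are hypothetical, and a real cluster would express the selector as a `nodeSelector` or node affinity in the pod spec.

```python
def parse_labels(text):
    """Turn 'key=value key=value' into a dict of node labels."""
    labels = {}
    for item in text.split():
        key, _, value = item.partition("=")
        labels[key] = value
    return labels


def eligible_nodes(nodes, selector):
    """Return the names of nodes whose labels satisfy every selector entry."""
    return sorted(
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


# Hypothetical nodes carrying labels in the style shown on the slides.
nodes = {
    "gpu-node": parse_labels(
        "pci-10de=present cpuid-AVX512=true kernel-ver.major=4 os_release.ID=rhel"
    ),
    "cpu-node": parse_labels("pci-8086=present kernel-ver.major=4 os_release.ID=rhel"),
}

print(eligible_nodes(nodes, {"pci-10de": "present"}))
print(eligible_nodes(nodes, {"os_release.ID": "rhel"}))
```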
  10. BARE METAL: Node Feature Discovery

    Exposing node features the easy way since 4.2
    https://github.com/kubernetes-sigs/node-feature-discovery Upstream NFD & Operator
    https://github.com/openshift/node-feature-discovery Downstream NFD & Operator
    https://www.openshift.com/blog/building-multiarch-imagestream-with-the-nfd-operator-and-openshift-4 Building Multiarch imageStream with the NFD Operator and OpenShift 4
  11. OPENSHIFT ACCELERATOR STACK: Every step has to be validated; it makes no sense to advance otherwise

    driver-container: modules, userspace, hook. driver-container-validation: small workload using the accelerator.
  12. OPENSHIFT ACCELERATOR STACK: The device plugin does health checks and updates the node's capacity

    driver-container: modules, userspace, hook. driver-container-validation: small workload using the accelerator. device-plugin: expose the accelerator to the cluster.
  13. OPENSHIFT ACCELERATOR STACK: A Pod must be able to allocate an extended resource

    driver-container: modules, userspace, hook. driver-container-validation: small workload using the accelerator. device-plugin: expose the accelerator to the cluster. device-plugin-validation: allocate the resource and use it.
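
A validation pod of this kind could look roughly like the manifest generated below. This is a sketch: `nvidia.com/gpu` is the extended resource name commonly advertised by the NVIDIA device plugin, while the pod name and image are hypothetical. Extended resources are requested under `limits`.

```python
import json


def gpu_pod(name, image, gpus=1, resource="nvidia.com/gpu"):
    """Build a Pod manifest that asks for `gpus` units of an extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are whole units, set under limits.
                    "resources": {"limits": {resource: gpus}},
                }
            ],
        },
    }


# Hypothetical name and image; the JSON output can be fed to `kubectl apply -f -`.
pod = gpu_pod("gpu-validation", "registry.example.com/cuda-sample:latest")
print(json.dumps(pod, indent=2))
```

If the pod is scheduled and completes, the device plugin has advertised the resource and the kubelet was able to allocate it.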
  14. OPENSHIFT ACCELERATOR STACK: Special resource node-exporter registering with the cluster stack

    driver-container: modules, userspace, hook. driver-container-validation: small workload using the accelerator. device-plugin: expose the accelerator to the cluster. device-plugin-validation: allocate the resource and use it. monitoring: set up Prometheus and Grafana, metrics and alerts.
  15. OPENSHIFT ACCELERATOR STACK: Discover advanced features of a special resource

    driver-container: modules, userspace, hook. driver-container-validation: small workload using the accelerator. device-plugin: expose the accelerator to the cluster. device-plugin-validation: allocate the resource and use it. monitoring: set up Prometheus and Grafana, metrics and alerts. feature-discovery: sidecar container for NFD, fine-grained scheduling.
  16. OPENSHIFT ACCELERATOR STACK: The SRO is a pattern to enable special resources in OpenShift

    SPECIAL RESOURCE OPERATOR. driver-container: modules, userspace, hook. driver-container-validation: small workload using the accelerator. device-plugin: expose the accelerator to the cluster. device-plugin-validation: allocate the resource and use it. monitoring: set up Prometheus and Grafana, metrics and alerts. feature-discovery: sidecar container for NFD, fine-grained scheduling.
  17. OPENSHIFT ACCELERATOR STACK: The SRO is a pattern to enable special resources in OpenShift

    SPECIAL RESOURCE OPERATOR: driver-container, driver-container-validation, device-plugin, device-plugin-validation, monitoring, feature-discovery. CONFIGURATIONS: Hard or Soft Partitioning, Driver Version, Custom Manifests.
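
The "validate every step before advancing" idea can be sketched as a gated pipeline. The stage names are taken from the slides; `deploy_stack` and the `fake_run` callback are hypothetical stand-ins for whatever actually deploys and checks each stage, and this is not the SRO's real implementation.

```python
STAGES = [
    "driver-container",
    "driver-container-validation",
    "device-plugin",
    "device-plugin-validation",
    "monitoring",
    "feature-discovery",
]


def deploy_stack(stages, run_stage):
    """Run stages in order and stop at the first one that does not succeed.

    Returns (completed_stages, failed_stage); failed_stage is None on success.
    """
    completed = []
    for stage in stages:
        if not run_stage(stage):
            return completed, stage
        completed.append(stage)
    return completed, None


def fake_run(stage):
    """Hypothetical runner: pretend the device plugin validation fails."""
    return stage != "device-plugin-validation"


done, failed = deploy_stack(STAGES, fake_run)
print("completed:", done)
print("stopped at:", failed)
```

Monitoring and feature discovery are never attempted in this run, because an earlier validation did not pass.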
  18. OPENSHIFT: NFD detects kernel version and labels node

    GPU Worker Node (CoreOS: kernel 4.18.0-80): 2 GPUs, kubelet, CRI-O, NFD Worker Daemonset, label kernel=4.18.0-80. kubernetes API (Master): kernel=4.18.0-80.
  19. OPENSHIFT: SRO builds driver container image against kernel

    GPU Worker Node (CoreOS: kernel 4.18.0-80): 2 GPUs, kubelet, CRI-O, label kernel=4.18.0-80. Special Resource Operator (SRO): container build (driver-container-4.18.0-80), image registry.
  20. OPENSHIFT: SRO targets specific kernel version hosts

    GPU Worker Node (CoreOS: kernel 4.18.0-80): 2 GPUs, kubelet, CRI-O, label kernel=4.18.0-80, GPU Driver Daemonset. Special Resource Operator (SRO): driver-container-4.18.0-80.
  21. OPENSHIFT: NFD detects updated kernel and relabels node

    GPU Worker Node (CoreOS: kernel 4.18.0-147*): 2 GPUs, kubelet, CRI-O, NFD Worker Daemonset, label kernel=4.18.0-147*. kubernetes API (Master): kernel=4.18.0-147*.
  22. OPENSHIFT: SRO detects mismatch and rebuilds driver container

    GPU Worker Node (CoreOS: kernel 4.18.0-147*): 2 GPUs, kubelet, CRI-O, label kernel=4.18.0-147*, GPU Driver Daemonset (driver-container-4.18.0-80). Special Resource Operator (SRO): BuildConfig (driver-container-4.18.0-147), image registry.
  23. OPENSHIFT: SRO updates daemonset with new image

    GPU Worker Node (CoreOS: kernel 4.18.0-147): 2 GPUs, kubelet, CRI-O, label kernel=4.18.0-147, GPU Driver Daemonset. Special Resource Operator (SRO): driver-container-4.18.0-147.
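
The kernel-tracking flow above (label, build, target, relabel, rebuild, update) can be summarized as a small reconcile step. This is a sketch under stated assumptions: the `reconcile` function and its action names are hypothetical, and only the `driver-container-<kernel>` naming is taken from the slides.

```python
def image_for(kernel):
    """Name of the driver container image built against a given kernel."""
    return f"driver-container-{kernel}"


def reconcile(node_kernel, daemonset_image, built_images):
    """Return the actions needed so the daemonset matches the node's kernel."""
    wanted = image_for(node_kernel)
    actions = []
    if wanted not in built_images:
        actions.append(("build", wanted))  # no image for this kernel yet
    if daemonset_image != wanted:
        actions.append(("update-daemonset", wanted))  # roll out the new image
    return actions


# The node was updated from kernel 4.18.0-80 to 4.18.0-147.
actions = reconcile(
    node_kernel="4.18.0-147",
    daemonset_image="driver-container-4.18.0-80",
    built_images={"driver-container-4.18.0-80"},
)
for action in actions:
    print(action)
```

When the label and the running image already agree, `reconcile` returns an empty list and nothing is rebuilt.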
  24. BARE METAL: Special Resource Operator

    Enable special resources the OpenShift way
    https://bit.ly/31utbkm Special Resource Operator
    https://red.ht/2XyOKz6 How to use entitled image builds to build DriverContainers with UBI on OpenShift
    https://red.ht/2JQuNwB Part 1: How to Enable Hardware Accelerators on OpenShift
    https://red.ht/34ubzq3 Part 2: How to enable Hardware Accelerators on OpenShift, SRO Building Blocks
    https://red.ht/34ubzq3 Simplifying deployments of accelerated AI workloads on Red Hat OpenShift with NVIDIA GPU Operator
  25. OPENSHIFT: Initial GPU cluster configuration

    CONTROL PLANE: 3 Master Nodes. CPU WORKERS: 3 CPU Nodes. GPU WORKERS: 1 GPU Node.
  26. OPENSHIFT: Use the Cluster Autoscaler for on demand GPU nodes

    CONTROL PLANE: 3 Master Nodes. CPU WORKERS: 3 CPU Nodes. GPU WORKERS: 3 GPU Nodes.
  27. OPENSHIFT: Hard Partitioning - Taints and Tolerations

    CONTROL PLANE: 3 Master Nodes. CPU WORKERS: 3 CPU Nodes. GPU WORKERS: 3 GPU Nodes, GPU ONLY PODS.
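
Hard partitioning rests on the rule that a pod lands on a tainted node only if it tolerates every scheduling-relevant taint. The sketch below follows the matching rules from the Kubernetes documentation (operator `Exists` or `Equal`, an empty toleration effect matches any effect); the taint key and value are hypothetical choices for a GPU-only node.

```python
def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def schedulable(pod_tolerations, node_taints):
    """True if every NoSchedule/NoExecute taint on the node is tolerated."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in blocking)


# Hypothetical taint that reserves a node for GPU pods.
gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
gpu_pod_tolerations = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]

print(schedulable(gpu_pod_tolerations, [gpu_taint]))  # GPU pod
print(schedulable([], [gpu_taint]))                   # pod without tolerations
```

A toleration only permits scheduling onto the tainted node; it does not attract the pod there, so it is usually paired with a node selector or affinity.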
  28. OPENSHIFT: Soft Partitioning - Priority Classes

    CONTROL PLANE: 3 Master Nodes. CPU WORKERS: 3 CPU Nodes. GPU WORKERS: 3 GPU Nodes, HIGH PRIORITY, LOW PRIORITY.
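
Soft partitioning with priority classes means a high-priority pod can displace a lower-priority one when GPU nodes are full. The following is a deliberately simplified sketch of that idea, not the scheduler's actual preemption algorithm; the pod names and priority values are hypothetical.

```python
def pick_victim(running, incoming_priority):
    """Return the lowest-priority running pod that the incoming pod outranks.

    `running` maps pod name to priority. Returns None if nothing can be preempted.
    """
    candidates = [
        (priority, name)
        for name, priority in running.items()
        if priority < incoming_priority
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical pods occupying the GPUs of a full node.
running = {"batch-train": 100, "notebook": 500}

print(pick_victim(running, incoming_priority=1000))
print(pick_victim(running, incoming_priority=50))
```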
  29. OPENSHIFT: Hard/Soft Partitioning combining both

    CONTROL PLANE: 3 Master Nodes. CPU WORKERS: 3 CPU Nodes. GPU WORKERS: 3 GPU Nodes, GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY.
  30. OPENSHIFT: QUOTAS per namespace or across multiple namespaces, for CPU, memory and extended resources

    CONTROL PLANE: 3 Master Nodes. CPU WORKERS: 3 CPU Nodes. GPU WORKERS: 3 GPU Nodes, GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS.
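
A per-namespace quota on an extended resource can be sketched as a simple admission check. Kubernetes ResourceQuota expresses extended resources with the `requests.` prefix (for example `requests.nvidia.com/gpu`); the `admit` helper and the numbers below are hypothetical.

```python
def admit(quota_hard, used, pod_requests):
    """Check a pod's requests against a namespace quota.

    Returns (True, None) if the pod fits, else (False, exceeded_quota_key).
    """
    for resource, amount in pod_requests.items():
        key = f"requests.{resource}"
        if key in quota_hard and used.get(key, 0) + amount > quota_hard[key]:
            return False, key
    return True, None


# Hypothetical namespace quota: at most 4 GPUs, 3 already in use.
quota_hard = {"requests.nvidia.com/gpu": 4, "requests.cpu": 16}
used = {"requests.nvidia.com/gpu": 3, "requests.cpu": 2}

print(admit(quota_hard, used, {"nvidia.com/gpu": 1}))
print(admit(quota_hard, used, {"nvidia.com/gpu": 2}))
```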
  31. OPENSHIFT: Clustering nodes with specific roles

    CONTROL PLANE: 3 Master Nodes. CPU WORKERS: 3 CPU Nodes. GPU TRAINING: 3 GPU Nodes, GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS. GPU INFERENCE: 3 GPU Nodes, GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS.
  32. OPENSHIFT: Pods can be repelled, attracted or not scheduled with affinities

    CONTROL PLANE: 3 Master Nodes, INFRA ONLY PODS. CPU WORKERS: 3 CPU Nodes, CPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS. GPU TRAINING: 3 GPU Nodes, GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS. GPU INFERENCE: 3 GPU Nodes, GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS.
  33. OPENSHIFT: Multus & high speed interconnects

    CONTROL PLANE: 3 Master Nodes, INFRA ONLY PODS. CPU WORKERS: 3 CPU Nodes, CPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS. GPU WORKERS: 3 GPU Nodes, GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY.
  34. OPENSHIFT: RDMA over Infiniband or Ethernet

    CONTROL PLANE: 3 Master Nodes, INFRA ONLY PODS. CPU WORKERS: 3 CPU Nodes, CPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS. GPU WORKERS: 3 GPU Nodes, GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, RDMA.
  35. BARE METAL: Special Resource Operator

    Enable special resource the OpenShift way
    https://www.youtube.com/watch?v=TFP0oLG-ss8&feature=youtu.be Running the NV flowers demo on OpenShift
    https://www.youtube.com/watch?v=usV_STdcMHY&feature=youtu.be Running RAPIDS with GPUs on OpenShift
    How to use GPUs with OKD 4.5 https://bit.ly/3gA73M5
    Future Work: MIG Support, GPUDirect,
  36. Thank you

    Red Hat is the world’s leading provider of enterprise open source software solutions. Award-winning support, training, and consulting services make Red Hat a trusted adviser to the Fortune 500.

    linkedin.com/company/red-hat youtube.com/user/RedHatVideos facebook.com/redhatinc twitter.com/RedHat