
GPUs on Kubernetes, OpenShift & OKD, Zvonko Kaiser, Red Hat

A Pattern to Enable Hardware Accelerators

Zvonko Kaiser, Red Hat

Red Hat OpenShift

August 07, 2020

Transcript

  1. CONFIDENTIAL designator V0000000 GPUs on Kubernetes, OpenShift & OKD

    A Pattern to Enable Hardware Accelerators. Zvonko Kaiser
  2. AGENDA: What we'll be discussing today

    Container Engines, Bootstrap, Autoscaling, Metrics, Accelerator, Platform, Workload, Building Blocks
  3. BARE METAL: Runtime Hooks
  4. BARE METAL: Spawning and running containers

    Runtime: OCI specification, runc, containerd. Hooks: prestart, poststart, poststop. Hooks can be used to enhance the functionality of a container runtime, for example to mount files or configure cgroups.
  5. BARE METAL: Spawning and running containers

    Runtime: OCI specification, runc, containerd. Hook: the NVIDIA prestart hook bind mounts devices, binaries, and libraries. The NVIDIA prestart hook configures the container to use GPUs (mount files, configure cgroups).
  6. BARE METAL: SELinux
  7. BARE METAL (container and host): Container and host are two distinct SELinux domains.
  8. BARE METAL (container and host): Bind mounts introduce the host label/context into the container.
  9. BARE METAL (container and host): Privileged containers are not contained anymore.
  10. BARE METAL (container and host): Relabeling host files can break the host context.
  11. BARE METAL (container and host): An SELinux policy is needed to run a container unprivileged.
  12. BARE METAL: Summary & Resources, Bare Metal Enablement

    RHEL SELinux Policy for NVIDIA: https://github.com/NVIDIA/dgx-selinux
    How to enable NVIDIA GPUs in containers on bare metal in RHEL 8: https://www.redhat.com/en/blog/how-use-gpus-containers-bare-metal-rhel-8
    Simple Prestart Hook Implementation: https://github.com/zvonkok/oci-decorator
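Slides 4 and 5 describe prestart hooks that prepare a container before it starts. The following is a minimal Python sketch of that idea, not the NVIDIA hook itself. It relies on the OCI runtime convention that a hook receives the container state as JSON on stdin. The helper names, the example state values, the host paths, and the `<bundle>/rootfs` layout are assumptions made for illustration.

```python
import io
import json


def read_container_state(stream):
    """Parse the container state JSON that an OCI runtime writes to a hook's stdin."""
    state = json.load(stream)
    return {"id": state["id"], "bundle": state["bundle"], "pid": state.get("pid")}


def plan_bind_mounts(state, host_paths, rootfs_dir="rootfs"):
    """Return (source, target) pairs a prestart hook could bind mount.

    Assumption: the container root filesystem is <bundle>/<rootfs_dir>. A real
    hook would read root.path from the bundle's config.json instead.
    """
    rootfs = state["bundle"].rstrip("/") + "/" + rootfs_dir
    return [(path, rootfs + path) for path in host_paths]


# Example state shaped like the OCI state object (all values are made up).
example_stdin = io.StringIO(json.dumps({
    "ociVersion": "1.0.2",
    "id": "demo-container",
    "status": "created",
    "pid": 4242,
    "bundle": "/run/containers/demo",
}))

state = read_container_state(example_stdin)
# Illustrative host paths for a device node and a binary.
mounts = plan_bind_mounts(state, ["/dev/nvidia0", "/usr/bin/nvidia-smi"])
for source, target in mounts:
    print(f"would bind mount {source} -> {target}")
```

The sketch only plans the mounts. Performing them requires entering the container's mount namespace with elevated privileges, which is where the SELinux considerations from slides 7 to 11 apply.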
  13. OPENSHIFT: Bootstrap Heterogeneity
  14. OPENSHIFT: Use a MachineSet to scale the cluster with a GPU node

    Diagram: control plane (three master nodes) and CPU workers (three CPU nodes).
  15. OPENSHIFT: Heterogeneous cluster with different compute units

    Diagram: control plane (three master nodes), CPU workers (three CPU nodes), and GPU workers (one GPU node).
  16. OPENSHIFT (node and accelerator): Every node could have features that are interesting to different pods

    Examples: GPU, FPGA, NIC, AVX512, kernel 4.18.0-80.1.2, RHEL 7, RHEL 8, RHCOS
  17. OPENSHIFT (node and accelerator): Node Feature Discovery exposes node features as labels

    pci-10de=present, pci-8086=present, pci-1924=present, cpuid-AVX512=true, kernel-ver.major=4, os_release.ID=rhel
  18. OPENSHIFT (node and accelerator): Optimized workloads can be placed on the right node

    Labels: pci-10de=present, pci-8086=present, pci-1924=present, cpuid-AVX512=true, kernel-ver.major=4, os_release.ID=rhel. The diagram shows six pods being placed.
  19. BARE METAL: Node Feature Discovery, exposing node features the easy way since 4.2

    Upstream NFD & Operator: https://github.com/kubernetes-sigs/node-feature-discovery
    Downstream NFD & Operator: https://github.com/openshift/node-feature-discovery
    Building Multiarch imageStream with the NFD Operator and OpenShift 4: https://www.openshift.com/blog/building-multiarch-imagestream-with-the-nfd-operator-and-openshift-4
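Slides 17 and 18 show labels driving pod placement. A rough Python sketch of the equality-based matching behind a node selector is given here, using the abbreviated label keys from the slides (labels published by NFD on a real cluster carry a `feature.node.kubernetes.io/` prefix). The node names and the two helper functions are hypothetical.

```python
def matches_selector(node_labels, selector):
    """True when every key/value pair in the selector is present on the node."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def eligible_nodes(nodes, selector):
    """Return the sorted names of nodes whose labels satisfy the selector."""
    return sorted(name for name, labels in nodes.items()
                  if matches_selector(labels, selector))


# Hypothetical nodes carrying the abbreviated labels shown on slide 17.
nodes = {
    "gpu-node": {"pci-10de": "present", "kernel-ver.major": "4", "os_release.ID": "rhel"},
    "cpu-node-a": {"pci-8086": "present", "cpuid-AVX512": "true",
                   "kernel-ver.major": "4", "os_release.ID": "rhel"},
    "cpu-node-b": {"pci-8086": "present", "kernel-ver.major": "4", "os_release.ID": "rhel"},
}

gpu_nodes = eligible_nodes(nodes, {"pci-10de": "present"})
avx_nodes = eligible_nodes(nodes, {"cpuid-AVX512": "true", "os_release.ID": "rhel"})
print(gpu_nodes)
print(avx_nodes)
```

An empty selector matches every node, which mirrors how a pod without a node selector can land anywhere.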
  20. OPENSHIFT: Enable accelerators the OpenShift way
  21. OPENSHIFT ACCELERATOR STACK: Driver delivery has to work on RHEL and RHCOS

    New layer: driver-container (modules, userspace, hook).
  22. OPENSHIFT ACCELERATOR STACK: Every step has to be validated; otherwise it makes no sense to advance

    New layer: driver-container-validation (small workload using the accelerator).
  23. OPENSHIFT ACCELERATOR STACK: The device plugin performs health checks and updates the node's capacity

    New layer: device-plugin (expose the accelerator to the cluster).
  24. OPENSHIFT ACCELERATOR STACK: A Pod must be able to allocate an extended resource

    New layer: device-plugin-validation (allocate the resource and use it).
  25. OPENSHIFT ACCELERATOR STACK: Special resource node-exporter registering with the cluster stack

    New layer: monitoring (set up Prometheus and Grafana, metrics and alerts).
  26. OPENSHIFT ACCELERATOR STACK: Discover advanced features of a special resource

    New layer: feature-discovery (sidecar container for NFD, fine-grained scheduling).
  27. OPENSHIFT ACCELERATOR STACK: The SRO is a pattern to enable special resources in OpenShift

    SPECIAL RESOURCE OPERATOR stack: driver-container (modules, userspace, hook); driver-container-validation (small workload using the accelerator); device-plugin (expose the accelerator to the cluster); device-plugin-validation (allocate the resource and use it); monitoring (set up Prometheus and Grafana, metrics and alerts); feature-discovery (sidecar container for NFD, fine-grained scheduling).
  28. OPENSHIFT ACCELERATOR STACK: The SRO is a pattern to enable special resources in OpenShift

    SPECIAL RESOURCE OPERATOR stack: driver-container, driver-container-validation, device-plugin, device-plugin-validation, monitoring, feature-discovery. Configurations: hard or soft partitioning, driver version, custom manifests.
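Slides 23 and 24 state that the device plugin updates node capacity and that a Pod must be able to allocate an extended resource. As a small illustration, the Python sketch below builds a Pod manifest that requests `nvidia.com/gpu` (the resource name advertised by the NVIDIA device plugin) and checks it against a node's allocatable resources. The image name, pod name, and helper functions are hypothetical, and the fit check is a simplification of what the scheduler does.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest requesting whole GPUs through resource limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": name,
                "image": image,
                # Extended resources are requested as whole integers.
                "resources": {"limits": {GPU_RESOURCE: gpus}},
            }],
        },
    }


def requested(manifest, resource=GPU_RESOURCE):
    """Sum the limit for one resource across all containers in the Pod."""
    return sum(int(c.get("resources", {}).get("limits", {}).get(resource, 0))
               for c in manifest["spec"]["containers"])


def fits(manifest, allocatable, resource=GPU_RESOURCE):
    """Simplified check: does the node advertise enough of the resource?"""
    return requested(manifest, resource) <= int(allocatable.get(resource, 0))


# Hypothetical validation workload, in the spirit of device-plugin-validation.
pod = gpu_pod_manifest("gpu-validation", "example.com/cuda-vectoradd:latest", gpus=1)
print(fits(pod, {"cpu": "8", GPU_RESOURCE: "2"}))
print(fits(pod, {"cpu": "8"}))
```

A node without the device plugin advertises no GPU capacity, so the second check fails, which is the situation the validation layers in the stack are meant to catch.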
  29. OPENSHIFT: NFD detects kernel version and labels node

    Diagram: GPU worker node (CoreOS, kernel 4.18.0-80) with two GPUs, kubelet, CRI-O, and the NFD worker DaemonSet; Kubernetes API (master); label kernel=4.18.0-80.
  30. OPENSHIFT: SRO builds driver container image against kernel

    Diagram: GPU worker node (CoreOS, kernel 4.18.0-80) with two GPUs, kubelet, and CRI-O; label kernel=4.18.0-80; Special Resource Operator (SRO); container build (driver-container-4.18.0-80); image registry.
  31. OPENSHIFT: SRO targets specific kernel version hosts

    Diagram: GPU worker node (CoreOS, kernel 4.18.0-80) with two GPUs, kubelet, and CRI-O; label kernel=4.18.0-80; GPU driver DaemonSet running driver-container-4.18.0-80; Special Resource Operator (SRO).
  32. OPENSHIFT: NFD detects updated kernel and relabels node

    Diagram: GPU worker node (CoreOS, kernel 4.18.0-147*) with two GPUs, kubelet, CRI-O, and the NFD worker DaemonSet; Kubernetes API (master); label kernel=4.18.0-147*.
  33. OPENSHIFT: SRO detects mismatch and rebuilds driver container

    Diagram: GPU worker node (CoreOS, kernel 4.18.0-147*) with two GPUs, kubelet, and CRI-O; label kernel=4.18.0-147*; Special Resource Operator (SRO); BuildConfig (driver-container-4.18.0-147); image registry; GPU driver DaemonSet still running driver-container-4.18.0-80.
  34. OPENSHIFT: SRO updates daemonset with new image

    Diagram: GPU worker node (CoreOS, kernel 4.18.0-147) with two GPUs, kubelet, and CRI-O; label kernel=4.18.0-147; GPU driver DaemonSet running driver-container-4.18.0-147; Special Resource Operator (SRO).
  35. BARE METAL: Special Resource Operator, enable special resources the OpenShift way

    Special Resource Operator: https://bit.ly/31utbkm
    How to use entitled image builds to build DriverContainers with UBI on OpenShift: https://red.ht/2XyOKz6
    Part 1: How to Enable Hardware Accelerators on OpenShift: https://red.ht/2JQuNwB
    Part 2: How to enable Hardware Accelerators on OpenShift, SRO Building Blocks: https://red.ht/34ubzq3
    Simplifying deployments of accelerated AI workloads on Red Hat OpenShift with NVIDIA GPU Operator: https://red.ht/34ubzq3
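Slides 29 to 34 walk through a reconcile loop: the node's kernel label changes, the operator notices the mismatch, rebuilds the driver container, and updates the DaemonSet. The Python sketch below models only that decision logic under stated assumptions. It is not the SRO's actual code; the function names and action tuples are invented for illustration, and the image naming follows the `driver-container-<kernel>` pattern shown on the slides.

```python
def driver_image_for(kernel_version):
    """Name the driver container image after the kernel it was built against."""
    return f"driver-container-{kernel_version}"


def reconcile(node_kernel_label, daemonset_image, registry_images):
    """Return the ordered actions needed to match the node's kernel label.

    node_kernel_label: kernel version taken from the NFD label on the node.
    daemonset_image:   image currently run by the GPU driver DaemonSet.
    registry_images:   set of image names already present in the registry.
    """
    desired = driver_image_for(node_kernel_label)
    actions = []
    if desired not in registry_images:
        # No image for this kernel yet: trigger a build (slides 30 and 33).
        actions.append(("build", desired))
    if daemonset_image != desired:
        # The DaemonSet runs a stale image: roll it forward (slide 34).
        actions.append(("update-daemonset", desired))
    return actions


# Kernel was updated from 4.18.0-80 to 4.18.0-147, as on slides 32 to 34.
after_update = reconcile(
    "4.18.0-147",
    "driver-container-4.18.0-80",
    {"driver-container-4.18.0-80"},
)
# Steady state: image exists and the DaemonSet already uses it.
steady_state = reconcile(
    "4.18.0-147",
    "driver-container-4.18.0-147",
    {"driver-container-4.18.0-80", "driver-container-4.18.0-147"},
)
print(after_update)
print(steady_state)
```

Returning a list of actions rather than performing them keeps the sketch idempotent: running it again after the actions complete yields an empty list.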
  36. OPENSHIFT: Initial GPU cluster configuration

    Diagram: control plane (three master nodes), CPU workers (three CPU nodes), and GPU workers (one GPU node).
  37. OPENSHIFT: Use the Cluster Autoscaler for on-demand GPU nodes

    Diagram: control plane, CPU workers, and GPU workers, with two additional GPU nodes.
  38. OPENSHIFT: Hard partitioning with taints and tolerations

    Diagram: control plane, CPU workers, and GPU workers; the GPU nodes accept GPU-only pods.
  39. OPENSHIFT: Soft partitioning with priority classes

    Diagram: control plane, CPU workers, and GPU workers; pods on the GPU nodes are marked high priority or low priority.
  40. OPENSHIFT: Hard and soft partitioning, combining both

    Diagram: control plane, CPU workers, and GPU workers; GPU-only pods with high and low priority.
  41. OPENSHIFT: Quotas per namespace, or across multiple namespaces, for CPU, memory, and extended resources

    Diagram: control plane, CPU workers, and GPU workers; GPU-only pods with high and low priority, plus quotas.
  42. OPENSHIFT: Clustering nodes with specific roles

    Diagram: control plane and CPU workers; GPU nodes split into a GPU TRAINING group and a GPU INFERENCE group, each with GPU-only pods, high and low priority, and quotas.
  43. OPENSHIFT: Pods can be repelled, attracted, or not scheduled using affinities

    Diagram: control plane and CPU workers; GPU TRAINING and GPU INFERENCE groups, each with GPU-only pods, high and low priority, and quotas; CPU-only pods with high and low priority and quotas; infra-only pods.
  44. OPENSHIFT: Multus and high-speed interconnects

    Diagram: control plane, CPU workers, and GPU workers; GPU-only pods with high and low priority; CPU-only pods with high and low priority and quotas; infra-only pods.
  45. OPENSHIFT: RDMA over InfiniBand or Ethernet

    Diagram: control plane, CPU workers, and GPU workers; GPU-only pods with high and low priority; CPU-only pods with high and low priority and quotas; infra-only pods; RDMA.
  46. BARE METAL: Special Resource Operator, enable special resources the OpenShift way

    Running the NV flowers demo on OpenShift: https://www.youtube.com/watch?v=TFP0oLG-ss8&feature=youtu.be
    Running RAPIDS with GPUs on OpenShift: https://www.youtube.com/watch?v=usV_STdcMHY&feature=youtu.be
    How to use GPUs with OKD 4.5: https://bit.ly/3gA73M5
    Future Work: MIG Support, GPUDirect,
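Slide 38 describes hard partitioning with taints and tolerations. The Python sketch below approximates the matching rule Kubernetes applies: a pod can be scheduled onto a node only if every `NoSchedule` or `NoExecute` taint on that node is tolerated. The taint key `gpu-only` and the helper names are hypothetical, and details such as `tolerationSeconds` are left out.

```python
BLOCKING_EFFECTS = ("NoSchedule", "NoExecute")


def tolerates(toleration, taint):
    """Check one toleration against one taint (simplified Kubernetes rules)."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return toleration.get("key") in (None, "", taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(pod_tolerations, node_taints):
    """True when every blocking taint on the node is tolerated by the pod."""
    return all(
        any(tolerates(toleration, taint) for toleration in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in BLOCKING_EFFECTS
    )


# Hypothetical taint reserving GPU nodes for GPU-only pods.
gpu_node_taints = [{"key": "gpu-only", "value": "true", "effect": "NoSchedule"}]
gpu_pod_tolerations = [{"key": "gpu-only", "operator": "Exists", "effect": "NoSchedule"}]
cpu_pod_tolerations = []

print(schedulable(gpu_pod_tolerations, gpu_node_taints))
print(schedulable(cpu_pod_tolerations, gpu_node_taints))
print(schedulable(cpu_pod_tolerations, []))
```

A toleration only permits placement on the tainted node; it does not attract the pod there. Steering GPU pods toward GPU nodes still needs a node selector or affinity, as slide 43 notes.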
  47. Thank you

    Red Hat is the world’s leading provider of enterprise open source software solutions. Award-winning support, training, and consulting services make Red Hat a trusted adviser to the Fortune 500.
    linkedin.com/company/red-hat, youtube.com/user/RedHatVideos, facebook.com/redhatinc, twitter.com/RedHat