GPUs on Kubernetes, OpenShift & OKD, Zvonko Kaiser, Red Hat

A Pattern to Enable Hardware Accelerators

Zvonko Kaiser, Red Hat

Red Hat Livestreaming

August 07, 2020

Transcript

  1. A Pattern to Enable Hardware Accelerators
    GPUs on Kubernetes, OpenShift & OKD
    Zvonko Kaiser

  2. AGENDA
    What we'll be discussing today
    Container Engines
    Bootstrap
    Autoscaling
    Metrics
    Accelerator
    Platform
    Workload
    Building Blocks

  3. BARE METAL
    Runtime Hooks

  4. BARE METAL
    RUNTIME
    Spawning and running containers
    OCI specification
    - runc
    - containerd
    HOOKS
    - Prestart
    - Poststart
    - Poststop
    Hooks can be used to enhance the functionality of a container runtime
    - mount files
    - configure cgroups
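
As a minimal sketch of what a hook receives: under the OCI runtime specification, the runtime passes the container state to a hook as JSON on stdin. The helper name `describe_container` and the example values below are assumptions for illustration, not taken from the slides.

```python
import json
import posixpath


def describe_container(state_json: str) -> dict:
    """Parse the container state JSON that the runtime writes to a hook's stdin."""
    state = json.loads(state_json)
    return {
        "id": state["id"],
        "pid": state.get("pid"),
        # The bundle directory holds the container's config.json.
        "config_path": posixpath.join(state["bundle"], "config.json"),
    }


# Hypothetical state document; a real hook would read it with sys.stdin.read().
example_state = json.dumps({
    "ociVersion": "1.0.2",
    "id": "demo-container",
    "status": "created",
    "pid": 4242,
    "bundle": "/run/containers/demo",
})
info = describe_container(example_state)
print(info["id"], info["pid"], info["config_path"])
```

With the pid and the bundle path in hand, a prestart hook can act on the container before the user process starts, for example to mount files or configure cgroups.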

  5. BARE METAL
    RUNTIME
    Spawning and running containers
    OCI specification
    - runc
    - containerd
    HOOKS
    NVIDIA prestart hook
    Bind mount of
    - devices
    - binaries
    - libraries
    The NVIDIA prestart hook configures the container to use GPUs
    - mount files
    - configure cgroups

  6. BARE METAL
    SELinux

  7. BARE METAL
    CONTAINER
    HOST
    The container and the host are two distinct SELinux domains

  8. BARE METAL
    CONTAINER
    HOST
    Bind mounts introduce host label/context into the container

  9. BARE METAL
    CONTAINER
    HOST
    Privileged containers are not contained anymore

  10. BARE METAL
    CONTAINER
    HOST
    Relabeling host files can break host context

  11. BARE METAL
    CONTAINER
    HOST
    An SELinux policy is needed to run a container unprivileged

  12. BARE METAL
    Summary & Resources
    Bare Metal Enablement
    - RHEL SELinux Policy for NVIDIA
      https://github.com/NVIDIA/dgx-selinux
    - How to enable NVIDIA GPUs in containers on bare metal in RHEL 8
      https://www.redhat.com/en/blog/how-use-gpus-containers-bare-metal-rhel-8
    - Simple Prestart Hook Implementation
      https://github.com/zvonkok/oci-decorator

  13. OPENSHIFT
    Bootstrap Heterogeneity

  14. OPENSHIFT
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes]
    Use a MachineSet to scale the cluster with a GPU node
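
One way to picture this step is to derive a GPU MachineSet from an existing CPU one. The sketch below works on plain dictionaries and assumes an AWS-style `providerSpec` with an `instanceType` field; the MachineSet names and instance types are hypothetical, and a real manifest carries many more fields.

```python
import copy

MACHINESET_LABEL = "machine.openshift.io/cluster-api-machineset"


def gpu_machineset_from(cpu_machineset, name, instance_type, replicas=1):
    """Copy a MachineSet and retarget it at a GPU instance type."""
    ms = copy.deepcopy(cpu_machineset)
    ms["metadata"]["name"] = name
    ms["spec"]["replicas"] = replicas
    # Selector and template labels must point at the new MachineSet name.
    ms["spec"]["selector"]["matchLabels"][MACHINESET_LABEL] = name
    template = ms["spec"]["template"]
    template["metadata"]["labels"][MACHINESET_LABEL] = name
    template["spec"]["providerSpec"]["value"]["instanceType"] = instance_type
    return ms


# Heavily trimmed, hypothetical CPU MachineSet.
cpu_ms = {
    "apiVersion": "machine.openshift.io/v1beta1",
    "kind": "MachineSet",
    "metadata": {"name": "demo-worker-a", "namespace": "openshift-machine-api"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {MACHINESET_LABEL: "demo-worker-a"}},
        "template": {
            "metadata": {"labels": {MACHINESET_LABEL: "demo-worker-a"}},
            "spec": {"providerSpec": {"value": {"instanceType": "m5.xlarge"}}},
        },
    },
}
gpu_ms = gpu_machineset_from(cpu_ms, "demo-gpu-a", "p3.2xlarge")
print(gpu_ms["metadata"]["name"], gpu_ms["spec"]["replicas"])
```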

  15. OPENSHIFT
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU WORKERS with one GPU node]
    Heterogeneous cluster with different compute units

  16. OPENSHIFT
    NODE
    Prestart
    ACCELERATOR
    Poststart
    Poststop
    Every node could have features that are interesting to different pods
    GPU
    FPGA
    NIC
    AVX512
    4.18.0-80.1.2
    RHEL7, 8, RHCOS

  17. OPENSHIFT
    NODE
    Prestart
    ACCELERATOR
    Poststart
    Poststop
    Node Feature Discovery exposes node features as labels
    pci-10de=present
    pci-8086=present
    pci-1924=present
    cpuid-AVX512=true
    kernel-ver.major=4
    os_release.ID=rhel
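
To illustrate how such labels steer placement, here is a small sketch of node selector matching. It uses the abbreviated labels exactly as they appear on the slide; real NFD labels carry a prefix such as `feature.node.kubernetes.io/`, and the pod name and image below are hypothetical.

```python
def node_matches(node_labels: dict, node_selector: dict) -> bool:
    """A node qualifies only if it carries every label the pod selects on."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Labels as abbreviated on the slide (assumption: no prefix, for readability).
gpu_node = {"pci-10de": "present", "cpuid-AVX512": "true",
            "kernel-ver.major": "4", "os_release.ID": "rhel"}
cpu_node = {"cpuid-AVX512": "true", "kernel-ver.major": "4", "os_release.ID": "rhel"}

# Hypothetical pod that should only land on nodes with an NVIDIA PCI device.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-workload"},
    "spec": {
        "nodeSelector": {"pci-10de": "present"},
        "containers": [{"name": "main", "image": "example.com/gpu-workload:latest"}],
    },
}
selector = pod["spec"]["nodeSelector"]
print(node_matches(gpu_node, selector), node_matches(cpu_node, selector))
```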

  18. OPENSHIFT
    NODE
    Prestart
    ACCELERATOR
    Poststart
    Poststop
    Optimized workloads can be placed on the right node
    pci-10de=present
    pci-8086=present
    pci-1924=present
    cpuid-AVX512=true
    kernel-ver.major=4
    os_release.ID=rhel
    [Diagram: six pods]

  19. BARE METAL
    Node Feature Discovery
    Exposing node features the easy way since 4.2
    - Upstream NFD & Operator
      https://github.com/kubernetes-sigs/node-feature-discovery
    - Downstream NFD & Operator
      https://github.com/openshift/node-feature-discovery
    - Building Multiarch imageStream with the NFD Operator and OpenShift 4
      https://www.openshift.com/blog/building-multiarch-imagestream-with-the-nfd-operator-and-openshift-4

  20. OPENSHIFT
    Enable accelerators the OpenShift way

  21. OPENSHIFT
    Driver delivery has to work on RHEL and RHCOS
    ACCELERATOR STACK
    - driver-container: Modules, userspace, hook

  22. OPENSHIFT
    Every step has to be validated; it makes no sense to advance if a step fails
    ACCELERATOR STACK
    - driver-container: Modules, userspace, hook
    - driver-container-validation: Small workload using the accelerator

  23. OPENSHIFT
    The device plugin does health checks and updates the node's capacity
    ACCELERATOR STACK
    - driver-container: Modules, userspace, hook
    - driver-container-validation: Small workload using the accelerator
    - device-plugin: Expose the accelerator to the cluster

  24. OPENSHIFT
    A Pod must be able to allocate an extended resource
    ACCELERATOR STACK
    - driver-container: Modules, userspace, hook
    - driver-container-validation: Small workload using the accelerator
    - device-plugin: Expose the accelerator to the cluster
    - device-plugin-validation: Allocate resource and use it
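
As a rough sketch of what allocating an extended resource looks like from the pod side: the container lists the resource under `resources.limits`, and the pod only fits on a node that advertises enough of it. The resource name `nvidia.com/gpu` is the one the NVIDIA device plugin advertises; the helper names, pod name and image are assumptions.

```python
def gpu_pod(name: str, image: str, gpus: int = 1, resource: str = "nvidia.com/gpu") -> dict:
    """Build a minimal pod manifest that requests an extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "main",
                "image": image,
                "resources": {"limits": {resource: gpus}},
            }],
        },
    }


def fits(allocatable: dict, pod: dict) -> bool:
    """Check whether a node advertises enough of every resource the pod limits."""
    needed = {}
    for container in pod["spec"]["containers"]:
        for res, qty in container.get("resources", {}).get("limits", {}).items():
            needed[res] = needed.get(res, 0) + int(qty)
    return all(int(allocatable.get(res, 0)) >= qty for res, qty in needed.items())


validation_pod = gpu_pod("device-plugin-validation", "example.com/vectoradd:latest")
print(fits({"nvidia.com/gpu": 1}, validation_pod), fits({}, validation_pod))
```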

  25. OPENSHIFT
    Special resource node-exporter registering with the cluster stack
    ACCELERATOR STACK
    - driver-container: Modules, userspace, hook
    - driver-container-validation: Small workload using the accelerator
    - device-plugin: Expose the accelerator to the cluster
    - device-plugin-validation: Allocate resource and use it
    - monitoring: Set up Prometheus and Grafana, metrics and alerts

  26. OPENSHIFT
    Discover advanced features of a special resource
    ACCELERATOR STACK
    - driver-container: Modules, userspace, hook
    - driver-container-validation: Small workload using the accelerator
    - device-plugin: Expose the accelerator to the cluster
    - device-plugin-validation: Allocate resource and use it
    - monitoring: Set up Prometheus and Grafana, metrics and alerts
    - feature-discovery: Sidecar container for NFD, fine-grained scheduling

  27. OPENSHIFT
    The SRO is a pattern to enable special resources in OpenShift
    SPECIAL RESOURCE OPERATOR
    ACCELERATOR STACK
    - driver-container: Modules, userspace, hook
    - driver-container-validation: Small workload using the accelerator
    - device-plugin: Expose the accelerator to the cluster
    - device-plugin-validation: Allocate resource and use it
    - monitoring: Set up Prometheus and Grafana, metrics and alerts
    - feature-discovery: Sidecar container for NFD, fine-grained scheduling

  28. OPENSHIFT
    The SRO is a pattern to enable special resources in OpenShift
    SPECIAL RESOURCE OPERATOR
    ACCELERATOR STACK
    - driver-container
    - driver-container-validation
    - device-plugin
    - device-plugin-validation
    - monitoring
    - feature-discovery
    CONFIGURATIONS
    - Hard or Soft Partitioning
    - Driver Version
    - Custom Manifests

  29. OPENSHIFT
    NFD detects kernel version and labels node
    [Diagram: Worker Node (CoreOS : kernel 4.18.0-80) with three GPUs, kubelet and CRI-O; NFD Worker Daemonset; kubernetes API (Master); label kernel=4.18.0-80]

  30. OPENSHIFT
    SRO builds driver container image against kernel
    [Diagram: Worker Node (CoreOS : kernel 4.18.0-80) with three GPUs, kubelet and CRI-O; label kernel=4.18.0-80; Special Resource Operator (SRO); Container build (driver-container-4.18.0-80); Image registry]

  31. OPENSHIFT
    SRO targets specific kernel version hosts
    [Diagram: Worker Node (CoreOS : kernel 4.18.0-80) with three GPUs, kubelet and CRI-O; label kernel=4.18.0-80; GPU Driver Daemonset; Special Resource Operator (SRO); driver-container-4.18.0-80]

  32. OPENSHIFT
    NFD detects updated kernel and relabels node
    [Diagram: Worker Node (CoreOS : kernel 4.18.0-147*) with three GPUs, kubelet and CRI-O; NFD Worker Daemonset; kubernetes API (Master); label kernel=4.18.0-147*]

  33. OPENSHIFT
    SRO detects mismatch and rebuilds driver container
    [Diagram: Worker Node (CoreOS : kernel 4.18.0-147*) with three GPUs, kubelet and CRI-O; label kernel=4.18.0-147*; Special Resource Operator (SRO); BuildConfig (driver-container-4.18.0-147); Image registry; GPU Driver Daemonset; driver-container-4.18.0-80]

  34. OPENSHIFT
    SRO updates daemonset with new image
    [Diagram: Worker Node (CoreOS : kernel 4.18.0-147) with three GPUs, kubelet and CRI-O; label kernel=4.18.0-147; GPU Driver Daemonset; Special Resource Operator (SRO); driver-container-4.18.0-147]
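
The kernel update flow on the preceding slides can be summarized as a small reconcile step. This is only a sketch of the idea, not the Special Resource Operator's actual code; the function names and the action tuples are assumptions, while the image naming follows the `driver-container-<kernel>` pattern shown on the slides.

```python
def desired_driver_image(kernel_label: str, prefix: str = "driver-container") -> str:
    """Name of the driver image that matches a node's kernel label."""
    return f"{prefix}-{kernel_label}"


def reconcile(node_kernel: str, daemonset_image: str, built_images: set) -> list:
    """Return the actions needed so the driver DaemonSet matches the node kernel."""
    wanted = desired_driver_image(node_kernel)
    actions = []
    if wanted not in built_images:
        actions.append(("build", wanted))             # rebuild against the new kernel
    if daemonset_image != wanted:
        actions.append(("update-daemonset", wanted))  # roll out the new image
    return actions


registry = {"driver-container-4.18.0-80"}
steady = reconcile("4.18.0-80", "driver-container-4.18.0-80", registry)
after_update = reconcile("4.18.0-147", "driver-container-4.18.0-80", registry)
print(steady)
print(after_update)
```

When the NFD label and the running image agree, nothing happens; after a kernel update the sketch yields a build followed by a DaemonSet update, mirroring the sequence on the slides.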

  35. BARE METAL
    Special Resource Operator
    Enable special resources the OpenShift way
    - Special Resource Operator
      https://bit.ly/31utbkm
    - How to use entitled image builds to build DriverContainers with UBI on OpenShift
      https://red.ht/2XyOKz6
    - Part 1: How to Enable Hardware Accelerators on OpenShift
      https://red.ht/2JQuNwB
    - Part 2: How to enable Hardware Accelerators on OpenShift, SRO Building Blocks
      https://red.ht/34ubzq3
    - Simplifying deployments of accelerated AI workloads on Red Hat OpenShift with NVIDIA GPU Operator
      https://red.ht/34ubzq3

  36. OPENSHIFT
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU WORKERS with one GPU node]
    Initial GPU cluster configuration

  37. OPENSHIFT
    Use the Cluster Autoscaler for on-demand GPU nodes
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU WORKERS with three GPU nodes]
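
On OpenShift, on-demand scaling of a MachineSet is typically described with a MachineAutoscaler that the Cluster Autoscaler acts on. The sketch below builds such a manifest as a dictionary and shows the replica bounds it implies; the MachineSet name and the `clamp_replicas` helper are hypothetical.

```python
def machine_autoscaler(machineset_name: str, min_replicas: int, max_replicas: int) -> dict:
    """Build a MachineAutoscaler manifest that targets one MachineSet."""
    return {
        "apiVersion": "autoscaling.openshift.io/v1beta1",
        "kind": "MachineAutoscaler",
        "metadata": {"name": machineset_name, "namespace": "openshift-machine-api"},
        "spec": {
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "scaleTargetRef": {
                "apiVersion": "machine.openshift.io/v1beta1",
                "kind": "MachineSet",
                "name": machineset_name,
            },
        },
    }


def clamp_replicas(autoscaler: dict, wanted: int) -> int:
    """Replica count the autoscaler may set, given how many nodes are wanted."""
    spec = autoscaler["spec"]
    return max(spec["minReplicas"], min(spec["maxReplicas"], wanted))


gpu_autoscaler = machine_autoscaler("demo-gpu-a", 1, 3)
print(clamp_replicas(gpu_autoscaler, 5), clamp_replicas(gpu_autoscaler, 0))
```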

  38. OPENSHIFT
    Hard Partitioning - Taints and Tolerations
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU WORKERS with three GPU nodes, marked GPU ONLY PODS]
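
Hard partitioning can be sketched as a check of pod tolerations against node taints. The matching rules below follow the Kubernetes semantics for the `Equal` and `Exists` operators in simplified form; the taint key `nvidia.com/gpu` is an assumed example, not something the slides specify.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified Kubernetes rule for whether one toleration matches one taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A pod may land on a node only if every hard taint is tolerated."""
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in hard)


gpu_node_taints = [{"key": "nvidia.com/gpu", "value": "", "effect": "NoSchedule"}]
gpu_pod_tolerations = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]
print(schedulable(gpu_pod_tolerations, gpu_node_taints), schedulable([], gpu_node_taints))
```

Pods without the toleration are kept off the tainted GPU nodes, which is what reserves those nodes for GPU-only pods.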

  39. OPENSHIFT
    Soft Partitioning - Priority Classes
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU WORKERS with three GPU nodes, marked HIGH PRIORITY and LOW PRIORITY]
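
Soft partitioning relies on PriorityClass objects: when GPUs run short, lower-priority pods can be preempted in favor of higher-priority ones. The following is a heavily simplified sketch of that idea, not the scheduler's real preemption algorithm; the class names, values and the `pick_victims` helper are assumptions.

```python
def priority_class(name: str, value: int) -> dict:
    """Build a minimal PriorityClass manifest."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
    }


def pick_victims(running: list, incoming_priority: int, gpus_needed: int) -> list:
    """Choose lowest-priority pods to evict until enough GPUs are freed."""
    victims, freed = [], 0
    for name, priority, gpus in sorted(running, key=lambda pod: pod[1]):
        if freed >= gpus_needed:
            break
        if priority < incoming_priority:
            victims.append(name)
            freed += gpus
    # Evict nothing unless the eviction actually makes room.
    return victims if freed >= gpus_needed else []


high = priority_class("gpu-high-priority", 1000)
low = priority_class("gpu-low-priority", 10)
running = [("train-a", high["value"], 1), ("batch-b", low["value"], 1)]
print(pick_victims(running, incoming_priority=1000, gpus_needed=1))
```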

  40. OPENSHIFT
    Hard/Soft Partitioning - combining both
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU WORKERS with three GPU nodes, marked GPU ONLY PODS, HIGH PRIORITY and LOW PRIORITY]

  41. OPENSHIFT
    Quotas per namespace or across multiple namespaces for CPU, memory and extended resources
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU WORKERS with three GPU nodes, marked GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY and QUOTAS]
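
A per-namespace quota on an extended resource can be sketched with a ResourceQuota manifest. For extended resources Kubernetes accepts only the `requests.` prefix in quota keys, hence `requests.nvidia.com/gpu`; the namespace, the numbers and the `within_quota` helper are hypothetical, and the helper only handles plain integer quantities.

```python
def resource_quota(name: str, namespace: str, hard: dict) -> dict:
    """Build a ResourceQuota manifest for one namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"hard": hard},
    }


def within_quota(hard: dict, used: dict, requested: dict) -> bool:
    """Check integer-valued requests against the remaining quota."""
    return all(
        int(used.get(key, 0)) + int(amount) <= int(hard[key])
        for key, amount in requested.items()
        if key in hard
    )


quota = resource_quota("gpu-quota", "team-a", {
    "requests.cpu": "8",
    "requests.memory": "32Gi",
    "requests.nvidia.com/gpu": "2",
})
hard = quota["spec"]["hard"]
print(within_quota(hard, {"requests.nvidia.com/gpu": 1}, {"requests.nvidia.com/gpu": 1}))
print(within_quota(hard, {"requests.nvidia.com/gpu": 2}, {"requests.nvidia.com/gpu": 1}))
```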

  42. OPENSHIFT
    Clustering nodes with specific roles
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU TRAINING with three GPU nodes; GPU INFERENCE with three GPU nodes; each GPU group marked GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY and QUOTAS]

  43. OPENSHIFT
    Pods can be repelled, attracted or not scheduled with affinities
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU TRAINING with three GPU nodes; GPU INFERENCE with three GPU nodes; each GPU group marked GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY and QUOTAS; additional labels: CPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS, INFRA ONLY PODS]
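
Node affinity rules can be read as a small boolean expression over node labels: the terms are ORed, and the expressions inside one term are ANDed. The sketch below evaluates a required node affinity against a node's labels; the label key `example.com/gpu-role` and its values are hypothetical stand-ins for the training and inference roles.

```python
def expr_matches(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpression against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def affinity_allows(labels: dict, terms: list) -> bool:
    """nodeSelectorTerms are ORed; matchExpressions within a term are ANDed."""
    return any(
        all(expr_matches(labels, expr) for expr in term["matchExpressions"])
        for term in terms
    )


# Hypothetical required node affinity for a training pod.
training_terms = [{"matchExpressions": [
    {"key": "example.com/gpu-role", "operator": "In", "values": ["training"]},
]}]
training_node = {"example.com/gpu-role": "training"}
inference_node = {"example.com/gpu-role": "inference"}
print(affinity_allows(training_node, training_terms),
      affinity_allows(inference_node, training_terms))
```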

  44. OPENSHIFT
    Multus & high speed interconnects
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU WORKERS with three GPU nodes; labels: GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, CPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS, INFRA ONLY PODS]

  45. OPENSHIFT
    RDMA over Infiniband or Ethernet
    [Diagram: CONTROL PLANE with three master nodes; CPU WORKERS with three CPU nodes; GPU WORKERS with three GPU nodes; labels: GPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, CPU ONLY PODS, HIGH PRIORITY, LOW PRIORITY, QUOTAS, INFRA ONLY PODS, RDMA]

  46. BARE METAL
    Special Resource Operator
    Enable special resources the OpenShift way
    - Running the NV flowers demo on OpenShift
      https://www.youtube.com/watch?v=TFP0oLG-ss8&feature=youtu.be
    - Running RAPIDS with GPUs on OpenShift
      https://www.youtube.com/watch?v=usV_STdcMHY&feature=youtu.be
    - How to use GPUs with OKD 4.5
      https://bit.ly/3gA73M5
    Future Work: MIG Support, GPUDirect,

  47. Thank you
    Red Hat is the world’s leading provider of enterprise open source software solutions. Award-winning support, training, and consulting services make Red Hat a trusted adviser to the Fortune 500.
    linkedin.com/company/red-hat
    youtube.com/user/RedHatVideos
    facebook.com/redhatinc
    twitter.com/RedHat
