
Using GPUs in OpenStack / OpenShift

Presentation given at the HPC Forum Basel together with Jens-Christian Fischer (SWITCH).

Adrian Kosmaczewski

May 16, 2019

Transcript

  1. © 2019 SWITCH | Using GPUs in OpenStack / OpenShift

    HPC Forum Basel, 16.5.2019
    Jens-Christian Fischer [email protected] @jcfischer
    Adrian Kosmaczewski [email protected] @akosma
  2. VSHN

    • Pronounced "Vision" – The DevOps Company
    • Located in Zürich
    • Founded 2014 by ETHZ alumni
    • Currently 33 VSHNeers
    • First Kubernetes Certified Provider in Switzerland
    • Authorized Docker Consulting Partner
  3. SWITCHengines

    Customer-tailored computing and storage performance for universities, research and teaching. It was further developed in the SCALE-UP project mandated by swissuniversities.

    Your benefits
    • Your data in Switzerland
    • Integrated network and security
    • Support for academic use cases
    • Simple administration and billing
    • Created together with you

    Customers
    • Universities
    • Research institutions
    • eLearning Center
    • University hospitals
    • Spin-Offs

    Services
    • SWITCHengines (IaaS)
    • Virtual Private Cloud (VPC)
    • SCALE-UP (academic project)
  4. Agenda

    • State of SWITCHengines
    • Let's get technical!
      – Technical details of GPUs in OpenStack VMs
      – Technical details of exposing those GPUs to containers
    • Use Cases
    • Where are we going with this?
  5. SWITCHengines numbers (as of 16.5.2019)

    Datacenters in Zurich and Lausanne
    • CPU cores: 3748 (physical cores)
    • Memory: ~30 TB
    • Storage:
      – ~6 PB (Ceph SATA) / ~1100 disks
      – ~100 TB (Ceph SSD) / 50 NVMe
    • GPU: 8 Titan XP, 16 T4, 34 P100
    • Network:
      – Dual 10 Gb/s, upgrading to 100 Gb/s (Q2 2019)
      – L2 tunnel to campus networks (VPC)
  6. SWITCHengines users

    • Education
      – Hundreds of users at universities of applied sciences (classroom)
      – Specialised training (Bioinformatics)
      – Bachelor / Master projects
    • Research
      – Across universities
      – SDSC
    • Enterprise
      – Business continuity
      – Off-site storage
      – Datacenter migrations
  7. Funding options

    • Pay-as-you-go
    • Multi-year contracts (with discounts)
    • SNF & Innosuisse projects qualify for funding
  8. Community demand

    • Storage
      – Long-term storage of scientific data (LTS Project)
        • 5-10+ years, lukewarm storage, S3 interface
        • Off-site, regulatory demands, "Vault"
        • Initial procurement of 3 PB underway
        • Paid service in 2020
        • Contact: [email protected]
      – Backups etc.
        • Increasing demand from many customers
    • GPU
      – That's why we are here today
  9. Setting up GPUs on OpenStack

    Enable IOMMU on host

    /etc/default/grub
    GRUB_CMDLINE_LINUX="console=tty1 console=ttyS1,115200n8 consoleblank=0 intel_iommu=
  10. Setting up GPUs on OpenStack

    What do we have installed?

    # lspci | grep -i nvidia
    3b:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    5e:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    86:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    af:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
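
The product ID shown by `lspci` (here `1eb8`) is what the nova alias configuration needs. A minimal Python sketch of pulling the PCI address and product ID out of that output; the function name `parse_nvidia_devices` is hypothetical, and the pattern assumes the "NVIDIA Corporation Device XXXX" form that `lspci` prints when it has no marketing name for the card:

```python
import re

# Matches lines such as:
#   3b:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
# Assumption: the device is listed as "Device <hex id>", not by name.
LSPCI_LINE = re.compile(
    r"^(?P<address>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<device_class>[^:]+):\s+NVIDIA Corporation Device\s+"
    r"(?P<product_id>[0-9a-f]{4})\b"
)


def parse_nvidia_devices(lspci_output):
    """Return (pci_address, product_id) pairs for unnamed NVIDIA devices."""
    devices = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match:
            devices.append((match.group("address"), match.group("product_id")))
    return devices


# Sample taken from the slide above.
SAMPLE = """\
3b:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
5e:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
86:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
af:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
"""

if __name__ == "__main__":
    for address, product_id in parse_nvidia_devices(SAMPLE):
        print(address, product_id)
```

Lines that name the card instead of printing a bare device ID are skipped by this pattern, so it is only a starting point.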
  11. Setting up GPUs on OpenStack

    Make them known to nova

    /etc/nova/nova.conf
    pci_passthrough_whitelist={"vendor_id":"10de"}
    pci_alias={"name":"P100","vendor_id":"10de","product_id":"15f8"}
    pci_alias={"name":"P100-12GB","vendor_id":"10de","product_id":"15f7"}
    pci_alias={"name":"TitanXpVGA","vendor_id":"10de","product_id":"1b02"}
    pci_alias={"name":"TitanXpAudio","vendor_id":"10de","product_id":"10ef"}
    pci_alias={"name":"T4","vendor_id":"10de","device_type":"type-PF","product_id":"1eb8"}
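
Each `pci_alias` value is a small JSON object, so the lines can be generated rather than typed by hand. A minimal sketch, assuming all cards share the NVIDIA vendor ID `10de` as on the slide; the helper name `pci_alias_line` is hypothetical:

```python
import json

NVIDIA_VENDOR_ID = "10de"  # vendor ID used in every alias on the slide


def pci_alias_line(name, product_id, device_type=None):
    """Build one pci_alias line in the compact JSON form shown on the slide."""
    alias = {"name": name, "vendor_id": NVIDIA_VENDOR_ID}
    if device_type is not None:
        # The slide sets device_type only for the T4 alias.
        alias["device_type"] = device_type
    alias["product_id"] = product_id
    # separators=(",", ":") removes the spaces json.dumps adds by default.
    return "pci_alias=" + json.dumps(alias, separators=(",", ":"))


if __name__ == "__main__":
    print(pci_alias_line("P100", "15f8"))
    print(pci_alias_line("T4", "1eb8", device_type="type-PF"))
```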
  12. Create new flavors

    openstack flavor create --private \
      --ram 94208 --disk 30 --vcpus 8 \
      --property pci_passthrough:alias='T4:2' \
      g1.c08r92-2t4

    openstack flavor create --private \
      --ram 47104 --disk 30 --vcpus 4 \
      --property pci_passthrough:alias='T4:1' \
      g1.c04r46-1t4
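
The flavor names appear to encode their resources: `c08` for 8 vCPUs, `r92` for 92 GiB of RAM (94208 MB / 1024) and `2t4` for two T4 cards. That reading is inferred from the two examples and is not stated on the slide. A minimal sketch that derives the name and the command arguments from the numbers, with hypothetical helper names:

```python
def flavor_name(vcpus, ram_mb, gpu_count, gpu_label, family="g1"):
    """Derive a name like g1.c08r92-2t4 (naming scheme inferred, not documented)."""
    ram_gib = ram_mb // 1024
    return f"{family}.c{vcpus:02d}r{ram_gib}-{gpu_count}{gpu_label}"


def flavor_create_args(vcpus, ram_mb, disk_gb, gpu_alias, gpu_count):
    """Return the argument list for the 'openstack flavor create' call on the slide."""
    return [
        "openstack", "flavor", "create", "--private",
        "--ram", str(ram_mb), "--disk", str(disk_gb), "--vcpus", str(vcpus),
        "--property", f"pci_passthrough:alias={gpu_alias}:{gpu_count}",
        flavor_name(vcpus, ram_mb, gpu_count, gpu_alias.lower()),
    ]


if __name__ == "__main__":
    # Print the command instead of running it; nothing is sent to OpenStack.
    print(" ".join(flavor_create_args(8, 94208, 30, "T4", 2)))
```

Building an argument list instead of a shell string avoids quoting problems if the list is later passed to `subprocess.run`.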
  13. If you have TitanX GPUs (you shouldn't)

    NVIDIA doesn't want you to run them virtualized.

    #459753 adds support for an img_hide_hypervisor_id image property. This is included in OpenStack Pike and above.
    #555861 adds support for a hide_hypervisor_id flavor property. This is included in OpenStack Rocky and above.

    openstack flavor create --private \
      --ram 47104 --disk 30 --vcpus 4 \
      --property pci_passthrough:alias='TitanXpVGA:1,TitanXpAudio:1' \
      --property hide_hypervisor_id=true \
      g1.c04r46-1titanxp
  14. OpenStack & OpenShift

    • Working with Appuio to deliver an OpenShift service on SWITCHengines
    • Installation up and running, used for internal / external tests and some production deployments
  15. Kubernetes & GPUs

    • Experimental support
      – NVIDIA GPU in v1.6
      – Device plugin since v1.9
    • Requires installation of GPU drivers
      – NVIDIA: nvidia.com/gpu
      – AMD: amd.com/gpu
  16. Sample YAML Pod Definition

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU
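
The same pod definition can be built programmatically when the GPU count varies per job. A minimal sketch that produces the manifest as a Python dictionary; the function name `gpu_pod_manifest` is hypothetical, and the field values mirror the slide:

```python
import json


def gpu_pod_manifest(name, image, gpu_count, resource="nvidia.com/gpu"):
    """Build a pod manifest that asks for gpu_count whole GPUs via limits."""
    if not isinstance(gpu_count, int) or gpu_count < 1:
        raise ValueError("gpu_count must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested in the limits section only.
                    "resources": {"limits": {resource: gpu_count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest(
        "cuda-vector-add", "k8s.gcr.io/cuda-vector-add:v0.1", 1
    )
    print(json.dumps(manifest, indent=2))
```

The JSON output is equivalent to the YAML above and can be saved to a file for `kubectl create -f` or `oc create -f`.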
  17. Limitations

    • GPUs only specified in `limits` section
      – Can specify `limits` without `requests`
      – If both are specified, `limits` and `requests` must be equal
      – Cannot specify `requests` without `limits`
    • GPUs cannot be shared across containers and pods
    • Each container can only request whole GPUs (one or more)
      – No "fractions" of GPUs
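
These rules are easy to check before a manifest is submitted. A minimal sketch of such a check, assuming the `resources` section has already been loaded into a dictionary with integer GPU counts; the function name `check_gpu_resources` is hypothetical and this is not how Kubernetes itself validates the spec:

```python
def check_gpu_resources(resources, resource="nvidia.com/gpu"):
    """Return a list of rule violations; an empty list means the spec passes."""
    limit = resources.get("limits", {}).get(resource)
    request = resources.get("requests", {}).get(resource)
    problems = []
    # Rule: requests may not appear without limits.
    if request is not None and limit is None:
        problems.append("requests set without limits")
    # Rule: when both are present they must be equal.
    if request is not None and limit is not None and request != limit:
        problems.append("requests and limits differ")
    # Rule: only whole GPUs, no fractions.
    for label, value in (("limits", limit), ("requests", request)):
        if value is not None and (not isinstance(value, int) or value < 1):
            problems.append(f"{label} must be a whole number of GPUs")
    return problems


if __name__ == "__main__":
    print(check_gpu_resources({"limits": {"nvidia.com/gpu": 1}}))
    print(check_gpu_resources({"requests": {"nvidia.com/gpu": 1}}))
```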
  18. AMD GPU Plugin

    • Must be preinstalled in nodes
    • Install with:

    kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.10/k8s-ds-amdgpu-dp.yaml
  19. Official NVIDIA GPU Plugin

    • NVIDIA drivers must be pre-installed in nodes
      – https://github.com/NVIDIA/nvidia-docker
    • nvidia-container-runtime must be configured as the default runtime for docker instead of runc
      – https://github.com/NVIDIA/k8s-device-plugin
    • NVIDIA drivers ~= 361.93
  20. Google Cloud Engine NVIDIA Plugin

    • Does not require nvidia-docker
    • Compatible with any CRI
    • Installation

    kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml

    More info: https://github.com/GoogleCloudPlatform/container-engine-accelerators
  21. OpenShift

    • NVIDIA and Red Hat working closely together
    • Preview in 3.9
    • GA in 3.10
  22. OpenShift 3.10 NVIDIA Driver Installation

    yum install kernel-devel-`uname -r`
    yum install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-9.2.88-1.x86_64.rpm
    yum -y install xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel
    modprobe -r nouveau
    nvidia-modprobe && nvidia-modprobe -u
  23. OpenShift 3.10 with the GPU Device Plugin

    oc new-project nvidia
    oc create serviceaccount nvidia-deviceplugin
    oc create -f nvidia-deviceplugin-scc.yaml
    oc label node <node> openshift.com/gpu-accelerator=true
  24. Deploy the NVIDIA Device Plugin Daemonset

    oc create -f nvidia-deviceplugin.yaml

    oc get pods
    NAME                                   READY   STATUS    RESTARTS   AGE
    nvidia-device-plugin-daemonset-s9ngg   1/1     Running   0          1m

    oc describe node <node> | egrep 'Capacity|Allocatable|gpu'
    Capacity:
     nvidia.com/gpu: 2
    Allocatable:
     nvidia.com/gpu: 2
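
The filtered `oc describe node` output can be turned into numbers for monitoring or capacity checks. A minimal sketch, assuming the text has the section-header-then-resource layout shown on the slide; the function name `gpu_counts` is hypothetical:

```python
def gpu_counts(describe_output, resource="nvidia.com/gpu"):
    """Extract GPU counts from the Capacity and Allocatable sections."""
    counts = {}
    section = None
    for raw in describe_output.splitlines():
        line = raw.strip()
        if line in ("Capacity:", "Allocatable:"):
            section = line[:-1]  # drop the trailing colon
        elif section and line.startswith(resource + ":"):
            counts[section] = int(line.split(":", 1)[1])
        elif raw and not raw[0].isspace():
            # Any other unindented line ends the current section.
            section = None
    return counts


# Sample modelled on the slide output; the exact spacing is an assumption.
SAMPLE = """\
Capacity:
 nvidia.com/gpu:  2
Allocatable:
 nvidia.com/gpu:  2
"""

if __name__ == "__main__":
    print(gpu_counts(SAMPLE))
```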
  25. Deploy a pod that requires a GPU

    oc create -f cuda-vector-add.yaml

    oc get pods
    NAME                                   READY   STATUS      RESTARTS   AGE
    cuda-vector-add                        0/1     Completed   0          3s
    nvidia-device-plugin-daemonset-s9ngg   1/1     Running     0          9m

    oc logs cuda-vector-add
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
  26. GPU Example – Prostate Segmentation

    Source: https://www.redhat.com/files/summit/session-assets/2018/S1413-Medical-image-processing-with-OpenShift-and-OpenStack-Distribution.pdf
  27. GPU Example – Monte Carlo

    Source: https://www.redhat.com/files/summit/session-assets/2018/S1413-Medical-image-processing-with-OpenShift-and-OpenStack-Distribution.pdf
  28. Creating a (self) service

    • Giving access to special flavors
    • Seeing what the current availability is
    • Creating billing records
  29. Seeing GPU usage & availability

    mysql> select label, status from pci_devices;
    +-----------------+-----------+
    | label           | status    |
    +-----------------+-----------+
    | label_10de_15f8 | available |
    | label_10de_15f8 | allocated |
    | label_10de_10ef | allocated |
    | label_10de_10ef | allocated |
    | label_10de_10ef | allocated |
    | label_10de_10ef | allocated |
    | label_10de_10ef | available |
    | label_10de_10ef | allocated |
    | label_10de_10ef | available |
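
A per-label summary is usually more useful than the raw rows. A minimal sketch of the aggregation, using an in-memory SQLite table as a stand-in for the nova database; the two-column table and the rows are copied from the slide, while the real `pci_devices` table has more columns and the function name `summarize` is hypothetical:

```python
import sqlite3

# Rows as shown on the slide (the slide output may be truncated).
ROWS = [
    ("label_10de_15f8", "available"),
    ("label_10de_15f8", "allocated"),
    ("label_10de_10ef", "allocated"),
    ("label_10de_10ef", "allocated"),
    ("label_10de_10ef", "allocated"),
    ("label_10de_10ef", "allocated"),
    ("label_10de_10ef", "available"),
    ("label_10de_10ef", "allocated"),
    ("label_10de_10ef", "available"),
]


def summarize(rows):
    """Count devices per (label, status) pair with a GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE pci_devices (label TEXT, status TEXT)")
        conn.executemany("INSERT INTO pci_devices VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT label, status, COUNT(*) FROM pci_devices "
            "GROUP BY label, status ORDER BY label, status"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for label, status, count in summarize(ROWS):
        print(label, status, count)
```

The `SELECT ... GROUP BY` statement uses only standard SQL, so it should carry over to the MySQL prompt unchanged.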
  30. Billing

    • Updating our current billing software (home-grown Java)
    • Working with vendor of new billing software (cyclops-labs.io)
  31. Use Cases (that we know of)

    SDSC
    • Various Machine Learning / Deep Learning projects
    • GPUs available via https://renkulab.io/ (on containers, implemented by SDSC) or on VMs

    FHNW
    • Various Master projects

    Various test instances
  32. Audience participation

    • Need for GPUs in cloud?
    • Usage patterns?
    • Desired delivery? (VM, Container)
    • K8s or OpenShift?
    • Should we continue building it?
    • Will you use it?

    Image credit: Discreet Records [Public domain]
  33. Service announcements

    • GPUs available
      – TitanXP: CHF 0.50 / hour
      – Tesla T4: CHF 0.75 / hour
      – P100: CHF 1.00 / hour
      + cost of VM (discounts with higher usage)
    • Available to VM projects in ZH Region
    • OpenShift powered by Appuio and OpenStack
      – Service in 2020
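
The GPU surcharge for a job follows directly from these hourly rates. A minimal sketch of the arithmetic, assuming the rate applies per GPU per hour (the slide only says "/ hour") and leaving out the VM cost and usage discounts, which the slide does not quantify; the function name `gpu_cost_chf` is hypothetical:

```python
from decimal import Decimal

# Hourly rates from the slide, in CHF. Assumption: charged per GPU.
GPU_RATE_CHF_PER_HOUR = {
    "TitanXP": Decimal("0.50"),
    "T4": Decimal("0.75"),
    "P100": Decimal("1.00"),
}


def gpu_cost_chf(gpu_type, gpu_count, hours):
    """GPU surcharge only; VM cost and discounts are not included."""
    rate = GPU_RATE_CHF_PER_HOUR[gpu_type]
    # Decimal(str(...)) avoids binary floating-point noise for inputs like 1.5.
    return rate * gpu_count * Decimal(str(hours))


if __name__ == "__main__":
    # Two T4 cards for ten hours.
    print(gpu_cost_chf("T4", 2, 10))
```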