Using GPUs in OpenStack / OpenShift

© 2019 SWITCH
HPC Forum Basel, 16.5.2019
Jens-Christian Fischer (@jcfischer)
Adrian Kosmaczewski (@akosma)

VSHN – The DevOps Company
• Spelled "Vision"
• Located in Zürich
• Founded 2014 by ETHZ alumni
• Currently 33 VSHNeers
• First Kubernetes Certified Provider in Switzerland
• Authorized Docker Consulting Partner

SWITCHengines
Customer-tailored computing and storage performance for universities, research and teaching. It was further developed in the SCALE-UP project mandated by swissuniversities.
Your benefits
• Your data in Switzerland
• Integrated network and security
• Support for academic use cases
• Simple administration and billing
• Created together with you
Customers
• Universities
• Research institutions
• eLearning Center
• University hospitals
• Spin-offs
Services
• SWITCHengines (IaaS)
• Virtual Private Cloud (VPC)
• SCALE-UP (academic project)

Agenda
• State of SWITCHengines
• Let's get technical!
  – Technical details of GPUs in OpenStack VMs
  – Technical details about exposing said GPUs to containers
• Use cases
• Where are we going with this?

SWITCHengines numbers (as of 16.5.2019)
Datacenters in Zurich and Lausanne
• CPU cores: 3748 (physical cores)
• Memory: ~30 TB
• Storage:
  – ~6 PB (Ceph SATA) / ~1100 disks
  – ~100 TB (Ceph SSD) / 50 NVMe
• GPU: 8 Titan XP, 16 T4, 34 P100
• Network:
  – Dual 10 Gbit/s, upgrading to 100 Gbit/s (Q2 2019)
  – L2 tunnel to campus networks (VPC)

SWITCHengines users
• Education
  – Hundreds of users at universities of applied science (classroom)
  – Specialised training (Bioinformatics)
  – Bachelor / Masters projects
• Research
  – Across universities
  – SDSC
• Enterprise
  – Business continuity
  – Off-site storage
  – Datacenter migrations

Funding options
• Pay-as-you-go
• Multi-year contracts (with discounts)
• SNF and Innosuisse projects qualify for funding

Community demand
• Storage
  – Long-term storage of scientific data (LTS Project)
    • 5-10+ years, lukewarm storage, S3 interface
    • Off-site, regulatory demands, "Vault"
    • Initial procurement of 3 PB underway
    • Paid service in 2020
  – Backups etc.
    • Increasing demand from many customers
• GPU
  – That's why we are here today

Τὶς γλαῦκ᾿ Ἀθήναζε [ἐκόμισε]
("Who [brought] an owl to Athens?")
Aristophanes, ca. 400 BC

Setting up GPUs on OpenStack
Enable IOMMU on host
/etc/default/grub
    GRUB_CMDLINE_LINUX="console=tty1 console=ttyS1,115200n8 consoleblank=0 intel_iommu=
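The GRUB line on the slide is cut off after `intel_iommu=`, so its value is not reproduced here. After regenerating the GRUB configuration and rebooting, one way to confirm that the parameter reached the kernel is to look for it on the kernel command line. The following is a minimal sketch; the function name `iommu_param` is hypothetical, and it takes the command line as a string so it can be tried without a GPU host.

```shell
# Hypothetical helper: print the value of intel_iommu= found in a kernel
# command line string; return non-zero if the parameter is absent.
iommu_param() {
  printf '%s\n' "$1" | awk '
    {
      for (i = 1; i <= NF; i++)
        if (index($i, "intel_iommu=") == 1) {
          print substr($i, length("intel_iommu=") + 1)
          found = 1
          exit
        }
    }
    END { exit (found ? 0 : 1) }'
}
```

On a real host this could be called as `iommu_param "$(cat /proc/cmdline)"`.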

Setting up GPUs on OpenStack
What do we have installed?
    # lspci | grep -i nvidia
    3b:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    5e:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    86:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    af:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
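The device ID `1eb8` in this output is the same value used as `product_id` for the T4 alias in the nova configuration. As a rough sketch, the output can be summarised into a count per device ID. The function name `count_nvidia_devices` is hypothetical, and the parsing assumes the "Device <id>" wording shown on the slide. Hosts with a newer PCI ID database print a product name instead, in which case `lspci -nn` gives the numeric IDs in brackets.

```shell
# Hypothetical helper: read `lspci` output on stdin and print
# "<device id> <count>" for each NVIDIA line of the form shown on the slide.
count_nvidia_devices() {
  awk 'tolower($0) ~ /nvidia/ {
         for (i = 1; i < NF; i++)
           if ($i == "Device") { n[$(i + 1)]++; break }
       }
       END { for (id in n) print id, n[id] }' | sort
}
```

Usage on a host would be `lspci | count_nvidia_devices`; for the four lines on the slide it reports one ID with a count of 4.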

Setting up GPUs on OpenStack
Make them known to nova
/etc/nova/nova.conf
    pci_passthrough_whitelist={"vendor_id":"10de"}
    pci_alias={"name":"P100","vendor_id":"10de","product_id":"15f8"}
    pci_alias={"name":"P100-12GB","vendor_id":"10de","product_id":"15f7"}
    pci_alias={"name":"TitanXpVGA","vendor_id":"10de","product_id":"1b02"}
    pci_alias={"name":"TitanXpAudio","vendor_id":"10de","product_id":"10ef"}
    pci_alias={"name":"T4","vendor_id":"10de","device_type":"type-PF","product_id":"1eb8"}

Add PciPassthroughFilter to the scheduler filters
/etc/nova/nova.conf
    scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,AggregateImageOsDistroIsolation,AggregateInstanceExtraSpecsFilter,ComputeCapabilitiesFilter,PciPassthroughFilter

Create new flavors
    openstack flavor create --private \
      --ram 94208 --disk 30 --vcpus 8 \
      --property pci_passthrough:alias='T4:2' \
      g1.c08r92-2t4
    openstack flavor create --private \
      --ram 47104 --disk 30 --vcpus 4 \
      --property pci_passthrough:alias='T4:1' \
      g1.c04r46-1t4
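The two flavor names appear to encode the vCPU count, the RAM in GiB and the GPU count and type: `g1.c08r92-2t4` is 8 vCPUs, 94208 MB / 1024 = 92 GiB and 2 × T4. That reading is inferred from the examples, not stated on the slide. Under that assumption, a small sketch can derive the name and print the matching command. Both function names are hypothetical and only single-alias flavors are handled.

```shell
# Hypothetical helper: build a flavor name in the pattern seen on the slide,
# e.g. 8 vCPUs, 94208 MB RAM, 2 x T4 -> g1.c08r92-2t4.
# Pass the vCPU count without a leading zero.
gpu_flavor_name() {
  vcpus=$1 ram_mb=$2 count=$3 gpu_alias=$4
  gpu=$(printf '%s' "$gpu_alias" | tr '[:upper:]' '[:lower:]')
  printf 'g1.c%02dr%d-%d%s\n' "$vcpus" "$((ram_mb / 1024))" "$count" "$gpu"
}

# Print (do not run) the matching `openstack flavor create` command.
gpu_flavor_command() {
  name=$(gpu_flavor_name "$1" "$2" "$3" "$4")
  printf "openstack flavor create --private --ram %d --disk 30 --vcpus %d \
--property pci_passthrough:alias='%s:%d' %s\n" "$2" "$1" "$4" "$3" "$name"
}
```

For example, `gpu_flavor_command 4 47104 1 T4` prints the second command from the slide on a single line.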

If you have TitanX GPUs (you shouldn't)
NVIDIA doesn't want you to run them virtualized
• #459753 adds support for an img_hide_hypervisor_id image property. This is included in OpenStack Pike and above.
• #555861 adds support for a hide_hypervisor_id flavor property. This is included in OpenStack Rocky and above.
    openstack flavor create --private \
      --ram 47104 --disk 30 --vcpus 4 \
      --property pci_passthrough:alias='TitanXpVGA:1,TitanXpAudio:1' \
      --property hide_hypervisor_id=true \
      g1.c04r46-1titanxp

OpenStack & OpenShift
• Working with Appuio to deliver an OpenShift service on SWITCHengines
• Installation up and running, used for internal / external tests and some production deployments

OpenStack & OpenShift & GPU
A match made in heaven?

Kubernetes & GPUs
• Experimental support
  – NVIDIA GPU in v1.6
  – Device plugin since v1.9
• Requires installation of GPU drivers
  – NVIDIA: nvidia.com/gpu
  – AMD: amd.com/gpu

Sample YAML Pod Definition
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU

Limitations
• GPUs are only specified in the `limits` section
  – Can specify `limits` without `requests`
  – If both are given, `limits` and `requests` must be equal
  – Cannot specify `requests` without `limits`
• GPUs cannot be shared across containers and pods
• Each container can request one or more whole GPUs
  – No "fractions" of GPUs
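The rules above (GPU count under `limits` only, whole GPUs only) can be sketched as a small generator that refuses fractional or zero counts before any manifest is printed. The function name `gpu_pod_manifest` and its argument order are hypothetical; the manifest layout mirrors the sample Pod definition from the deck.

```shell
# Hypothetical helper: print a Pod manifest that asks for a whole number of
# NVIDIA GPUs via `limits` only. Fractions, zero and empty values are rejected.
gpu_pod_manifest() {
  name=$1 image=$2 gpus=$3
  case "$gpus" in
    ''|*[!0-9]*|0*)
      echo "GPU count must be a positive integer: '$gpus'" >&2
      return 1 ;;
  esac
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $name
spec:
  restartPolicy: OnFailure
  containers:
    - name: $name
      image: "$image"
      resources:
        limits:
          nvidia.com/gpu: $gpus
EOF
}
```

The output could be piped straight into the cluster, for example `gpu_pod_manifest cuda-vector-add k8s.gcr.io/cuda-vector-add:v0.1 1 | oc create -f -`.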

AMD GPU Plugin
• Must be preinstalled in nodes
• Install with:
    kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.10/k8s-ds-amdgpu-dp.yaml

Official NVIDIA GPU Plugin
• NVIDIA drivers must be pre-installed in nodes
  – https://github.com/NVIDIA/nvidia-docker
• nvidia-container-runtime must be configured as the default runtime for docker instead of runc
  – https://github.com/NVIDIA/k8s-device-plugin
• NVIDIA drivers ~= 361.93

Google Cloud Engine NVIDIA Plugin
• Does not require nvidia-docker
• Compatible with any CRI
• Installation
    kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
More info: https://github.com/GoogleCloudPlatform/container-engine-accelerators

OpenShift
• NVIDIA and Red Hat working closely
• Preview in 3.9
• GA in 3.10

OpenShift 3.10 NVIDIA Driver Installation
    yum install kernel-devel-`uname -r`
    yum install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-9.2.88-1.x86_64.rpm
    yum -y install xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel
    modprobe -r nouveau
    nvidia-modprobe && nvidia-modprobe -u

OpenShift 3.10 with the GPU Device Plugin
    oc new-project nvidia
    oc create serviceaccount nvidia-deviceplugin
    oc create -f nvidia-deviceplugin-scc.yaml
    oc label node <node> openshift.com/gpu-accelerator=true

Deploy the NVIDIA Device Plugin Daemonset
    $ oc create -f nvidia-deviceplugin.yaml
    $ oc get pods
    NAME                                   READY   STATUS    RESTARTS   AGE
    nvidia-device-plugin-daemonset-s9ngg   1/1     Running   0          1m
    $ oc describe node <node> | egrep 'Capacity|Allocatable|gpu'
    Capacity:
     nvidia.com/gpu:  2
    Allocatable:
     nvidia.com/gpu:  2
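The `egrep` filter above is meant for reading by eye. For scripting, a sketch like the following pulls out only the allocatable GPU count. It assumes that `oc describe node` prints section headers such as `Allocatable:` at the start of a line with indented `key: value` lines below them, which matches the output on the slide. The function name `allocatable_gpus` is hypothetical.

```shell
# Hypothetical helper: read `oc describe node <node>` output on stdin and
# print the nvidia.com/gpu value from the Allocatable section.
# Returns non-zero if no such entry is found.
allocatable_gpus() {
  awk '{ c = substr($0, 1, 1) }
       /^Allocatable:/        { in_alloc = 1; next }
       c != " " && c != "\t"  { in_alloc = 0 }   # any other top-level header
       in_alloc && $1 == "nvidia.com/gpu:" { print $2; found = 1; exit }
       END { exit (found ? 0 : 1) }'
}
```

Usage would be `oc describe node <node> | allocatable_gpus`.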

Deploy a pod that requires a GPU
    $ oc create -f cuda-vector-add.yaml
    $ oc get pods
    NAME                                   READY   STATUS      RESTARTS   AGE
    cuda-vector-add                        0/1     Completed   0          3s
    nvidia-device-plugin-daemonset-s9ngg   1/1     Running     0          9m
    $ oc logs cuda-vector-add
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
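The "196 blocks of 256 threads" in the log follows from rounding 50000 / 256 up to the next whole block, so that every element is covered by a thread. A short worked check, with a hypothetical function name:

```shell
# Blocks needed to cover all elements: ceil(elements / threads_per_block),
# computed with integer arithmetic.
blocks_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

blocks_needed 50000 256   # prints 196
```

195 blocks would cover only 195 × 256 = 49920 elements, so a 196th block is needed for the remaining 80.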

GPU Example – Prostate Segmentation
Source: https://www.redhat.com/files/summit/session-assets/2018/S1413-Medical-image-processing-with-OpenShift-and-OpenStack-Distribution.pdf

GPU Example – Monte Carlo
Source: https://www.redhat.com/files/summit/session-assets/2018/S1413-Medical-image-processing-with-OpenShift-and-OpenStack-Distribution.pdf

Creating a (self) service
• Giving access to special flavors
• Showing current availability
• Creating billing records

Billing
• Updating our current billing software (home-grown Java)
• Working with vendor of new billing software (cyclops-labs.io)

Use Cases (that we know of)
SDSC
• Various Machine Learning / Deep Learning projects
• GPUs available via https://renkulab.io/ (on containers, implemented by SDSC) or on VMs
FHNW
• Various Master projects
Various test instances

Audience participation
• Need for GPUs in cloud?
• Usage patterns?
• Desired delivery? (VM, Container)
• K8s or OpenShift?
• Should we continue building it?
• Will you use it?

Service announcements
• GPUs available
  – TitanXP: CHF 0.50 / hour
  – Tesla T4: CHF 0.75 / hour
  – P100: CHF 1.00 / hour
  + cost of VM (discounts with higher usage)
• Available to VM projects in ZH Region
• OpenShift powered by Appuio and OpenStack
  – Service in 2020
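As a rough way to budget with these rates, the GPU part of the bill is rate × hours × number of GPUs. The sketch below covers only that part: the VM cost and usage discounts mentioned on the slide are not given as numbers, so they are left out. The function name `gpu_cost_chf` is hypothetical.

```shell
# Hypothetical helper: GPU cost in CHF for an hourly rate, a number of hours
# and a number of GPUs. VM cost and usage discounts are NOT included.
gpu_cost_chf() {
  LC_ALL=C awk -v rate="$1" -v hours="$2" -v gpus="$3" \
    'BEGIN { printf "%.2f\n", rate * hours * gpus }'
}

gpu_cost_chf 0.75 24 2   # two T4s for one day
```

For two T4s over 24 hours this gives 0.75 × 24 × 2 = CHF 36.00.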

Working for a better digital world
