Using GPUs in OpenStack / OpenShift

Slide 3

© 2019 SWITCH | VSHN
• Spelled "Vision" – The DevOps Company
• Located in Zürich
• Founded 2014 by ETHZ alumni
• Currently 33 VSHNeers
• First Kubernetes Certified Provider in Switzerland
• Authorized Docker Consulting Partner

Slide 4

SWITCHengines

Customer-tailored computing and storage performance for universities, research and teaching. It was further developed in the SCALE-UP project mandated by swissuniversities.

Your benefits
• Your data in Switzerland
• Integrated network and security
• Support for academic use cases
• Simple administration and billing
• Created together with you

Customers
• Universities
• Research institutions
• eLearning Center
• University hospitals
• Spin-Offs

Services
• SWITCHengines (IaaS)
• Virtual Private Cloud (VPC)
• SCALE-UP (academic project)

Slide 5

Agenda
• State of SWITCHengines
• Let's get technical!
  – technical details of GPUs in OpenStack VMs
  – technical details about exposing said GPUs to containers
• Use Cases
• Where are we going with this?

Slide 6

SWITCHengines numbers (as of 16.5.2019)

Datacenters in Zurich and Lausanne
• CPU cores: 3748 (physical cores)
• Memory: ~30 TB
• Storage:
  – ~6 PB (Ceph SATA) / ~1100 disks
  – ~100 TB (Ceph SSD) / 50 NVMe
• GPU:
  – 8 Titan XP
  – 16 T4
  – 34 P100
• Network:
  – Dual 10 Gbit/s / upgrading to 100 Gbit/s (Q2 2019)
  – L2 tunnel to campus networks (VPC)

Slide 7

SWITCHengines users
• Education
  – Hundreds of users at universities of applied science (classroom)
  – Specialised training (Bioinformatics)
  – Bachelor / Masters projects
• Research
  – Across universities
  – SDSC
• Enterprise
  – business continuity
  – off site storage
  – datacenter migrations

Slide 9

Community demand
• Storage
  – Long term storage of scientific data (LTS Project)
    • 5-10+ years, lukewarm storage, S3 interface
    • off-site, regulatory demands, "Vault"
    • Initial procurement of 3 PB underway
    • Paid service in 2020
    • Contact: [email protected]
  – Backups etc.
    • increasing demand from many customers
• GPU
  – that's why we are here today

Slide 14

Setting up GPUs on OpenStack

What do we have installed?

    # lspci | grep -i nvidia
    3b:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    5e:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    86:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    af:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
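The lspci output above only shows the raw PCI device ID (1eb8), not a product name. As a minimal sketch, assuming the "NVIDIA Corporation Device xxxx" line format shown on the slide and using the device IDs from the nova aliases in this talk, the inventory could be summarised in Python like this. The `KNOWN_GPUS` table and the function name are illustrative and not part of any tool.

```python
import re

# Device IDs copied from the pci_alias entries in this talk (assumption:
# this table is only as complete as those slides).
KNOWN_GPUS = {
    "1eb8": "T4",
    "15f8": "P100",
    "15f7": "P100-12GB",
    "1b02": "TitanXpVGA",
}

# Matches lines such as:
#   3b:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
# Lines where lspci prints a product name instead of "Device xxxx"
# are not handled by this sketch.
LSPCI_LINE = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]) .*"
    r"NVIDIA Corporation Device (?P<dev>[0-9a-f]{4})"
)


def count_gpus(lspci_output):
    """Return a dict mapping GPU name to the number of matching devices."""
    counts = {}
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match:
            dev = match.group("dev")
            name = KNOWN_GPUS.get(dev, "unknown:" + dev)
            counts[name] = counts.get(name, 0) + 1
    return counts


SAMPLE = """\
3b:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
5e:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
86:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
af:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
"""

if __name__ == "__main__":
    print(count_gpus(SAMPLE))
```

For the host shown on the slide this reports four T4 cards.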

Slide 15

Setting up GPUs on OpenStack

Make them known to nova

/etc/nova/nova.conf

    pci_passthrough_whitelist={"vendor_id":"10de"}
    pci_alias={"name":"P100","vendor_id":"10de","product_id":"15f8"}
    pci_alias={"name":"P100-12GB","vendor_id":"10de","product_id":"15f7"}
    pci_alias={"name":"TitanXpVGA","vendor_id":"10de","product_id":"1b02"}
    pci_alias={"name":"TitanXpAudio","vendor_id":"10de","product_id":"10ef"}
    pci_alias={"name":"T4","vendor_id":"10de","device_type":"type-PF","product_id":"1eb8"}
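Each `pci_alias` value is a small JSON object, so the lines can be generated rather than typed by hand. The following is only a sketch: the helper name `pci_alias_line` is hypothetical, and the option name and vendor ID are simply the ones shown on the slide.

```python
import json

VENDOR_NVIDIA = "10de"  # vendor ID used in the nova.conf shown above


def pci_alias_line(name, product_id, device_type=None):
    """Build one pci_alias line in the format used on the slide."""
    alias = {"name": name, "vendor_id": VENDOR_NVIDIA, "product_id": product_id}
    if device_type is not None:
        # Only set when needed, as for the T4 entry ("type-PF").
        alias["device_type"] = device_type
    # Compact separators reproduce the whitespace-free style of the slide.
    return "pci_alias=" + json.dumps(alias, separators=(",", ":"))


if __name__ == "__main__":
    print(pci_alias_line("P100", "15f8"))
    print(pci_alias_line("T4", "1eb8", device_type="type-PF"))
```

The key order inside the JSON object may differ from the slide for the T4 entry; the keys and values are the same.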

Slide 16

Add the PciPassthroughFilter to the scheduler

/etc/nova/nova.conf

    scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,AggregateImageOsDistroIsolation,AggregateInstanceExtraSpecsFilter,ComputeCapabilitiesFilter,PciPassthroughFilter
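A quick sanity check that the filter really is enabled can be sketched as follows. It assumes the `scheduler_default_filters` option name from the slide and that the value sits on a single line; the function name is made up for this example.

```python
def has_pci_filter(nova_conf_text):
    """Return True if PciPassthroughFilter is listed in scheduler_default_filters."""
    for line in nova_conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "scheduler_default_filters":
            filters = [name.strip() for name in value.split(",")]
            return "PciPassthroughFilter" in filters
    # Option not present at all.
    return False


if __name__ == "__main__":
    sample = "scheduler_default_filters=RetryFilter,ComputeFilter,PciPassthroughFilter"
    print(has_pci_filter(sample))
```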

Slide 17

Create new flavors

    openstack flavor create --private \
      --ram 94208 --disk 30 --vcpus 8 \
      --property pci_passthrough:alias='T4:2' \
      g1.c08r92-2t4

    openstack flavor create --private \
      --ram 47104 --disk 30 --vcpus 4 \
      --property pci_passthrough:alias='T4:1' \
      g1.c04r46-1t4
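The two flavor names appear to encode vCPUs, RAM in GiB, and the GPU count and type (94208 MB / 1024 = 92, 47104 MB / 1024 = 46). Under that assumption, which is inferred from the two examples rather than stated in the talk, the command line could be built like this. The function name is hypothetical; only flags shown on the slide are used.

```python
import shlex


def gpu_flavor_command(vcpus, ram_mb, disk_gb, gpu_alias, gpu_count):
    """Return the argument list for an `openstack flavor create` call."""
    ram_gb = ram_mb // 1024
    # Assumed naming scheme: g1.c<vcpus>r<ram GiB>-<count><alias in lower case>
    name = f"g1.c{vcpus:02d}r{ram_gb}-{gpu_count}{gpu_alias.lower()}"
    return [
        "openstack", "flavor", "create", "--private",
        "--ram", str(ram_mb),
        "--disk", str(disk_gb),
        "--vcpus", str(vcpus),
        "--property", f"pci_passthrough:alias={gpu_alias}:{gpu_count}",
        name,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them.
    print(shlex.join(gpu_flavor_command(8, 94208, 30, "T4", 2)))
    print(shlex.join(gpu_flavor_command(4, 47104, 30, "T4", 1)))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.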

Slide 18

If you have TitanX GPUs (you shouldn't)

NVIDIA doesn't want you to run them virtualized
• #459753 adds support for a img_hide_hypervisor_id image property. This is included in OpenStack Pike and above.
• #555861 adds support for a hide_hypervisor_id flavor property. This is included in OpenStack Rocky and above.

    openstack flavor create --private \
      --ram 47104 --disk 30 --vcpus 4 \
      --property pci_passthrough:alias='TitanXpVGA:1,TitanXpAudio:1' \
      --property hide_hypervisor_id=true \
      g1.c04r46-1titanxp

Slide 20

OpenStack & OpenShift
• Working with Appuio to deliver an OpenShift service on SWITCHengines
• Installation up and running, used for internal / external tests and some production deployments

Slide 24

Sample YAML Pod Definition

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU
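The same pod definition can be produced programmatically, for example when the GPU count varies per job. This is a sketch using only the standard library; it emits JSON, which `kubectl create -f` and `oc create -f` accept as well as YAML. The helper name `gpu_pod_manifest` is hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Return a Pod manifest (as a dict) requesting `gpus` NVIDIA GPUs."""
    if isinstance(gpus, bool) or not isinstance(gpus, int) or gpus < 1:
        raise ValueError("gpus must be a whole number >= 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested through the limits section only.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("cuda-vector-add", "k8s.gcr.io/cuda-vector-add:v0.1")
    print(json.dumps(manifest, indent=2))
```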

Slide 25

Limitations
• GPUs can only be specified in the `limits` section
  – You can specify `limits` without `requests` (Kubernetes then uses the limit as the request)
  – If both are given, `limits` and `requests` must be equal
  – You cannot specify `requests` without `limits`
• GPUs cannot be shared across containers and pods
• Each container can request one or more whole GPUs
  – No "fractions" of GPUs
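The rules above can be expressed as a small pre-flight check on a container's `resources` section. This is an illustrative sketch, not how Kubernetes itself validates pods; the function name and messages are made up for this example.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def check_gpu_resources(resources):
    """Return a list of problems with the GPU settings; empty means OK."""
    limit = resources.get("limits", {}).get(GPU_RESOURCE)
    request = resources.get("requests", {}).get(GPU_RESOURCE)
    problems = []

    # Rule: requests without limits is not allowed.
    if request is not None and limit is None:
        problems.append("requests set without limits")
    # Rule: if both are present they must be equal.
    if request is not None and limit is not None and request != limit:
        problems.append("requests and limits differ")
    # Rule: no fractions of GPUs.
    for label, value in (("limits", limit), ("requests", request)):
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            problems.append(label + " must be a whole, non-negative number of GPUs")
    return problems


if __name__ == "__main__":
    print(check_gpu_resources({"limits": {GPU_RESOURCE: 1}}))
    print(check_gpu_resources({"requests": {GPU_RESOURCE: 1}}))
    print(check_gpu_resources({"limits": {GPU_RESOURCE: 0.5}}))
```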

Slide 27

Official NVIDIA GPU Plugin
• NVIDIA drivers must be pre-installed on nodes
  – https://github.com/NVIDIA/nvidia-docker
• nvidia-container-runtime must be configured as the default runtime for docker instead of runc
  – https://github.com/NVIDIA/k8s-device-plugin
• NVIDIA drivers ~= 361.93

Slide 28

Google Cloud Engine NVIDIA Plugin
• Does not require nvidia-docker
• Compatible with any CRI
• Installation

    kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml

More info: https://github.com/GoogleCloudPlatform/container-engine-accelerators

Slide 30

OpenShift 3.10 NVIDIA Driver Installation

    yum install kernel-devel-`uname -r`
    yum install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-9.2.88-1.x86_64.rpm
    yum -y install xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel
    modprobe -r nouveau
    nvidia-modprobe && nvidia-modprobe -u

Slide 31

OpenShift 3.10 with the GPU Device Plugin

    oc new-project nvidia
    oc create serviceaccount nvidia-deviceplugin
    oc create -f nvidia-deviceplugin-scc.yaml
    # NODE_NAME is a placeholder for the node that has the GPU
    oc label node NODE_NAME openshift.com/gpu-accelerator=true

Slide 32

Deploy the NVIDIA Device Plugin Daemonset

    oc create -f nvidia-deviceplugin.yaml

    oc get pods
    NAME                                   READY   STATUS    RESTARTS   AGE
    nvidia-device-plugin-daemonset-s9ngg   1/1     Running   0          1m

    oc describe node | egrep 'Capacity|Allocatable|gpu'
    Capacity:
     nvidia.com/gpu: 2
    Allocatable:
     nvidia.com/gpu: 2
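Instead of grepping the human-readable output, the same numbers can be read from the node objects in JSON form (for example the output of `oc get nodes -o json`). The sketch below parses such a document with the standard library; the sample data and node names are invented for illustration.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_counts(node_list_json):
    """Map node name to (capacity, allocatable) for nodes that expose GPUs."""
    result = {}
    for node in json.loads(node_list_json)["items"]:
        status = node.get("status", {})
        # Resource quantities are serialized as strings, e.g. "2".
        capacity = int(status.get("capacity", {}).get(GPU_RESOURCE, 0))
        allocatable = int(status.get("allocatable", {}).get(GPU_RESOURCE, 0))
        if capacity:
            result[node["metadata"]["name"]] = (capacity, allocatable)
    return result


# Hypothetical sample resembling a node list with one GPU node.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "gpu-node-1"},
            "status": {
                "capacity": {"cpu": "8", GPU_RESOURCE: "2"},
                "allocatable": {"cpu": "8", GPU_RESOURCE: "2"},
            },
        },
        {
            "metadata": {"name": "cpu-node-1"},
            "status": {"capacity": {"cpu": "4"}, "allocatable": {"cpu": "4"}},
        },
    ]
})

if __name__ == "__main__":
    print(gpu_counts(SAMPLE))
```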

Slide 33

Deploy a pod that requires a GPU

    oc create -f cuda-vector-add.yaml

    oc get pods
    NAME                                   READY   STATUS      RESTARTS   AGE
    cuda-vector-add                        0/1     Completed   0          3s
    nvidia-device-plugin-daemonset-s9ngg   1/1     Running     0          9m

    oc logs cuda-vector-add
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done

Slide 42

Use Cases (that we know of)

SDSC
• Various Machine Learning / Deep Learning projects
• GPUs available via https://renkulab.io/ (on containers, implemented by SDSC) or on VMs

FHNW
• Various Master projects

Various test instances

Slide 43

Audience participation

Need for GPUs in cloud?
Usage patterns?
Desired delivery? (VM, Container)
K8s or OpenShift?
Should we continue building it?
Will you use it?

Discreet Records [Public domain]

Slide 44

Service announcements
• GPUs available
  – TitanXP: CHF 0.50 / hour
  – Tesla T4: CHF 0.75 / hour
  – P100: CHF 1.00 / hour
  + cost of VM (discounts with higher usage)
• Available to VM projects in ZH Region
• OpenShift powered by Appuio and OpenStack
  – Service in 2020