Slide 1

IO232: Multi-Tenancy Best Practices for Google Kubernetes Engine
Ahmet Alp Balkan, Software Engineer, Google Cloud
Yoshi Tamura, Product Manager, Google Cloud
Thursday, July 26

Slide 2

Who are we?

Ahmet Alp Balkan (@ahmetb)
Software Engineer, Developer Relations

I work on making Kubernetes Engine easier to understand and use for developers and operators, and I write open source tools for Kubernetes. Previously, I worked at Microsoft Azure on porting Docker to Windows and on ACR. I maintain "kubectx".

Slide 3

Who are we?

Yoshi Tamura (@yoshiat)
Product Manager, Kubernetes Engine

I work on multi-tenancy and hardware accelerators (GPU and Cloud TPU) in Kubernetes Engine.

Slide 4

Practical Multi-Tenancy on Kubernetes Engine

The following slides are heavily inspired by the KubeCon EU '18 talk by David Oppenheimer, Software Engineer, Google.

Register your interest at: gke.page.link/multi-tenancy

Slide 5

trust · multi-tenancy modes · isolation · access control · resource usage · scheduling · multi-tenancy features · policy management · preventing contention · billing

Slide 6

0. What is multi-tenancy?

Slide 7

Software multi-tenancy: a single instance of software runs on a server and serves multiple tenants.

Slide 8


Slide 9

Kubernetes multi-tenancy: providing isolation and fair resource sharing between multiple users and their workloads within a cluster.

Slide 10

1. Trust

Slide 11

Do you trust...

● Your compiler*
● Operating system
● Dependencies
● Deployment pipeline
● Container runtime
● ...

* Bonus reading on compilers:
- Reflections on Trusting Trust. Ken Thompson. CACM 27, 8 (August 1984), 761-763.
- Fully Countering Trusting Trust through Diverse Double-Compiling. D. A. Wheeler. PhD thesis, George Mason University, Oct. 2009.

Slide 12

Levels of trust in software multi-tenancy

● Trusted: the code comes from an audited source, built and run by trusted components (a.k.a. "the dream")
● Semi-trusted: trusted code, but with 3rd-party dependencies or software that is not fully audited (a.k.a. most people)
● Non-trusted: the code comes from potentially hostile users; cannot assume good intent (a.k.a. hosting providers)

Slide 13

2. Kubernetes Engine Multi-Tenancy Primitives

Slide 14

Kubernetes: cluster vs. namespace boundary (diagram: two clusters, each containing multiple namespaces)

Slide 15

Cluster per Tenant

Pros:
● Separate control plane (API) for each tenant (for free*)
● Strong network isolation (if it's per-cluster VPC)

However:
● Need tools to manage 10s or 100s of clusters
● Resource/configuration fragmentation across clusters
● Slow turn-up: need to create a cluster for a new tenant

* The Google Kubernetes Engine control plane (master) is free of charge.

Slide 16

Namespace per Tenant (intra-cluster multi-tenancy)

Namespaces provide logical isolation between tenants on a cluster. Kubernetes policies are namespace-scoped.

● Logical isolation between tenants
● Policies for API access restrictions & resource usage constraints

Pros:
● Tenants can reuse extensions/controllers/CRDs
● Shared control plane (= shared ops, shared security/auditing...)
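
A minimal sketch of bootstrapping a tenant as a namespace (the tenant name "team-billing" and its label are illustrative; "kubectl create namespace team-billing" works just as well):

apiVersion: v1
kind: Namespace
metadata:
  name: team-billing      # hypothetical tenant
  labels:
    team: billing         # label that namespace-scoped policies can select on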

Slide 17

Kubernetes Engine primitives (grouped into Access Control, Resource Sharing, and Runtime Isolation): IAM, RBAC, Admission Control, Quotas, Limit Range, Pod Priority, Pod Affinity/Anti-Affinity, Network Policy, Pod Security Policy, Sandbox Pods

Slide 18

3. Use cases of Kubernetes Multi-tenancy

Slide 19

Multi-tenancy use cases in Kubernetes: Enterprise, SaaS (Software as a Service), KaaS (Kubernetes as a Service)

Slide 20

"Enterprise" Model

All users are from the same company/organization. Namespaces ⇔ Tenants ⇔ Teams. Semi-trusted tenants (you can fire them on violation).

Cluster roles:
● Cluster Admin
  ○ CRUD any policy objects
  ○ Create/assign namespaces to "Namespace Admins"
  ○ Manage policies (resource usage quotas, networking)
● Namespace Admin
  ○ Manage users in the namespace(s) they own
● User
  ○ CRUD non-policy objects in the namespace(s) they have access to
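
As an illustration of the "Namespace Admin" persona, a hedged sketch that binds the built-in "admin" ClusterRole inside a single namespace (the namespace and group names are hypothetical):

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: ns1-admins
  namespace: ns1                        # grants access only within ns1
roleRef:
  kind: ClusterRole
  name: admin                           # built-in namespace-admin role
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  name: "ns1-admins@googlegroups.com"   # hypothetical Google group
  apiGroup: rbac.authorization.k8s.io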

Slide 21

"Enterprise" Model

Many apps from different teams, semi-trusted:
● Vanilla container isolation may suffice
● If not: sandboxing with gVisor, limit capabilities, use seccomp/AppArmor/...

Network isolation:
● Allow all traffic within a namespace
● Whitelist traffic from/to other namespaces (= teams)

Slide 22

"Software as a Service" Model

Consumers deploy their apps through a custom control plane (a SaaS API/proxy in front of the Kubernetes apiserver).

Slide 23

"Software as a Service" Model

Consumers deploy their apps through a custom control plane (SaaS API/proxy). After the app is deployed, customers connect to the app directly.

Example: WordPress hosting

Slide 24

"Software as a Service" Model

Consumers deploy their apps through a custom control plane (SaaS API/proxy). After the app is deployed, customers connect to the app directly. Example: WordPress hosting.

The SaaS API is a trusted client of Kubernetes. Cluster admins can access the Kubernetes API directly.

Tenant workloads may have untrusted pieces:
● such as WordPress extensions
● may require sandboxing with gVisor, etc.

Slide 25

"Kubernetes as a Service" Model

Untrusted tenants running untrusted code (Platform as a Service or hosting companies). Tenants may create their own namespaces, but cannot set policy objects.

Stronger isolation requirements than enterprise/SaaS:
● isolated world view (separate control plane)
● tenants must not see each other
● strong node and network isolation
  ○ sandbox pods
  ○ sole-tenant nodes
  ○ multi-tenant networking/DNS


Slide 27

4. Kubernetes Multi-tenancy Policy APIs and Features

Slide 28

Kubernetes Engine multi-tenancy primitives (grouped into Access Control, Resource Sharing, and Runtime Isolation): IAM, RBAC, Admission Control, Quotas, Limit Range, Pod Priority, Pod Affinity, Pod Security Context, Network Policy, Pod Security Policy, Sandbox Pods

Slide 29

Kubernetes Engine multi-tenancy primitives, grouped into auth-related and scheduling-related features: IAM, RBAC, Admission Control, Pod Security Policy, Pod Security Context, Network Policy, Sandbox Pods, Quotas, Limit Range, Pod Priority, Pod Affinity

Slide 30

Auth-related features

Slide 31

Authentication, Authorization, Admission (diagram: a request to the control plane (apiserver) passes through pluggable auth (GKE IAM), the authorizer (Cloud IAM policies and RBAC {Cluster,}Role / {Cluster,}RoleBinding), and admission control before being persisted to etcd)

Slide 32

Kubernetes RBAC: which users/groups/Service Accounts can do which operations on which API resources in which namespaces.

Slide 33

Kubernetes RBAC

Mostly useful for:
● Giving access to pods calling the Kubernetes API (with Kubernetes Service Accounts)
● Giving fine-grained access to people/groups calling the Kubernetes API (with Google accounts)

Concepts:
● ClusterRole: a preset of capabilities, cluster-wide
● Role: like a ClusterRole, but namespace-scoped
● ClusterRoleBinding: gives the permissions of a ClusterRole to Google users/groups, Google Cloud IAM Service Accounts, or Kubernetes Service Accounts
● RoleBinding: like a ClusterRoleBinding, but namespace-scoped
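
To complement the cluster-wide example on the next slide, a minimal namespace-scoped Role + RoleBinding sketch (the namespace, role name, and user are illustrative):

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-reader
  namespace: ns1
rules:
- apiGroups: [""]       # core
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-readers
  namespace: ns1
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: User
  name: "someone@example.com"           # Google user
  apiGroup: rbac.authorization.k8s.io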

Slide 34

Kubernetes RBAC

Example ClusterRole + ClusterRoleBinding for a namespace-creator:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: "namespace-creator"
rules:
- apiGroups: [""]       # core
  resources: ["namespaces"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: "admins:namespace-creator"
roleRef:
  kind: ClusterRole     # a ClusterRoleBinding must reference a ClusterRole
  name: "namespace-creator"
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: User
  name: "ahmetalpbalkan@gmail.com"      # Google user
  apiGroup: rbac.authorization.k8s.io

Slide 35

Kubernetes Engine + Cloud IAM

Practical for giving Google users/groups project-wide access. Curated IAM "roles":
● Admin: can do everything
● Viewer: can view everything
● Cluster Admin: can manage clusters (create/delete/upgrade clusters); cannot view what's in the clusters (Kubernetes API)
● Developer: can do everything in a cluster (Kubernetes API); cannot manage clusters (create/delete/upgrade clusters)

You can curate new ones with Cloud IAM Custom Roles.
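
A hedged sketch of curating such a role with Cloud IAM Custom Roles; the role ID, title, and exact permission set are illustrative:

# Hypothetical read-only-pods custom role for the project
gcloud iam roles create podViewer \
  --project=PROJECT_ID \
  --title="GKE Pod Viewer" \
  --permissions=container.pods.get,container.pods.list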

Slide 36

Kubernetes Engine + IAM

Give someone the "Developer" role on all clusters in the project:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=user:SOMEONE_ELSE@gmail.com \
  --role=roles/container.developer

Give a Google Group the "Viewer" role on all clusters in the project:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=group:SOME_TEAM@googlegroups.com \
  --role=roles/container.viewer

Slide 37

Admission Controls

Intercept API requests before the resource is persisted to etcd. Admission control can mutate and allow/deny requests.

Slide 38

Admission Controls

Admission plugins are compiled into the Kubernetes apiserver binary; the set of enabled admission plugins cannot be changed on Kubernetes Engine. But these 15 admission plugins are already enabled:

Initializers, NamespaceLifecycle, LimitRanger, ServiceAccount, PersistentVolumeLabel, DefaultStorageClass, DefaultTolerationSeconds, NodeRestriction, PodPreset, ExtendedResourceToleration, PersistentVolumeClaimResize, Priority, StorageObjectInUseProtection, MutatingAdmissionWebhook, ValidatingAdmissionWebhook

Slide 39

Extending Admission Controls

You can develop webhooks to create your own admission controllers (via the MutatingAdmissionWebhook and ValidatingAdmissionWebhook plugins), as sketched below.
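
A minimal sketch of registering such a webhook with the admissionregistration.k8s.io/v1beta1 API; the webhook name, service namespace/name, and path are hypothetical, and the caBundle is elided:

apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-policy-webhook
webhooks:
- name: pod-policy.example.com        # hypothetical
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  failurePolicy: Fail                 # reject pods if the webhook is unreachable
  clientConfig:
    service:
      namespace: webhook-system       # hypothetical
      name: pod-policy-svc            # hypothetical
      path: "/validate"
    caBundle: "<base64-encoded CA certificate>"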

Slide 40

PodSecurityPolicy

Restricts access to host {filesystem, network, ports, PID namespace, IPC namespace}... Limits privileged containers and volume types, enforces read-only root filesystem, etc. Enforced through its own admission plugin, which allows/denies each Pod spec against the PodSecurityPolicy spec.

Slide 41

PodSecurityPolicy

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: prevent-root-privileged
spec:
  # Don't allow privileged pods!
  privileged: false
  # Don't allow root containers!
  runAsUser:
    rule: "MustRunAsNonRoot"

Grant "use" of the policy to the users/groups it should apply to:

$ kubectl create role psp:unprivileged \
    --verb=use \
    --resource=podsecuritypolicy \
    --resource-name=prevent-root-privileged

$ kubectl create rolebinding developers:unprivileged \
    --role=psp:unprivileged \
    --group=developers@googlegroups.com \
    --user=ahmetb@example.com

A pod that violates the policy is rejected, e.g.:

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause
    securityContext:
      privileged: true    # REJECTED by the policy

Slide 42

Network Policy

Controls which pods can talk to which other pods (based on their namespace/labels) or IP ranges. Available on Kubernetes Engine with the Calico network plugin (--enable-network-policy).
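
For reference, enabling it at cluster creation might look like this (the cluster name is a placeholder):

gcloud container clusters create example-cluster \
  --enable-network-policy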

Slide 43

Network Policy

Example: allow traffic to "mysql" pods from "frontend" pods.

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: db-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: mysql
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend

Slide 44

Network Policy

Pragmatic recipes at github.com/ahmetb/kubernetes-network-policy-recipes
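
One of the most common recipes for multi-tenancy is to deny all ingress traffic within a tenant namespace by default, then whitelist what is needed — a minimal sketch (the namespace name is illustrative):

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: default-deny-ingress
  namespace: ns1          # applied per tenant namespace
spec:
  podSelector: {}         # selects every pod in the namespace
  policyTypes:
  - Ingress               # no ingress rules listed => all inbound traffic is denied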


Slide 46

Scheduling-related features

Slide 47

Pod Priority/Preemption (beta – Kubernetes 1.11)

Pod Priority: puts high-priority pods waiting in the Pending state at the front of the scheduling queue.

Pod Preemption: evicts lower-priority pod(s) from a node if a high-priority pod cannot be scheduled due to insufficient space/resources in the cluster.

Use PriorityClasses to define priorities:

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: "high"
value: 1000000

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: "normal"
value: 1000
globalDefault: true

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: "low"
value: 10
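
A pod opts into one of these classes through its spec — a minimal sketch (the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: important-pod
spec:
  priorityClassName: "high"   # references the PriorityClass above
  containers:
  - name: app
    image: k8s.gcr.io/pause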

Slide 48

Resource Quotas

Limit the total memory/CPU/storage that pods can use, and how many objects of each type (pods, load balancers, ConfigMaps, etc.) can exist, on a per-namespace basis.

Slide 49

Resource Quotas – Example

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 2Gi
    limits.cpu: "10"
    limits.memory: 3Gi
    requests.storage: 120Gi

apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-quota
  namespace: staging
spec:
  hard:
    pods: "30"
    services: "2"
    services.loadbalancers: "0"
    persistentvolumeclaims: "5"

Slide 50

Resource Quotas + PriorityClass

Set different quotas for pods per PriorityClass (alpha in Kubernetes 1.11, disabled by default):

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: low
value: 10

apiVersion: v1
kind: ResourceQuota
metadata:
  name: low-priority-compute
spec:
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["low"]
  hard:
    pods: "100"
    cpu: "10"
    memory: 12Gi

apiVersion: v1
kind: Pod
metadata:
  name: unimportant-pod
spec:
  priorityClassName: low
  containers: [...]

Slide 51

Limit Range

Specify {default, min, max} resource constraints for each pod/container per namespace. If a pod spec doesn't specify limits/requests, these defaults are applied:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-compute-limits
spec:
  limits:
  - type: Container   # defaults apply per container
    default:
      memory: 128Mi
      cpu: 200m
    defaultRequest:
      memory: 64Mi
      cpu: 100m

Slide 52

Limit Range

Specify {default, min, max} resource constraints for each pod/container.

apiVersion: v1
kind: LimitRange
metadata:
  name: compute-limits
spec:
  limits:
  - type: "Container"
    min:              # a container cannot request fewer resources than these
      memory: 32Mi
      cpu: 10m
    max:              # a container cannot have more resources than these
      memory: 800Mi
      cpu: "2"

Slide 53

Pod Anti-Affinity

Constrain scheduling of pods based on the labels of other pods already scheduled on a node. Example ("keep me off of nodes that have pods without the 'billing' label"):

apiVersion: v1
kind: Pod
metadata:
  name: foo
  labels:
    team: "billing"
spec:
  ...

apiVersion: v1
kind: Pod
metadata:
  name: bar
  labels:
    team: "billing"
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: "kubernetes.io/hostname"
        labelSelector:
          matchExpressions:
          - key: "team"
            operator: NotIn
            values: ["billing"]

Slide 54

Dedicated Nodes

Use taints on nodes and tolerations on pods to dedicate a partition of the cluster to particular pods/users. Useful for partitioning/dedicating special machines in the cluster (e.g., GPU nodes reserved for the ML team) to the team(s) that asked for them.

Slide 55

Dedicated Nodes

You can apply "taints" to Kubernetes Engine node pools at creation time:

$ gcloud container node-pools create gpu-pool \
    --cluster=example-cluster \
    --node-taints=team=machine-learning:NoSchedule

(This is better than the "kubectl taint nodes" command, as it keeps working when node pools resize or nodes are auto-repaired.)

Slide 56

Dedicated Nodes

You can apply "taints" to Kubernetes Engine node pools at creation time:

$ gcloud container node-pools create gpu-pool \
    --cluster=example-cluster \
    --node-taints=team=machine-learning:NoSchedule

(This is better than the "kubectl taint nodes" command, as it keeps working when node pools resize or nodes are auto-repaired.)

Use a "toleration" on the pods from this team:

apiVersion: v1
kind: Pod
metadata:
  labels:
    team: "machine-learning"
spec:
  tolerations:
  - key: "team"
    operator: "Equal"
    value: "machine-learning"
    effect: "NoSchedule"

Slide 57

Sandboxed Pods

Linux kernel bugs and security vulnerabilities may bypass container security boundaries. Approaches in this space:
● Kata Containers
● gVisor (Google's approach!)

Check out the talk: IO310 – Sandboxing your containers with gVisor

Slide 58

gVisor – Google's approach to sandboxed pods

● Sandbox for containers
● Implements Linux system calls in user space
● Zero config
● Written in Go

(diagram labels: Container → System Calls → gVisor → Limited System Calls → Kernel → Hardware; independent kernel, virtualization-based, strong isolation)

Slide 59

gVisor on Kubernetes – Architecture

● runsc: OCI runtime powered by gVisor
● Sentry (emulated Linux kernel) is the 1st isolation boundary
● seccomp + namespaces are the 2nd isolation boundary
● Gofer handles network and file I/O

(diagram: Kubernetes → OCI → runsc; inside the sandbox, the Sentry user-space kernel runs on KVM / seccomp+ns and talks to a Gofer over 9P on the host Linux kernel)

Slide 60

Sandbox Pods in Kubernetes – Work In Progress

RuntimeClass is a new API to specify runtimes:

apiVersion: v1alpha1
kind: RuntimeClass
metadata:
  name: gvisor
spec:
  runtimeHandler: gvisor
  ...

Specify the RuntimeClass in your Pod spec:

apiVersion: v1
kind: Pod
...
spec:
  ...
  runtimeClassName: gvisor

Slide 61

5. Applying multi-tenancy & Current limitations

Slide 62

Scalable Policy Management

You wrote all these policies, but how do you deploy and manage them in practice? Keeping Kubernetes/IAM policies up to date across namespaces / clusters / projects is difficult!

Slide 63

Kubernetes Engine Policy Management — NEW! (alpha)

Centrally defined policies:
● Single source of truth
● ...as opposed to "git" vs "Kubernetes API" vs "Cloud IAM"

Applies policies hierarchically:
● Organization → Folder → Project → Cluster → Namespace
● Policies are inherited

Lets you manage namespaces, RBAC, and more...

Check out the talk (happening now): IO200 – Take Control of your Multi-cluster, Multi-Tenant Kubernetes Workloads
Participate in the alpha: goog.page.link/kpm-alpha

Slide 64

Kubernetes Multi-tenancy Limitations Today

Kubernetes API:
● API calls are currently not rate limited, so the apiserver is open to DoS from tenants, impacting others.

Networking:
● Networking is not a schedulable resource in Kubernetes yet (cannot be used with limits/requests).
● Tenants can still discover each other via Kubernetes DNS.

Many more...

Slide 65

Key Takeaways

Determine your use case:
● How trusted are your tenant users and workloads?
● What degree and kinds of isolation do you need?

Namespace-centric multi-tenancy:
● Utilize policy objects for scheduling and access control.
● Think about personas and map them to RBAC cluster roles.
● Automate policies across clusters with GKE Policy Management (alpha).

Slide 66

Participate!

Kubernetes Multi-tenancy Working Group
- https://github.com/kubernetes/community/tree/master/wg-multitenancy
- kubernetes-wg-multitenancy@googlegroups.com
- Organizers:
  - David Oppenheimer (@davidopp), Google
  - Jessie Frazelle (@jessfraz), Microsoft

Kubernetes Policy Working Group
- https://github.com/kubernetes/community/tree/master/wg-policy
- kubernetes-wg-policy@googlegroups.com

Register your interest at: gke.page.link/multi-tenancy

Slide 67

Thank you.

Ahmet Alp Balkan (@ahmetb)
Yoshi Tamura (@yoshiat)

Register your interest at: gke.page.link/multi-tenancy

Slide 68

Internal Billing/Chargeback

Example: "the testing team has 10,000 CPU hours per month"

Most of the resources are billable on the cloud:
● Compute: CPU/memory
● Networking: transfer costs, load balancing, reserved IPs
● Storage: persistent disks, SSDs
● Other services (Cloud Pub/Sub, Cloud SQL, ...) provisioned through Service Catalog

Kubernetes doesn't offer a way to do internal chargeback for compute/cloud resources used.
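
There is no built-in chargeback, but per-namespace consumption can be observed as a starting point — a hedged sketch ("kubectl top" assumes a metrics pipeline such as Heapster/metrics-server is running; the namespace name is illustrative):

# Current quota consumption for a tenant namespace
$ kubectl describe resourcequota --namespace=staging

# Live CPU/memory usage per pod in that namespace
$ kubectl top pods --namespace=staging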

Slide 69

Approvals & Reviews

Function / ldap / Date / LGTM / Notes:
● Speaker(s): ahmetb / yoshiat — ahmetb → Done (7/19); yoshiat →
● Peer Reviewer: davidopp — 7/23 — a couple of small remaining comments to resolve, but nothing to block LGTM
● PR: jacinda
● Legal:
● Design:
● PMM: hrdinsky / praveenz
● Practice Buddy (optional):