
Multi-tenancy Best Practices for Google Kubernetes Engine


Video: https://www.youtube.com/watch?v=RkY8u1_f5yY

In this talk, we cover the Kubernetes APIs and GKE features that allow you to build a multi-tenant cluster.

Refer to the documentation at https://cloud.google.com/kubernetes-engine/ or https://kubernetes.io for up-to-date instructions, since information in a slide deck can go out of date quickly.

Ahmet Alp Balkan

July 26, 2018


Transcript

  1. IO232: Multi-Tenancy Best Practices for Google Kubernetes Engine
     Ahmet Alp Balkan, Software Engineer, Google Cloud
     Yoshi Tamura, Product Manager, Google Cloud
     Thursday, July 26 1
  2. Who are we? Ahmet Alp Balkan (@ahmetb) Software Engineer at

    Developer Relations I work on making Kubernetes Engine easier to understand and use for developers and operators and write open source tools for Kubernetes. Previously, I worked at Microsoft Azure, on porting Docker to Windows and ACR. I maintain "kubectx". 2
  3. Yoshi Tamura (@yoshiat) Product Manager, Kubernetes Engine I work on

    Multi-tenancy and Hardware Accelerators (GPU and Cloud TPU) in Kubernetes Engine. Who are we? 3
  4. Practical Multi-Tenancy on Kubernetes Engine
     The following slides are heavily inspired by the KubeCon EU '18 talk by David Oppenheimer, Software Engineer, Google. 4 Register your interest at: gke.page.link/multi-tenancy
  5. trust multi-tenancy modes isolation access control resource usage scheduling multi-tenancy

    features policy management preventing contention billing 5

  7. Do you trust...
     • Your compiler*
     • Operating system
     • Dependencies
     • Deployment pipeline
     • Container runtime
     ...
     * Bonus reading on compilers:
     - Reflections on Trusting Trust. Ken Thompson. 1984. CACM 27, 8 (August 1984), 761-763.
     - Fully Countering Trusting Trust through Diverse Double-Compiling. D. A. Wheeler. PhD thesis, George Mason University, Oct. 2009. 11
  8. Levels of trust: software multi-tenancy
     • Trusted: the code comes from an audited source, built and run by trusted components (a.k.a. "the dream")
     • Semi-trusted: trusted code, but with 3rd-party dependencies or software that is not fully audited (a.k.a. most people)
     • Non-trusted: the code comes from potentially hostile users; cannot assume good intent (a.k.a. hosting providers) 12
  9. Cluster per Tenant
     (Diagram: separate clusters under project-1 and project-2.)
     Pros:
     • Separate control plane (API) for each tenant (for free*)
     • Strong network isolation (if it's a per-cluster VPC)
     However:
     • Need tools to manage 10s or 100s of clusters
     • Resource/configuration fragmentation across clusters
     • Slow turn-up: need to create a cluster for each new tenant
     * The Google Kubernetes Engine control plane (master) is free of charge. 15
  10. Namespace per tenant (intra-cluster multi-tenancy) Namespaces provide logical isolation between

    tenants on a cluster. Kubernetes policies are namespace-scoped. • Logical isolation between tenants • Policies for API access restrictions & resource usage constraints Pros: • Tenants can reuse extensions/controllers/CRDs • Shared control plane (=shared ops, shared security/auditing…) ns1 ns2 ns3 ns4 16
  11. Kubernetes Engine primitives Quotas Network Policy Pod Security Policy Pod

    Priority Limit Range IAM Sandbox Pods RBAC Access Control Resource Sharing Runtime Isolation Pod Affinity/Anti-Affinity Admission Control 17
  12. Multi-tenancy use cases in Kubernetes: Enterprise, SaaS (Software as a Service), KaaS (Kubernetes as a Service) 19
  13. All users from the same company/organization Namespaces ⇔ Tenants ⇔

    Teams Semi-trusted tenants (you can fire them on violation) Cluster Roles: • Cluster Admin ◦ CRUD any policy objects ◦ Create/assign namespaces to “Namespace Admins” ◦ Manage policies (resource usage quotas, networking) • Namespace Admin ◦ Manage users in the namespace(s) they own. • User ◦ CRUD non-policy objects in the namespace(s) they have access to “Enterprise” Model Control Plane (apiserver) Cluster Admin ns2 ns3 ns4 ns1 Namespace Admin Namespace Admin Namespace Admin 20
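     One common way to wire up the "Namespace Admin" persona is a namespace-scoped RoleBinding to Kubernetes' built-in "admin" ClusterRole; a minimal sketch, where the namespace "team-a" and the user are placeholders:

       kind: RoleBinding
       apiVersion: rbac.authorization.k8s.io/v1
       metadata:
         name: team-a-namespace-admin
         namespace: team-a            # placeholder namespace owned by this tenant
       roleRef:
         kind: ClusterRole
         name: admin                  # built-in ClusterRole with broad namespace-level permissions
         apiGroup: rbac.authorization.k8s.io
       subjects:
       - kind: User
         name: "[email protected]"        # placeholder Google account of the namespace admin
         apiGroup: rbac.authorization.k8s.io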
  14. Many apps from different teams, semi-trusted • Vanilla container isolation

    may suffice • If not: Sandboxing with gVisor, limit capabilities, use seccomp/AppArmor/... Network isolation: • Allow all traffic within a namespace • Whitelist traffic from/to other namespaces (=teams) “Enterprise” Model Control Plane (apiserver) Cluster Admin ns2 ns3 ns4 ns1 Namespace Admin Namespace Admin Namespace Admin 21
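     A minimal sketch of the "allow all traffic within a namespace" rule (the namespace name is a placeholder); because the policy selects every pod, it also isolates them, so traffic from other namespaces then has to be whitelisted explicitly:

       kind: NetworkPolicy
       apiVersion: networking.k8s.io/v1
       metadata:
         name: allow-same-namespace
         namespace: team-a          # placeholder tenant namespace
       spec:
         podSelector: {}            # applies to (and isolates) all pods in the namespace
         ingress:
         - from:
           - podSelector: {}        # allow ingress only from pods in this same namespace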
  15. “Software as a Service” model Control Plane (apiserver) Cluster Admin

    SaaS API/proxy SaaS Consumers cluster 22 Consumer deploys their app through a custom control plane.
  16. “Software as a Service” model 23 Control Plane (apiserver) Cluster

    Admin SaaS API/proxy SaaS Consumers cluster Consumer deploys their app through a custom control plane. After the app is deployed, customers directly connect to the app. Example: Wordpress hosting
  17. “Software as a Service” model 24 Control Plane (apiserver) Cluster

    Admin SaaS API/proxy SaaS Consumers cluster Consumer deploys their app through a custom control plane. After the app is deployed, customers directly connect to the app. Example: Wordpress hosting SaaS API is a trusted client of Kubernetes. Cluster admins can access the Kubernetes API directly. Tenant workloads may have untrusted pieces: • such as WordPress extensions • may require sandboxing with gVisor etc.
  18. Untrusted tenants running untrusted code. (Platform as a Service or

    hosting companies.) Tenants may create their namespaces, but cannot set policy objects. Stronger isolation requirements than enterprise/SaaS: • isolated world view (separate control plane) • tenants must not see each other • strong node and network isolation ◦ sandbox pods ◦ sole-tenant nodes ◦ multi-tenant networking/DNS “Kubernetes as a Service” model Control Plane (apiserver) Cluster Admin ns1 ns2 ns3 ns4 25
  20. Kubernetes Engine multi-tenancy primitives Quotas Network Policy Pod Security Policy

    Pod Priority Limit Range IAM Sandbox Pods RBAC Access Control Resource Sharing Runtime Isolation Pod Security Context Pod Affinity Admission Control 28
  21. Kubernetes Engine multi-tenancy primitives Quotas Network Policy Pod Priority Limit

    Range IAM Sandbox Pods RBAC Auth related Scheduling related Pod Security Context Pod Affinity Admission Control Pod Security Policy 29
  22. Authentication, Authorization, Admission Control Plane (apiserver) Authorizer Pluggable Auth (GKE

    IAM) RBAC Admission Control allow etcd Cloud IAM Policies {Cluster,}Role {Cluster,}RoleBinding allow Pods 31
  23. Kubernetes RBAC
     Mostly useful for:
     • Giving access to pods calling the Kubernetes API (with Kubernetes Service Accounts)
     • Giving fine-grained access to people/groups calling the Kubernetes API (with Google accounts)
     Concepts:
     • ClusterRole: a preset of capabilities, cluster-wide
     • Role: like a ClusterRole, but namespace-scoped
     • ClusterRoleBinding: gives the permissions of a ClusterRole to Google users/groups, Google Cloud IAM service accounts, or Kubernetes service accounts
     • RoleBinding: like a ClusterRoleBinding, but namespace-scoped 33
  24. Kubernetes RBAC: example ClusterRole + ClusterRoleBinding for namespace-creator:

       kind: ClusterRole
       apiVersion: rbac.authorization.k8s.io/v1
       metadata:
         name: "namespace-creator"
       rules:
       - apiGroups: [""]   # core
         resources: ["namespaces"]
         verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

       kind: ClusterRoleBinding
       apiVersion: rbac.authorization.k8s.io/v1
       metadata:
         name: "admins:namespace-creator"
       roleRef:
         kind: ClusterRole   # a ClusterRoleBinding must reference a ClusterRole
         name: "namespace-creator"
         apiGroup: rbac.authorization.k8s.io
       subjects:
       - kind: User
         name: "[email protected]"   # Google user
         apiGroup: rbac.authorization.k8s.io 34
  25. Practical for giving Google users/groups project-wide access: Curated IAM “Roles”:

    Kubernetes Engine + Cloud IAM Admin *Can do everything* Viewer *Can view everything* Cluster Admin Can manage clusters (create/delete/upgrade clusters) Cannot view what's in the clusters (Kubernetes API) Developer Can do everything in a cluster (Kubernetes API) Cannot manage clusters (create/delete/upgrade clusters) You can curate new ones with Cloud IAM Custom Roles. 35
  26. Kubernetes Engine + IAM Give someone "Developer" role on all

    clusters in the project: gcloud projects add-iam-policy-binding PROJECT_ID \ --member=user:[email protected] \ --role=roles/container.developer Give a Google Group "Viewer" role on all clusters in the project: gcloud projects add-iam-policy-binding PROJECT_ID \ --member=group:[email protected] \ --role=roles/container.viewer 36
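     IAM roles are project-wide, so a common pattern for per-namespace access is to pair a coarse IAM role (enough to reach the cluster) with a namespaced RoleBinding; a sketch, where the namespace and user are placeholders and "edit" is Kubernetes' built-in ClusterRole:

       $ gcloud projects add-iam-policy-binding PROJECT_ID \
           --member=user:[email protected] \
           --role=roles/container.viewer
       $ kubectl create rolebinding someone-edit-team-a \
           --clusterrole=edit \
           --user=[email protected] \
           --namespace=team-a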
  27. Admission Controls
     Intercept API requests before the resource is persisted. Admission control can mutate objects and allow/deny requests. (Diagram: a request passes through the Admission Control plugins and, on "allow", the object is persisted to etcd.) 37
  28. Admission Controls
     Compiled into the Kubernetes apiserver binary. Enabled admission plugins cannot be changed on Kubernetes Engine. But these 15 admission plugins are already enabled: Initializers, NamespaceLifecycle, LimitRanger, ServiceAccount, PersistentVolumeLabel, DefaultStorageClass, DefaultTolerationSeconds, NodeRestriction, PodPreset, ExtendedResourceToleration, PersistentVolumeClaimResize, Priority, StorageObjectInUseProtection, MutatingAdmissionWebhook, ValidatingAdmissionWebhook 38
  29. Extending Admission Controls You can develop webhooks to create your

    own Admission Controllers. Admission Control etcd ValidatingAdmissionWebHook MutatingAdmissionWebHook allow <your webhooks> Other Admission Plugins <your webhooks> 39
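     A sketch of what registering such a webhook could look like (the service name, namespace, path, and CA bundle are placeholders; the API version shown is the v1beta1 form current at the time of this talk):

       apiVersion: admissionregistration.k8s.io/v1beta1
       kind: ValidatingWebhookConfiguration
       metadata:
         name: my-policy-webhook
       webhooks:
       - name: pods.policy.example.com
         rules:
         - operations: ["CREATE"]
           apiGroups: [""]
           apiVersions: ["v1"]
           resources: ["pods"]
         failurePolicy: Fail                # reject requests if the webhook is unreachable
         clientConfig:
           service:
             name: my-webhook-svc           # placeholder in-cluster Service fronting your webhook
             namespace: webhooks            # placeholder namespace
             path: "/validate"
           caBundle: "<base64-encoded CA certificate>"   # placeholder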
  30. PodSecurityPolicy Restricts access to host {filesystem, network, ports, PID namespace,

    IPC namespace}... Limits privileged containers and volume types, enforces a read-only root filesystem, etc. Enforced through its own admission plugin. (Diagram: the PSP admission controller checks the Pod spec against the PodSecurityPolicy spec and allows/denies.) 40
  31. PodSecurityPolicy

       apiVersion: policy/v1beta1
       kind: PodSecurityPolicy
       metadata:
         name: prevent-root-privileged
       spec:
         # Don't allow privileged pods!
         privileged: false
         # Don't allow root containers!
         runAsUser:
           rule: "MustRunAsNonRoot"

     Grant use of the policy via RBAC:

       $ kubectl create role psp:unprivileged \
           --verb=use \
           --resource=podsecuritypolicy \
           --resource-name=prevent-root-privileged
       $ kubectl create rolebinding developers:unprivileged \
           --role=psp:unprivileged \
           [email protected] \
           [email protected]

     A pod that requests privileged mode is rejected (REJECT):

       apiVersion: v1
       kind: Pod
       metadata:
         name: foo
       spec:
         containers:
         - name: pause
           image: k8s.gcr.io/pause
           securityContext:
             privileged: true 41
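     For contrast, a pod that satisfies the policy above would be admitted; a minimal sketch (the pod name and UID are made up):

       apiVersion: v1
       kind: Pod
       metadata:
         name: foo-unprivileged
       spec:
         containers:
         - name: pause
           image: k8s.gcr.io/pause
           securityContext:
             runAsNonRoot: true
             runAsUser: 1000     # any non-zero UID satisfies the MustRunAsNonRoot rule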
  32. Network Policy
     Controls which pods can talk to which other pods (based on their namespaces/labels) or IP ranges. Available on Kubernetes Engine with the Calico network plugin (--enable-network-policy). 42
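     For example, network policy enforcement can be turned on when creating a cluster (the cluster name is a placeholder):

       $ gcloud container clusters create example-cluster --enable-network-policy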
  33. Network Policy. Example: allow traffic to "mysql" pods from "frontend" pods

       kind: NetworkPolicy
       apiVersion: networking.k8s.io/v1
       metadata:
         name: db-allow-frontend
       spec:
         podSelector:
           matchLabels:
             app: mysql
         ingress:
         - from:
           - podSelector:
               matchLabels:
                 app: frontend 43
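     A policy only isolates the pods it selects ("mysql" above); pods not selected by any policy still accept all traffic. A common companion (sketch) is a per-namespace default-deny ingress policy so every pod starts isolated:

       kind: NetworkPolicy
       apiVersion: networking.k8s.io/v1
       metadata:
         name: default-deny-ingress
       spec:
         podSelector: {}            # selects every pod in the namespace
         policyTypes: ["Ingress"]   # no ingress rules listed, so all inbound traffic is denied
                                    # unless another policy (like db-allow-frontend) allows it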
  34. Pod Priority/Preemption (beta in Kubernetes 1.11)
     Pod Priority: puts high-priority pods waiting in Pending state at the front of the scheduling queue.
     Pod Preemption: evicts lower-priority pod(s) from a node if a high-priority pod cannot be scheduled due to insufficient space/resources in the cluster.
     Use PriorityClasses to define:

       apiVersion: scheduling.k8s.io/v1beta1
       kind: PriorityClass
       metadata:
         name: "high"
       value: 1000000

       apiVersion: scheduling.k8s.io/v1beta1
       kind: PriorityClass
       metadata:
         name: "normal"
       value: 1000
       globalDefault: true

       apiVersion: scheduling.k8s.io/v1beta1
       kind: PriorityClass
       metadata:
         name: "low"
       value: 10 47
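     Pods then opt into a class by name; a minimal sketch (the pod name and image are placeholders):

       apiVersion: v1
       kind: Pod
       metadata:
         name: important-pod
       spec:
         priorityClassName: "high"
         containers:
         - name: app
           image: k8s.gcr.io/pause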
  35. Resource Quotas
     Limit the total memory/CPU/storage that pods can use, and how many objects of each type (pods, load balancers, ConfigMaps, etc.) can be created, on a per-namespace basis. 48
  36. Resource Quotas: Example

       apiVersion: v1
       kind: ResourceQuota
       metadata:
         name: compute-quota
         namespace: staging
       spec:
         hard:
           requests.cpu: "8"
           requests.memory: 2Gi
           limits.cpu: "10"
           limits.memory: 3Gi
           requests.storage: 120Gi

       apiVersion: v1
       kind: ResourceQuota
       metadata:
         name: object-quota
         namespace: staging
       spec:
         hard:
           pods: "30"
           services: "2"
           services.loadbalancers: "0"
           persistentvolumeclaims: "5" 49
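     To see how much of a quota a namespace has consumed versus its hard limits, you can query it with kubectl, e.g.:

       $ kubectl get resourcequota --namespace=staging
       $ kubectl describe resourcequota compute-quota --namespace=staging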
  37. Resource Quotas + PriorityClass
     Set different quotas for pods per PriorityClass (alpha in Kubernetes 1.11, disabled by default).

       apiVersion: scheduling.k8s.io/v1beta1
       kind: PriorityClass
       metadata:
         name: low
       value: 10

       apiVersion: v1
       kind: ResourceQuota
       metadata:
         name: low-priority-compute
       spec:
         scopeSelector:
           matchExpressions:
           - operator: In
             scopeName: PriorityClass
             values: ["low"]
         hard:
           pods: "100"
           cpu: "10"
           memory: 12Gi

       apiVersion: v1
       kind: Pod
       metadata:
         name: unimportant-pod
       spec:
         containers: [...]
         priorityClassName: low 50
  38. Limit Range
     Specify {default, min, max} resource constraints for each pod/container per namespace. If a pod spec doesn't specify limits/requests, these defaults are used.

       apiVersion: v1
       kind: LimitRange
       metadata:
         name: default-compute-limits
       spec:
         limits:
         - type: Container   # defaults are applied per container
           default:
             memory: 128Mi
             cpu: 200m
           defaultRequest:
             memory: 64Mi
             cpu: 100m 51
  39. Limit Range
     Specify {default, min, max} resource constraints for each pod/container.

       apiVersion: v1
       kind: LimitRange
       metadata:
         name: compute-limits
       spec:
         limits:
         - type: "Container"
           min:            # a container cannot have less resources than these
             memory: 32Mi
             cpu: 10m
           max:            # a container cannot have more resources than these
             memory: 800Mi
             cpu: "2" 52
  40. Pod Anti-Affinity
     Constrain scheduling of pods based on the labels of other pods already scheduled on the node. Example ("keep me off of nodes that have pods without the 'billing' label"):

       apiVersion: v1
       kind: Pod
       metadata:
         name: foo
         labels:
           team: "billing"
       spec: ...

       apiVersion: v1
       kind: Pod
       metadata:
         name: bar
         labels:
           team: "billing"
       spec:
         affinity:
           podAntiAffinity:
             requiredDuringSchedulingIgnoredDuringExecution:
             - topologyKey: "kubernetes.io/hostname"
               labelSelector:
                 matchExpressions:
                 - key: "team"
                   operator: NotIn
                   values: ["billing"] 53
  41. Dedicated Nodes
     Use taints on nodes and tolerations on pods to dedicate a partition of the cluster to particular pods/users. Useful for partitioning/dedicating special machines in the cluster to the team(s) that asked for them. (Diagram: GPU nodes reserved for the ML team alongside regular nodes.) 54
  42. You can apply "taints" to Kubernetes Engine node-pools at creation

    time: $ gcloud container node-pools create gpu-pool \ --cluster=example-cluster \ --node-taints=team=machine-learning:NoSchedule (This is better than “kubectl taint nodes” command as it keeps working when node pools resize or nodes are auto-repaired.) Dedicated Nodes 55
  43. Dedicated Nodes
     You can apply "taints" to Kubernetes Engine node pools at creation time:

       $ gcloud container node-pools create gpu-pool \
           --cluster=example-cluster \
           --node-taints=team=machine-learning:NoSchedule

     (This is better than the "kubectl taint nodes" command, as it keeps working when node pools resize or nodes are auto-repaired.)
     Use a "toleration" on the pods from this team:

       apiVersion: v1
       kind: Pod
       metadata:
         labels:
           team: "machine-learning"
       spec:
         tolerations:
         - key: "team"
           operator: "Equal"
           value: "machine-learning"
           effect: "NoSchedule" 56
  44. Sandboxed Pods Linux kernel bugs and security vulnerabilities may bypass

    container security boundaries. Approaches in this space: • Kata Containers • gVisor (Google’s approach!) Check out talk: IO310-Sandboxing your containers with gVisor 57
  45. gVisor - Google approach to Sandbox Pods Sandbox for Containers

    Implements Linux system calls in user space Zero config Written in Go Container Kernel System Calls Hardware gVisor Limited System Calls Independent Kernel Virtualization-based Strong Isolation 58
  46. gVisor on Kubernetes - Architecture runsc: OCI runtime powered by

    gVisor Sentry (emulated Linux Kernel) is the 1st isolation boundary seccomp + namespace is the 2nd isolation boundary Gofer handles Network and File I/O KVM Gofer Host Linux Kernel Container Sentry (emulated Linux Kernel) Sandbox User Kernel 9P seccomp + ns runsc OCI Kubernetes 59
  47. Sandbox Pods in Kubernetes (Work In Progress)
     RuntimeClass is a new API to specify runtimes. Specify the RuntimeClass in your Pod spec.

       apiVersion: v1alpha1
       kind: RuntimeClass
       metadata:
         name: gvisor
       spec:
         runtimeHandler: gvisor
       ...

       apiVersion: v1
       kind: Pod
       ...
       spec:
         ...
         runtimeClassName: gvisor
  48. project cluster1 You wrote all these policies, but how do

    you deploy and manage them in practice? Keeping Kubernetes/IAM policies up to date across namespaces / clusters / projects is difficult! Scalable Policy Management ns2 ns1 cluster2 ns2 ns1 project cluster3 ns2 ns1 project cluster4 ns2 ns1 62
  49. Kubernetes Engine Policy Management NEW! (alpha) Centrally defined policies. •

    Single source of truth • ..as opposed to "git" vs "Kubernetes API" vs "Cloud IAM" Applies policies hierarchically. • Organization → Folder → Project → Cluster → Namespace • Policies are inherited. Lets you manage namespaces, RBAC, and more… Check out talk (happening now): IO200-Take Control of your Multi-cluster, Multi-Tenant Kubernetes Workloads Participate in alpha: goog.page.link/kpm-alpha 63
  50. Kubernetes Multi-tenancy Limitations Today
     Kubernetes API:
     • Currently, API calls are not rate limited, leaving the apiserver open to DoS from one tenant that impacts others.
     Networking:
     • Networking is not a schedulable resource in Kubernetes yet (it cannot be expressed via limits/requests).
     • Tenants can still discover each other via Kubernetes DNS.
     Many more... 64
  51. Determine your use case • How trusted are your tenant

    users and workloads? • What degree and kinds of isolation do you need? Namespace-centric multi-tenancy • Utilize Policy objects for scheduling and access control. • Think about personas and map them to RBAC cluster roles. • Automate policies across clusters with GKE Policy Management (alpha). Key Takeaways 65
  52. Kubernetes Multi-tenancy Working Group - https://github.com/kubernetes/community/tree/master/wg-multitenancy - [email protected] - Organizers:

    - David Oppenheimer (@davidopp), Google - Jessie Frazelle (@jessfraz), Microsoft Kubernetes Policy Working Group - https://github.com/kubernetes/community/tree/master/wg-policy - [email protected] Participate! 66 Register your interest at: gke.page.link/multi-tenancy
  53. Thank you. Ahmet Alp Balkan (@ahmetb) Yoshi Tamura (@yoshiat) 67

    Register your interest at: gke.page.link/multi-tenancy
  54. Example: “testing team has 10,000 CPU hours per month” Most

    of the resources are billable on the cloud: • Compute: CPU/memory • Networking: transfer costs, load balancing, reserved IPs • Storage: persistent disks, SSDs • Other services (Cloud PubSub, Cloud SQL, …) provisioned through Service Catalog. Kubernetes doesn't offer a way to do internal chargeback for compute/cloud resources used. Internal Billing/Chargeback 68
  55. Approvals & Reviews (internal checklist: Function / ldap / Date / Notes)
     • Speaker(s): ahmetb / yoshiat; ahmetb → Done (7/19); yoshiat →
     • Peer Reviewer: davidopp; 7/23; a couple of small remaining comments to resolve, but nothing to block LGTM
     • PR: jacinda
     • Legal
     • Design
     • PMM: hrdinsky / praveenz
     • Practice Buddy (optional) 69