What does “production ready” really mean for a Kubernetes cluster? -- Umeå May 2019

An updated version of the presentation I did at KubeCon Copenhagen: https://speakerdeck.com/luxas/what-does-production-ready-really-mean-for-a-kubernetes-cluster
Online slides: https://docs.google.com/presentation/d/1k5gDZzGgbXAuLQ5oeQAl1t-JhqBkvcJ4qoINALA47Do/edit

I gave this talk at a meetup in Umeå: https://www.meetup.com/Cloud-Native-Northern-Sweden/events/260276456/
The talk was not recorded.
Location: XLENT Norr, Umeå, Sweden

Lucas Käldström

May 07, 2019

Transcript

  1. 1 What does “production ready” really mean for a Kubernetes cluster?
    Lucas Käldström - CNCF Ambassador
    7th of May, 2019 - Umeå
    Image credit: @ashleymcnamara
  2. 2 $ whoami
    Lucas Käldström, High School Student, 19 years old
    CNCF Ambassador, Certified Kubernetes Administrator and Kubernetes SIG Lead
    KubeCon Speaker in Berlin, Austin, Copenhagen, Shanghai & Seattle
    Kubernetes Approver and Subproject Owner, active in the community for ~3 years. Got kubeadm to GA.
    Driving luxas labs which currently performs contracting for Weaveworks
    A guy that has never attended a computing class
  3. 3 Agenda
    1. Define the buzzwords!
      a. What does “production-ready” mean to you?
      b. What are the requirements for a highly available cluster?
    2. What to think about when securing the cluster
      a. TLS certificates for all components
      b. Enable and set up RBAC (Role Based Access Control)
      c. Attack vectors you might not have thought about before
  4. 4 Agenda
    3. Make the cluster highly-available if needed
      a. Do you need it?
      b. How to set up a HA cluster with kubeadm
      c. “Attack vectors” you might not have thought about before
    4. Use the Cluster API for controlling the cluster declaratively
      a. Intro to the Cluster API
      b. How to set up Kubernetes using the Cluster API and upgrade/rollback
  5. 5 Agenda
    5. Essential Kubernetes Addons
      a. Container Runtime & Registry
      b. Monitoring the cluster
      c. Centralized Logging & Audit Logging
      d. Out-of-tree Cloud Providers
      e. Ingress Controllers
      f. Persistent Storage with CSI
  6. 6 Which layer are you talking about?
    Diagram with three layers:
    Applications: Application A, Application B, App C, App D, App E
    Kubernetes cluster: Master A ... Master N, Node 1 ... Node N
    Machines
    Annotation: “Focusing on this layer”
  7. 8 “The cluster is production ready when it is in a good enough shape for the user to serve real-world traffic”
  8. 9 “Your offering is production ready when it slightly exceeds your customer’s expectations in a way that allows for business growth”
    -- Carter Morgan, Google (@_askcarter)
  9. 12 Production-ready cluster?
    1. The cluster is reasonably secure
    2. The cluster components are highly available enough for the user’s needs
    3. All elements in the cluster are declaratively controlled
    4. Changes to the cluster state can be safely applied (upgrades/rollbacks)
    5. The cluster passes as many end-to-end tests as possible
  10. 13 Kubernetes’ high-level component architecture
    Diagram:
    User
    Master: API Server (REST API), Controller Manager (Controller Loops), Scheduler (Bind Pod to Node), etcd (key-value DB, SSOT)
    Nodes (Node 1, Node 2, Node 3), each with: OS, Container Runtime, Kubelet, Networking
    Legend: CNI, CRI, OCI, Protobuf, gRPC, JSON
  11. 14 What about “high availability”?
    1. Instances (>=1) of a component can fail without causing the cluster to fail
    2. Machines (>=1) in the cluster can fail without causing the cluster to fail
    More about this in section III.
  12. 16 1. TLS-secured communication everywhere!
    a. Use mutual TLS for all communication
    b. Certificates/identities should be rotatable
    c. Use a separate CA for etcd
    d. Use the Certificates/CSR API, with an external key signer if possible
    e. Encrypt Secrets stored in etcd
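
To illustrate point (e), here is a minimal sketch of an encryption-at-rest configuration for the API server. It assumes a Kubernetes version that supports the `apiserver.config.k8s.io/v1` `EncryptionConfiguration` kind; the key name `key1` and the placeholder secret are illustrative assumptions, not values from the talk.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # First provider is used for writes: AES-CBC with the key below.
      - aescbc:
          keys:
            - name: key1                       # hypothetical key name
              secret: <BASE64-ENCODED 32-BYTE KEY>  # placeholder, generate your own
      # Fallback so that Secrets stored before encryption was enabled stay readable.
      - identity: {}
```

The file is handed to the API server with the `--encryption-provider-config` flag. Secrets that existed before the flag was set remain unencrypted in etcd until they are written again.
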
  13. 17 2. API Authentication and Authorization
    a. Disable ABAC, Anonymous Authentication and Insecure HTTP access
    b. Enforce the RBAC and Node Authorizers
    c. It’s recommended to delegate user authentication to a 3rd-party service
    d. Enable Advanced Audit Logging
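
For point (d), an audit policy could look roughly like the sketch below. It assumes the `audit.k8s.io/v1` policy format; the choice of which resources get which level is an illustrative assumption, not a recommendation from the talk.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the early RequestReceived stage to reduce log volume.
omitStages:
  - "RequestReceived"
rules:
  # Secrets and ConfigMaps: record who did what, but never the payload.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # RBAC changes: record the full request body for later review.
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
  # Everything else: metadata only.
  - level: Metadata
```

The API server picks the policy up through `--audit-policy-file`, with `--audit-log-path` selecting where the log backend writes events. Rules are evaluated in order and the first match wins, so the catch-all rule goes last.
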
  14. 18 3. Lock down the kubelets in the cluster
    a. Each kubelet should have:
      i. unique client credentials
      ii. a serving cert signed by the cluster CA
    b. Disable the readonly port (10255) & public (!) cAdvisor port (4194)
    c. Enforce authn & authz for the main kubelet port (10250)
    d. Enable automatic certificate rotation for the kubelets
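
Most of these kubelet settings can be expressed in a kubelet configuration file. The following is a sketch assuming the `kubelet.config.k8s.io/v1beta1` `KubeletConfiguration` kind and a kubeadm-style CA path (`/etc/kubernetes/pki/ca.crt`, an assumption). It does not cover the cAdvisor port.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# (b) Turn off the unauthenticated read-only port (10255).
readOnlyPort: 0
# (c) Require authentication on the main port (10250) ...
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true          # validate bearer tokens via the API server
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt   # assumed path
# ... and delegate authorization to the API server (SubjectAccessReview).
authorization:
  mode: Webhook
# (d) Rotate the kubelet client certificate automatically.
rotateCertificates: true
# (a.ii) Request the serving certificate from the cluster CA through the CSR API.
serverTLSBootstrap: true
```

With `serverTLSBootstrap` enabled, the serving-certificate CSRs still need to be approved by something, either manually with `kubectl certificate approve` or by a separate approver.
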
  15. 19 4. Be careful with the Dashboard and Helm 2
    a. Don’t give them (or any app!) cluster-admin power; very easy to escalate privileges
    b. The security of the dashboard has improved since v1.7.0
      i. The dashboard now has a login screen and delegates privileges
    c. Specify the exact operations tiller may perform with RBAC
    d. Secure the Helm <-> Tiller communication with TLS certificates
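
As a sketch of point (c), Tiller can run under its own ServiceAccount that is bound to a namespaced Role instead of `cluster-admin`. The namespace `team-a`, the object names, and the exact resource and verb lists are hypothetical and would need to match what your charts actually deploy.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tiller
  namespace: team-a          # hypothetical namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tiller-manager
  namespace: team-a
rules:
  # Core resources the charts may manage (ConfigMaps also hold release data).
  - apiGroups: [""]
    resources: ["configmaps", "secrets", "services", "pods"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Workloads the charts may manage.
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tiller-binding
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: tiller
    namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: tiller-manager
```

Because this is a Role and RoleBinding rather than their Cluster-scoped counterparts, a compromised Tiller is limited to the listed resources in that one namespace.
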
  16. 20 5. Deny by default -- best security practices
    a. Deny-all with RBAC
    b. Deny-all with NetworkPolicy
    c. Set up a restrictive PodSecurityPolicy as the default
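
RBAC is additive, so a subject without any bindings is already denied everything. For point (b), a deny-all NetworkPolicy can be sketched as follows, assuming a network plugin that enforces NetworkPolicy; the namespace `default` and the policy name are illustrative.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all     # illustrative name
  namespace: default
spec:
  # An empty selector matches every Pod in the namespace.
  podSelector: {}
  # Listing both types with no allow rules blocks all ingress and egress.
  policyTypes:
    - Ingress
    - Egress
```

The policy is per namespace, so it has to be created in every namespace that should be locked down. Traffic is then opened up again with additional, narrower policies, and blocking egress also blocks DNS lookups until an explicit allow rule is added.
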
  17. 21 Setting up a dynamic TLS-secured cluster
    Diagram:
    Control Plane: API Server, HTTPS (6443), kubelet client identity O=system:masters; Controller Manager (CN=system:kube-controller-manager), which contains the CSR Approver and CSR Signer; Scheduler (CN=system:kube-scheduler)
    Nodes:
      Kubelet: node-1, self-signed HTTPS (10250), client identity CN=system:node:node-1, O=system:nodes
      Kubelet: node-2 (to be joined), self-signed HTTPS (10250), starts with a Bootstrap Token & trusted CA, then gets CN=system:node:node-2, O=system:nodes
    Legend: Logs / Exec calls, Normal HTTPS, POST CSR, SAR Webhook, PATCH CSR, node-1 CSR, node-2 CSR, Bootstrap Token
    CSR=Certificate Signing Request, SAR=Subject Access Review
  18. 22 More information about Kubernetes security
    1. Try out Aqua Security’s kube-bench project
    2. Official docs: Best Practices for Securing a Kubernetes Cluster
    3. Hacking and Hardening Kubernetes Clusters by Example [I] - Brad Geesaman
    4. 11 Ways (Not) to Get Hacked on the Kubernetes blog
  19. 24 kubeadm = The official tool to bootstrap a minimum viable, best-practice Kubernetes cluster
    Diagram: kubeadm runs on every machine (Master 1 ... Master N, Node 1 ... Node N).
    Layer 1: Cluster API
    Layer 2: kubeadm
    Layer 3: Addon Operators
    Other diagram labels: Cloud Provider, Load Balancers, Monitoring, Logging, Cluster API Spec, Cluster API Implementation, Addons, Kubernetes API, Bootstrapping, Machines, Infrastructure
  20. 25 How to achieve HA with kubeadm?
    Diagram: an external load balancer or DNS-based API server resolving in front of Master A, Master B and Master C. Each master is set up with kubeadm init and runs the API Server, Controller Manager and Scheduler with shared certificates. An HA etcd cluster with three etcd members backs the masters. The Nodes (kubeadm join) run Kubelet 1 to Kubelet 5.
    Do-it-yourself:
    1. Set up HA etcd cluster
    2. Copy certificates from master A to B and C
    3. Set up a loadbalancer in front of the API servers
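
The do-it-yourself steps map onto a kubeadm configuration file roughly as sketched below. This assumes the `kubeadm.k8s.io/v1beta1` config API that shipped with Kubernetes 1.14; the load balancer DNS name and the etcd addresses are placeholders, not values from the talk.

```yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.14.2
# Step 3: every component talks to the API servers through this stable endpoint.
controlPlaneEndpoint: "lb.example.com:6443"      # placeholder load balancer address
# Step 1: point kubeadm at the separately managed HA etcd cluster.
etcd:
  external:
    endpoints:
      - https://10.0.0.11:2379                   # placeholder etcd members
      - https://10.0.0.12:2379
      - https://10.0.0.13:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```

The file is passed to `kubeadm init --config` on the first master. Step 2, copying the shared certificates to the other masters, still has to happen before they are brought up.
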
  21. 26 Is this cluster setup highly-available?
    Diagram: HA etcd cluster with three etcd members; Master A, Master B and Master C, each running the API Server, Controller Manager and Scheduler with shared certificates; Nodes running Kubelet 1 to Kubelet 5; Master D: Loadbalancer.
    No. Single point of failure :(
  22. 27 Other things to keep in mind with a HA cluster
    1. Remember to keep the CoreDNS replicas >= 1, and use Pod anti-affinity
    2. Some certificates need to be identical across control plane nodes
      a. e.g. the ServiceAccount signing private key for the controller-manager
      b. => Needs to be rotated for all instances at the same time
    3. Monitoring the cluster components becomes increasingly important with a HA cluster that is expected to have a high SLO
      a. You can for example use Prometheus and kube-state-metrics as a starting point
    4. Do you need a HA cluster?
      a. Is it worth the added cost and complexity?
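
Point 1 can be sketched as a patch to the CoreDNS Deployment. This is a fragment, not a complete manifest. It assumes the kubeadm default where CoreDNS Pods carry the label `k8s-app: kube-dns`, and it picks two replicas purely as an example.

```yaml
# Fragment of the coredns Deployment in the kube-system namespace.
spec:
  replicas: 2
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Prefer (but do not require) placing replicas on different machines,
          # so that a single machine failure does not take down all of DNS.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    k8s-app: kube-dns
                topologyKey: kubernetes.io/hostname
```

Using the `preferred` form keeps the Pods schedulable on a single-node cluster. The `required` form is stricter but leaves replicas pending when there are fewer machines than replicas.
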
  23. 28 “Monitor it so you know when it fails before your customers do”
    -- Justin Santa Barbara, Google (@justinsb)
  24. 30 Cluster API
    • The What and the Why of Cluster API
      ◦ “To make the management of (X) clusters across (Y) providers simple, secure, and configurable.”
      ◦ “How can I manage any number of clusters in a similar fashion to how I manage deployments in Kubernetes?”
      ◦ “How do I manage other lifecycle events across that infrastructure (upgrades, deletions, etc.)?”
      ◦ “How can we control all of this via a consistent API across providers?”
  25. 31 “GitOps” for your cluster(s)

        apiVersion: cluster.k8s.io/v1alpha1
        kind: MachineDeployment
        metadata:
          name: my-nodes
        spec:
          replicas: 3
          selector:
            matchLabels:
              foo: bar
          template:
            metadata:
              labels:
                foo: bar
            spec:
              providerConfig:
                value:
                  apiVersion: "baremetalconfig/v1alpha1"
                  kind: "BareMetalProviderConfig"
                  zone: "us-central1-f"
                  machineType: "n1-standard-1"
                  image: "ubuntu-1604-lts"
              versions:
                kubelet: 1.14.2
                containerRuntime:
                  name: containerd
                  version: 1.2.0

    • With Kubernetes we manage our applications declaratively
      a. Why not for the cluster itself?
    • With the Cluster API, we can declaratively define the desired cluster state
      a. Operator implementations reconcile the state
      b. Use Spec & Status like the rest of k8s
      c. Common management solutions for e.g. upgrades, autoscaling and repair
      d. Allows for “GitOps” workflows
  26. 34 Choose your runtime & registry
    Docker is the most common runtime, but you could consider using containerd (Graduated) or cri-o (Incubating) instead for a smaller footprint and attack surface. Also, an internal container image registry might be needed. Harbor can set up a scalable registry for you on Kubernetes.
  27. 35 Monitoring the cluster
    Now that the cluster is up and running, let’s start monitoring it. As a good starting point, you can use the prometheus-operator Helm Chart. That gives you a Prometheus instance running in Kubernetes, good preset rules for monitoring (kube-state-metrics), and Grafana dashboards for visualization.
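
Once the prometheus-operator is installed, scrape targets are declared as Kubernetes objects. Below is a sketch of a `ServiceMonitor`, assuming the operator's `monitoring.coreos.com/v1` CRDs are present. The application name, namespaces, port name, and the `release` label that the chart's Prometheus is assumed to select on are all hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    release: prometheus-operator   # assumed to match the Prometheus serviceMonitorSelector
spec:
  # Look for matching Services in this namespace.
  namespaceSelector:
    matchNames:
      - default
  # Scrape every Service carrying this label.
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics                # name of the Service port that exposes /metrics
      interval: 30s
```

The operator watches these objects and regenerates the Prometheus scrape configuration, so adding a new target does not require editing the Prometheus config by hand.
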
  28. 36 Enable Fluent Bit for logging
    In order to store container logs for a long period of time, you need to enable a log forwarder from the container runtime to some kind of logging aggregation service like Elasticsearch. You can use the fluent-bit-kubernetes-logging project as a good starting point for this task. Bonus points for also aggregating the Audit Logs.
  29. 37 Enable cloud/environment extensions
    These are what’s traditionally called Cloud Providers for Kubernetes. They handle Node creation/deletion with the environment, Type=LoadBalancer Services, and other optional features. Anyone can create a so-called Cloud Provider integration for their environment. Example to the right.
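
To make the Type=LoadBalancer part concrete, here is a minimal Service sketch. The names, labels and ports are illustrative. Without a cloud provider integration (or an equivalent controller) in the cluster, such a Service stays without an external address.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-app              # illustrative name
spec:
  # Ask the environment's integration to provision an external load balancer.
  type: LoadBalancer
  selector:
    app: example-app
  ports:
    - name: http
      port: 80                   # port exposed on the load balancer
      targetPort: 8080           # port the application Pods listen on
```

When the integration has done its work, the assigned address shows up under `status.loadBalancer.ingress` of the Service.
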
  30. 38 Set up an Ingress controller
    In order to expose your Services to the outside world, you need some kind of 3rd-party Ingress Controller. Ingress Controllers make your Ingress objects in Kubernetes work. You might want the controller itself to be a Type=LoadBalancer Service. The ones you could look out for are Traefik, Nginx Ingress, and Contour.
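
An Ingress object that such a controller would act on can be sketched as follows. This uses the current `networking.k8s.io/v1` API; at the time of the talk Ingress lived in a beta API group with a slightly different backend syntax. The host name, Service name and ingress class are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
spec:
  ingressClassName: nginx        # assumes an installed controller serving this class
  rules:
    - host: app.example.com      # hypothetical host name
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80
```

The object on its own does nothing. It only takes effect once a controller that serves the named class is running in the cluster.
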
  31. 39 Persistent Storage is key
    Lastly, you most likely need Persistent Storage for many of your applications. Kubernetes supports the Container Storage Interface (CSI) for providers to implement. Rook implements various types of clustered storage in a Kubernetes-native way. Alternatively, you can use your cloud provider’s solution.
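
From an application's point of view, storage is requested through a PersistentVolumeClaim. In the sketch below the StorageClass name `example-csi` is hypothetical and stands for whatever class your CSI driver, Rook, or cloud provider exposes.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data                     # illustrative name
spec:
  accessModes:
    - ReadWriteOnce              # mounted read-write by a single node
  storageClassName: example-csi  # hypothetical class backed by a CSI driver
  resources:
    requests:
      storage: 10Gi
```

A Pod then refers to the claim by name under `volumes[].persistentVolumeClaim.claimName`, which keeps the application manifest independent of the storage backend.
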
  32. 40 Recap
    1. Identify the needs of your business
      a. How much money and effort do you want to put into HA & security?
    2. High Availability != multiple masters
      a. Multiple masters are a requirement for high availability, but they are not sufficient on their own
    3. Pay attention to the certificate identities for your components
      a. And make sure you lock things down well with RBAC, disable unnecessary ports, etc.
    4. Declarative control over your cluster is better than imperative
      a. The Cluster API and the GitOps models are worth checking out
  33. 42 Related resources (in no particular order)
    1. https://5pi.de/2017/12/15/production-grade-kubernetes/
    2. https://youtu.be/PXJu8ujNEmU
    3. https://thenewstack.io/ebooks/kubernetes/state-of-kubernetes-ecosystem/
    4. https://kccncna17.sched.com/event/CU5x/101-ways-to-crash-your-cluster-i-marius-grigoriu-emmanuel-gomez-nordstrom
    5. https://kccncna17.sched.com/event/CU6H/certifik8s-all-you-need-to-know-about-certificates-in-kubernetes-i-alexander-brand-apprenda
    6. https://kccncna17.sched.com/event/CU86/shipping-in-pirate-infested-waters-practical-attack-and-defense-in-kubernetes-a-greg-castle-cj-cullen-google
    7. https://kccncna17.sched.com/event/CU6z/hacking-and-hardening-kubernetes-clusters-by-example-i-brad-geesaman-symantec
    8. https://kccncna17.sched.com/event/CUFK/keynote-kubernetes-at-github-jesse-newland-principal-site-reliability-engineer-github
    9. https://kccncna17.sched.com/event/CU8b/what-happens-when-something-goes-wrong-on-kubernetes-reliability-i-marek-grabowski-tina-zhang-google
    10. https://kccncna17.sched.com/event/CU64/automating-and-testing-production-ready-kubernetes-clusters-in-the-public-cloud-ron-lipke-gannetusa-today-network
    11. https://stripe.com/blog/operating-kubernetes
    12. https://blog.envoyproxy.io/introduction-to-modern-network-load-balancing-and-proxying-a57f6ff80236
    13. https://jvns.ca/blog/2017/10/10/operating-a-kubernetes-network/
    14. https://acotten.com/post/kube17-security
    15. https://applatix.com/making-kubernetes-production-ready/
    16. https://www.aquasec.com/wiki/display/containers/Kubernetes+in+Production
    17. https://www.weave.works/blog/provisioning-lifecycle-production-ready-kubernetes-cluster/
    18. https://www.weave.works/blog/demystifying-production-ready-apps-on-kubernetes-with-carter-morgan
    19. https://www.slideshare.net/gn00023040/all-the-troubles-you-get-into-when-setting-up-a-production-ready-kubernetes-cluster
    20. https://www.slideshare.net/gn00023040/a-million-ways-of-deploying-a-kubernetes-cluster
    21. https://blog.sophaskins.net/blog/misadventures-with-kube-dns/
    22. https://thenewstack.io/kubernetes-high-availability-no-single-point-of-failure/