Pitfalls of Kubernetes Adoption

Kubernetes is an amazing piece of technology, though certainly not without its odd quirks. The team at Zappi got into Kubernetes about 4 years ago and we’ve hit some snags and surprises along the way. This talk lists some of the issues that stood out, which will hopefully give a heads up to anyone looking to get started, or even to those who’ve been using it for a while.

Links:

* Conference: SREcon'20 EMEA
* Program: https://www.usenix.org/srecon/conversations/20emea
* Video: https://youtu.be/0tnIS0Bgjy8

King'ori Maina

October 28, 2020

Transcript

  1. Who Am I?
    • Infrastructure Engineer at Zappi
    • Work on the Infrastructure/Ops Team
    • We manage infrastructure to support the business by making infrastructure boring
    • Support developers by building tooling to make working with infrastructure painless
    • Support the security team to make sure our infrastructure is secure
    King’ori Maina, Infrastructure Engineer/Manager, @itskingori
  2. Meet The Infrastructure Team
    • King’ori Maina, Infrastructure Engineer/Manager, @itskingori
    • Zac Blazic, Infrastructure Engineer, @zacblazic
    • Hadrian Valentine, Infrastructure Engineer, @hadrianvale
    Together our job is finding the balance between reliability, risk and cost.
  3. Some Context
    Why Kubernetes?
    • Learn once, apply everywhere
    • API to hard problems and an “API to a community”
    • Active community with good governance
    • Community is continuously working to make things better
    Our Experience?
    • Running Kubernetes for about 4 years
    • Self-managed clusters on AWS using Kops
    • We got immutable, extensible, scalable and reviewable infrastructure
    • Experience has been pleasant & we’re happy
  4. Migration > Preparation Process
    Four steps:
    • Dockerizing the application: Application Dependencies, System Dependencies, Host Dependencies, CI/CD Pipelines, Developer Tooling, Image Size, Image/Layer Caching, Build-time Secrets
    • Catering to runtime considerations
    • Adapt app to handle cluster non-determinism
    • Making K8s required infrastructure changes
  5. Migration > Preparation Process
    Four steps:
    • Dockerizing the application: Application Dependencies, System Dependencies, Host Dependencies, CI/CD Pipelines, Developer Tooling, Image Size, Image/Layer Caching, Build-time Secrets
    • Catering to runtime considerations: Configuration Management, Container Capabilities, File Permissions, Graceful Termination, Log & Metrics Collection, Resource Usage Profiling, Secrets Management, Signal Handling
    • Adapt app to handle cluster non-determinism
    • Making K8s required infrastructure changes
  6. Migration > Preparation Process
    Four steps:
    • Dockerizing the application: Application Dependencies, System Dependencies, Host Dependencies, CI/CD Pipelines, Developer Tooling, Image Size, Image/Layer Caching, Build-time Secrets
    • Catering to runtime considerations: Configuration Management, Container Capabilities, File Permissions, Graceful Termination, Log & Metrics Collection, Resource Usage Profiling, Secrets Management, Signal Handling
    • Adapt app to handle cluster non-determinism: Autoscaling, Inter-pod Affinity, Node Affinity, Pod Eviction, Pod Priority, Taints & Tolerations, Volume AZ Binding
    • Making K8s required infrastructure changes
  7. Migration > Preparation Process
    Four steps:
    • Dockerizing the application: Application Dependencies, System Dependencies, Host Dependencies, CI/CD Pipelines, Developer Tooling, Image Size, Image/Layer Caching, Build-time Secrets
    • Catering to runtime considerations: Configuration Management, Container Capabilities, File Permissions, Graceful Termination, Log & Metrics Collection, Resource Usage Profiling, Secrets Management, Signal Handling
    • Adapt app to handle cluster non-determinism: Autoscaling, Inter-pod Affinity, Node Affinity, Pod Eviction, Pod Priority, Taints & Tolerations, Volume AZ Binding
    • Making K8s required infrastructure changes: Autoscaling Groups, CDNs & DNS, LBs & Reverse Proxies, SSL Certs & Termination, VPC Endpoints & Subnets, VPNs & WAFs
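Of the runtime considerations listed above, graceful termination and signal handling are the easiest to show in code. Kubernetes sends a container SIGTERM and follows up with SIGKILL once the grace period expires, so the application should stop taking new work when SIGTERM arrives. The following is a minimal sketch, not something from the talk; `process`, `handle_sigterm` and the upper-casing "work" are hypothetical stand-ins.

```python
import signal
import threading

# Set once SIGTERM arrives; the work loop checks it between items.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the termination request instead of exiting mid-task."""
    shutdown_requested.set()


def process(items):
    """Handle items until finished or until shutdown has been requested."""
    finished = []
    for item in items:
        if shutdown_requested.is_set():
            break  # stop taking new work; the rest is left for another pod
        finished.append(item.upper())  # stand-in for real work
    return finished


if __name__ == "__main__":
    # Register the handler in the main thread before starting work.
    signal.signal(signal.SIGTERM, handle_sigterm)
    print(process(["a", "b", "c"]))
```

Checking a flag between units of work keeps each unit atomic, which is usually simpler than trying to interrupt work halfway through.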
  8. Resource Planning > Usage
    • Bin-packing results in some wastage if there’s no perfect fit: not enough space is left, so we keep spinning up new nodes for new workloads
    • Resources to schedule against don’t always have the same ratio: CPU & memory resources are often not equally utilized
    • Actual usage varies between request and limits: workload resource usage is a continuously moving target
    Diagram legend: Workload Burstable (Limits), Workload Utilization (Requests), Node Capacity (Allocatable); 1 box = 1 core (CPU), 1 GB (RAM)
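The bin-packing wastage described on this slide can be illustrated with a small first-fit placement. This is a sketch under stated assumptions: the node size (4 cores, 8 GB) and the pod requests are made-up numbers, and first-fit is a simplification rather than how the Kubernetes scheduler actually scores nodes.

```python
def first_fit(pods, node_cpu, node_mem):
    """Place (cpu, mem) requests on identical nodes, first node that fits wins."""
    nodes = []  # each entry is [used_cpu, used_mem]
    for cpu, mem in pods:
        if cpu > node_cpu or mem > node_mem:
            raise ValueError("pod can never fit on this node size")
        for node in nodes:
            if node[0] + cpu <= node_cpu and node[1] + mem <= node_mem:
                node[0] += cpu
                node[1] += mem
                break
        else:
            # No existing node has room on BOTH dimensions: add a node.
            nodes.append([cpu, mem])
    return nodes


def stranded(nodes, node_cpu, node_mem):
    """Total CPU and memory left unused across all nodes."""
    free_cpu = sum(node_cpu - used_cpu for used_cpu, _ in nodes)
    free_mem = sum(node_mem - used_mem for _, used_mem in nodes)
    return free_cpu, free_mem


# Hypothetical workload: three CPU-heavy pods and one memory-heavy pod.
pods = [(3, 2), (3, 2), (1, 6), (3, 2)]
nodes = first_fit(pods, node_cpu=4, node_mem=8)
print(len(nodes), stranded(nodes, node_cpu=4, node_mem=8))  # 3 (2, 12)
```

The pods request 10 cores and 12 GB in total, yet three nodes (12 cores, 24 GB) are needed, and 12 GB of memory stays stranded because the CPU-heavy pods exhaust cores first.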
  9. Resource Planning > Non-Pod
    Problem
    • Non-Kubernetes components on the host are often overlooked
    • If non-pod components don’t have enough resources you can end up with a very unstable cluster
    Recommendations
    • Set requests with some breathing room
    • Always set limits
    • Don’t set limits too high
    • Increase eviction thresholds
    • Reserve resources for system components
    (Diagram: node capacity)
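Reserving resources for system components shrinks what the scheduler may hand out to pods. Kubernetes computes a node's allocatable amount as capacity minus the kube-reserved amount, the system-reserved amount and the hard eviction threshold. The sketch below applies that subtraction; the 16 GiB node and the reservation sizes are illustrative assumptions, not values from the talk.

```python
def allocatable(capacity, kube_reserved, system_reserved, eviction_hard):
    """Amount of a resource left for pods after node-level reservations."""
    value = capacity - kube_reserved - system_reserved - eviction_hard
    if value <= 0:
        raise ValueError("reservations consume the whole node")
    return value


# Hypothetical 16 GiB node, all values in MiB.
mem_mib = allocatable(
    capacity=16384,
    kube_reserved=1024,    # kubelet, container runtime
    system_reserved=512,   # sshd, journald and other OS daemons
    eviction_hard=100,     # memory kept free before the kubelet evicts pods
)
print(mem_mib)  # 14748
```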
  10. Resource Planning > CPU Throttling
    Problem
    • Kubernetes uses CFS quotas to enforce CPU limits
    • Detrimental to latency critical applications
    • Some Linux kernels have a bug: throttling kicks in before burning through the quota
    Possible Solutions
    • Increase CPU limits that are too low
    • Remove CPU limits entirely
    • Disable CPU CFS quota enforcement for containers that specify CPU limits
    • Adjust CPU CFS quota period value (smaller period, better latency)
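To see why a low CPU limit hurts latency, it helps to work through the quota arithmetic. A CPU limit becomes a CFS quota of limit × period of CPU time per period (the default period is 100 ms); once the quota is spent, the container is throttled until the next period begins. The model below is a deliberately simplified sketch: it assumes a single-threaded task that starts at a period boundary with a full quota, and the function names are hypothetical.

```python
import math


def cfs_quota_us(cpu_limit, period_us=100_000):
    """CPU time (microseconds) a container may use per CFS period."""
    return round(cpu_limit * period_us)


def wall_time_ms(cpu_work_ms, cpu_limit, period_ms=100):
    """Wall-clock time for a single-threaded task needing cpu_work_ms of CPU."""
    # One thread cannot use more than one core, so cap the usable quota.
    quota_ms = min(cpu_limit * period_ms, period_ms)
    # Periods in which the task exhausts its quota and is throttled.
    throttled_periods = math.ceil(cpu_work_ms / quota_ms) - 1
    remaining_ms = cpu_work_ms - throttled_periods * quota_ms
    return throttled_periods * period_ms + remaining_ms


print(cfs_quota_us(0.5))       # 50000
print(wall_time_ms(150, 0.5))  # 250.0
print(wall_time_ms(150, 2))    # 150
```

With a limit of 0.5 CPU, 150 ms of CPU work takes 250 ms of wall time because the task sits throttled for half of each period; raising the limit removes the stall in this model.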
  11. Costs > Production Ready Clusters
    Considerations
    • Cost of the control plane
      ◦ Dependent on choice of distribution
    • High availability clusters
      ◦ Minimum of 3 masters for consensus
      ◦ Multiple AZs for masters & nodes
    • Number of clusters
      ◦ One per environment or one per application?
    • Resource usage of agents on the node
      ◦ Extra agents on the node to add extra functionality e.g. logging, metrics & security agents
    • Extra hardware required that wouldn’t be in a simple setup
      ◦ NATs, ELBs, Bastions, EBS volumes
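The "minimum of 3 masters" figure follows from majority quorum: a consensus group of n members needs floor(n/2) + 1 of them to agree, so it survives n minus that many failures. A short sketch of the arithmetic (the function names are illustrative):

```python
def quorum(members):
    """Smallest majority of a consensus group."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

Three members is the smallest group that tolerates one failure, and a fourth member adds cost without adding fault tolerance, which is why odd sizes are the usual choice.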
  12. Costs > Supporting Services
    • Logging infrastructure is necessary to quickly search through logs of all components & apps
    • Metrics & alerting infrastructure can help anticipate problems & discover bottlenecks
    • Tracing infrastructure is useful in debugging calls across multiple microservices
  13. Costs > Accounting & Allocation
    Complications
    • It’s hard enough estimating usage, what about attributing cost to usage?
    • There’s more to resources than CPU and mem e.g. GPU, data transfer costs, persistent volumes etc.
    • Accounting for out-of-cluster resources e.g. databases (RDS) and object storage (S3) etc.
    • Factoring in dynamic pricing/billing e.g. on-demand, reserved, savings-plans & spot instances
    • What about on-premise Kubernetes clusters?
    .com Real-time cost allocation with a view of all native concepts!
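One way to make the attribution problem concrete is to split a node's hourly price across workloads in proportion to their resource requests and report whatever is left as idle cost. This is only a sketch of one possible scheme, not the method of any particular tool; the price, node size, workload names and the 50/50 CPU-to-memory weighting are all assumptions.

```python
def allocate_cost(node_hourly_cost, node_cpu, node_mem_gb, requests, cpu_weight=0.5):
    """Split a node's cost by each workload's share of requested CPU and memory."""
    shares = {}
    for name, (cpu, mem_gb) in requests.items():
        fraction = (
            cpu_weight * cpu / node_cpu
            + (1 - cpu_weight) * mem_gb / node_mem_gb
        )
        shares[name] = node_hourly_cost * fraction
    # Capacity nobody requested still has to be paid for.
    shares["idle"] = node_hourly_cost - sum(shares.values())
    return shares


# Hypothetical node: 4 cores, 16 GB, 0.40 per hour.
costs = allocate_cost(
    node_hourly_cost=0.40,
    node_cpu=4,
    node_mem_gb=16,
    requests={"api": (2, 4), "worker": (1, 8)},
)
for name, cost in costs.items():
    print(f"{name}: {cost:.3f}")
```

Even this toy version needs a judgment call (how to weight CPU against memory) and ignores GPUs, data transfer, volumes and discounted pricing, which is the slide's point.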
  14. Complexity > Distributed Design
    Complexity is spread across: architecture (distributed systems are hard to design & debug), concepts (new ideas and abstractions) and configuration (bootstrapping & runtime).
  15. Complexity > So Many Addons
    Adding Functionality
    • Autoscaler addons: cluster autoscaler, vertical pod autoscaler & addon-resizer
    • Metrics addons: custom-metrics-adapter, kube-state-metrics & metrics-server
    • Networking & DNS addons: external-dns & service meshes, CNIs
    • Other addons: ingress-nginx controller, aws-load-balancer controller, node-problem-detector, draino & etcd-manager
  16. Security > Insecure by Default
    Tightening Kubernetes Security
    • Enable RBAC authorization method
    • Enable encryption for secret data at rest
    • Enable audit logging
    • Disable public access of the control plane via the Internet and apiserver basic auth
    • Disable kubelet read-only port & anonymous authentication
    • Disable auto-mounting of the default service account token on service accounts
    • Drop support for insecure TLS cipher suites on kubelet, kube-apiserver and kube-controller-manager
    In-Built Tools At Your Disposal
    • Use RBAC permissions
    • Use Network Policies
    • Use PSPs and have these as defaults:
      ◦ Disallow privileged containers & privilege escalation
      ◦ Disallow usage of hostPath volume mounts
      ◦ Disallow sharing of the host process ID namespace, IPC namespace & network stack (i.e. access to loopback, localhost, snooping on network traffic on the local node)
      ◦ Disallow pods that run as root
      ◦ Disallow use of NET_RAW or ALL capabilities
      ◦ Whitelist allowable volume types
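The PSP defaults listed above map onto specific pod spec fields (`hostPID`, `hostIPC`, `hostNetwork`, `hostPath` volumes and each container's `securityContext`). As a rough sketch of what such a policy checks, the function below inspects a pod spec given as a plain dictionary. It is an illustration only: `violations` and the sample pod are hypothetical, the volume-type whitelist is left out, and only explicitly added capabilities are checked.

```python
def violations(pod_spec):
    """List the ways a pod spec breaks the restrictive defaults."""
    found = []
    for flag in ("hostPID", "hostIPC", "hostNetwork"):
        if pod_spec.get(flag):
            found.append(f"{flag} is enabled")
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            found.append(f"volume {volume.get('name')} uses hostPath")
    pod_non_root = pod_spec.get("securityContext", {}).get("runAsNonRoot", False)
    for container in pod_spec.get("containers", []):
        name = container.get("name")
        context = container.get("securityContext", {})
        if context.get("privileged"):
            found.append(f"{name}: privileged container")
        # Treat an unset value as allowed, so it must be disabled explicitly.
        if context.get("allowPrivilegeEscalation", True):
            found.append(f"{name}: privilege escalation allowed")
        # The container-level setting overrides the pod-level one.
        if not context.get("runAsNonRoot", pod_non_root):
            found.append(f"{name}: may run as root")
        for capability in context.get("capabilities", {}).get("add", []):
            if capability in ("NET_RAW", "ALL"):
                found.append(f"{name}: adds capability {capability}")
    return found


# Hypothetical pod that breaks most of the defaults.
risky_pod = {
    "hostNetwork": True,
    "volumes": [{"name": "docker", "hostPath": {"path": "/var/run"}}],
    "containers": [{
        "name": "app",
        "securityContext": {
            "privileged": True,
            "capabilities": {"add": ["NET_RAW"]},
        },
    }],
}
for problem in violations(risky_pod):
    print(problem)
```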
  17. Security > Tooling Paradigm Shift
    Example Challenges for Security Teams
    • Larger attack surface area with containers is a challenge for compliance
    • Network vulnerability scanning is no longer as effective
    • Pluggable authentication modules that allowed 2FA are gone, and it is hard to identify users who get exec sessions into containers
    • Confusion over network events because existing tooling relies on fixed IP addresses
    • Carrying out forensic work on short-lived containers and nodes
    • Expensive Intrusion Detection (IDS), Intrusion Prevention (IPS) & Anti-Malware systems
    (Diagram labels: Intrusion Detection Systems, Vulnerability Static Analysis, Intrusion Prevention Systems, Anti-Malware)
  18. Wrapping Up …
    • Kubernetes is great, but is likely overkill for small teams and applications
    • Consider containerisation as a first step, along with simpler orchestrators e.g. AWS ECS
    • Moving to Kubernetes will probably need a dedicated team to manage the clusters
    • Continuously evaluate security, it shouldn’t be an after-thought
    • Choose Kubernetes only if the benefits outweigh the costs
    • Every choice you make has pros & cons, it all boils down to trade-offs, pick your poison
    • Keep the main thing, the main thing … as an SRE, the goal is to find that balance between reliability, risk and cost