Pitfalls of Kubernetes Adoption

Kubernetes is an amazing piece of technology, though certainly not without its odd quirks. The team at Zappi got into Kubernetes about 4 years ago, and we’ve hit some snags and surprises along the way. This talk lists some of the issues that stood out, which will hopefully give a heads-up to anyone looking to get started or even those who’ve been using it for a while.

Links:

* Conference: SREcon'20 EMEA
* Program: https://www.usenix.org/srecon/conversations/20emea
* Video: https://youtu.be/0tnIS0Bgjy8

King'ori Maina

October 28, 2020

Transcript

  1. Pitfalls of
    Kubernetes
    Adoption
    SREcon EMEA 2020
    King’ori Maina
    “King”

  2. Who Am I?
    ● Infrastructure Engineer at Zappi
    ● Work on the Infrastructure/Ops Team
    ● We manage infrastructure to support
    the business by making infrastructure
    boring
    ● Support developers by building tooling
    to make working with infrastructure
    painless
    ● Support the security team to make
    sure our infrastructure is secure
    King’ori Maina
    Infrastructure Engineer/Manager
    @itskingori

  3. Meet The Infrastructure Team
    King’ori Maina
    Infrastructure Engineer/Manager
    @itskingori
    Zac Blazic
    Infrastructure Engineer
    @zacblazic
    Hadrian Valentine
    Infrastructure Engineer
    @hadrianvale
    Together our job is finding the balance between
    reliability, risk and cost.

  4. Some Context
    Why Kubernetes?
    ● Learn once, apply everywhere
    ● API to hard problems and an “API to a
    community”
    ● Active community with good governance
    ● Community is continuously working to make
    things better
    Our Experience?
    ● Running Kubernetes for about 4 years
    ● Self-managed clusters on AWS using Kops
    ● We got immutable, extensible, scalable and
    reviewable infrastructure
    ● Experience has been pleasant & we’re happy

  5. 1.
    Migration is Not a
    Walk in the Park

  6. Migration > Preparation Process
    ● Dockerizing the application
    ○ Application Dependencies
    ○ System Dependencies
    ○ Host Dependencies
    ○ CI/CD Pipelines
    ○ Developer Tooling
    ○ Image Size
    ○ Image/Layer Caching
    ○ Build-time Secrets
    ● Adapt app to handle cluster non-determinism
    ● Making K8s required infrastructure changes
    ● Catering to runtime considerations

  7. Migration > Preparation Process
    ● Dockerizing the application
    ○ Application Dependencies
    ○ System Dependencies
    ○ Host Dependencies
    ○ CI/CD Pipelines
    ○ Developer Tooling
    ○ Image Size
    ○ Image/Layer Caching
    ○ Build-time Secrets
    ● Adapt app to handle cluster non-determinism
    ● Making K8s required infrastructure changes
    ● Catering to runtime considerations
    ○ Configuration Management
    ○ Container Capabilities
    ○ File Permissions
    ○ Graceful Termination
    ○ Log & Metrics Collection
    ○ Resource Usage Profiling
    ○ Secrets Management
    ○ Signal Handling

  8. Migration > Preparation Process
    ● Dockerizing the application
    ○ Application Dependencies
    ○ System Dependencies
    ○ Host Dependencies
    ○ CI/CD Pipelines
    ○ Developer Tooling
    ○ Image Size
    ○ Image/Layer Caching
    ○ Build-time Secrets
    ● Adapt app to handle cluster non-determinism
    ○ Autoscaling
    ○ Inter-pod Affinity
    ○ Node Affinity
    ○ Pod Eviction
    ○ Pod Priority
    ○ Taints & Tolerations
    ○ Volume AZ Binding
    ● Making K8s required infrastructure changes
    ● Catering to runtime considerations
    ○ Configuration Management
    ○ Container Capabilities
    ○ File Permissions
    ○ Graceful Termination
    ○ Log & Metrics Collection
    ○ Resource Usage Profiling
    ○ Secrets Management
    ○ Signal Handling

  9. Migration > Preparation Process
    ● Dockerizing the application
    ○ Application Dependencies
    ○ System Dependencies
    ○ Host Dependencies
    ○ CI/CD Pipelines
    ○ Developer Tooling
    ○ Image Size
    ○ Image/Layer Caching
    ○ Build-time Secrets
    ● Adapt app to handle cluster non-determinism
    ○ Autoscaling
    ○ Inter-pod Affinity
    ○ Node Affinity
    ○ Pod Eviction
    ○ Pod Priority
    ○ Taints & Tolerations
    ○ Volume AZ Binding
    ● Making K8s required infrastructure changes
    ○ Autoscaling Groups
    ○ CDNs & DNS
    ○ LBs & Reverse Proxies
    ○ SSL Certs & Termination
    ○ VPC Endpoints & Subnets
    ○ VPNs & WAFs
    ● Catering to runtime considerations
    ○ Configuration Management
    ○ Container Capabilities
    ○ File Permissions
    ○ Graceful Termination
    ○ Log & Metrics Collection
    ○ Resource Usage Profiling
    ○ Secrets Management
    ○ Signal Handling
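
Graceful termination and signal handling are among the runtime considerations listed on this slide. Kubernetes sends a container SIGTERM and follows up with SIGKILL once the grace period expires, so an app usually needs to stop taking new work when SIGTERM arrives. The following is a minimal sketch of that idea; the names `drain`, `install_handlers` and `shutdown_requested` are hypothetical and not from the talk.

```python
import signal
import threading

# Set once Kubernetes (or anything else) asks the process to stop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the request; the work loop decides when it is safe to exit."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler; call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain(items, process):
    """Process items in order, stopping before the next one after SIGTERM."""
    done = []
    for item in items:
        if shutdown_requested.is_set():
            break  # in-flight work has finished; take no new work
        done.append(process(item))
    return done
```

A real service would call `install_handlers()` at startup and make sure the remaining work fits inside the pod's termination grace period.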

  10. 2.
    Resource Planning
    is Harder

  11. Resource Planning > Usage
    ● Bin-packing results in some wastage if there’s
    no perfect fit: not enough space is left, so we
    keep spinning up new nodes for new workloads
    ● Resources to schedule against don’t always
    have the same ratio: CPU & memory resources
    are often not equally utilized
    ● Actual usage varies between requests and
    limits: workload resource usage is a
    continuously moving target
    Legend: Workload Burstable (Limits), Workload
    Utilization (Requests), Node Capacity (Allocatable);
    1 box = 1 core (CPU), 1 GB (RAM)
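
The bin-packing wastage described on this slide can be illustrated with a small first-fit simulation. This is only a sketch: the real scheduler uses more elaborate filtering and scoring, and the node size and pod requests below are made-up numbers.

```python
def first_fit(pods, node_cpu, node_mem):
    """Place (cpu, mem) requests first-fit; return free [cpu, mem] per node."""
    free = []  # one [cpu_free, mem_free] entry per node that was started
    for cpu, mem in pods:
        if cpu > node_cpu or mem > node_mem:
            raise ValueError("request is larger than an empty node")
        for slot in free:
            if slot[0] >= cpu and slot[1] >= mem:
                slot[0] -= cpu
                slot[1] -= mem
                break
        else:
            # No existing node has room in BOTH dimensions: start a new one.
            free.append([node_cpu - cpu, node_mem - mem])
    return free


# Hypothetical 4-core / 8 GB nodes and three pod requests.
leftover = first_fit([(3, 2), (3, 2), (1, 6)], node_cpu=4, node_mem=8)
print(leftover)  # [[0, 0], [1, 6]]
```

The second node still has 6 GB free but only 1 core, so a pod requesting 2 cores and 1 GB would trigger a third node even though plenty of memory is idle.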

  12. Resource Planning > Non-Pod
    Problem
    ● Non-Kubernetes components on the
    host are often overlooked
    ● If non-pod components don’t have
    enough resources you can end up with a
    very unstable cluster
    Recommendations
    ● Set requests with some breathing room
    ● Always set limits
    ● Don’t set limits too high
    ● Increase eviction thresholds
    ● Reserve resources for system components
    Node capacity
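
As a rough sketch of what reserving resources means for scheduling, the space left for pods is node capacity minus what is set aside for Kubernetes daemons, for other system daemons, and for the hard eviction threshold (on the kubelet these correspond to the `--kube-reserved`, `--system-reserved` and `--eviction-hard` settings). The figures below are illustrative assumptions, not recommendations from the talk.

```python
def allocatable(capacity, kube_reserved, system_reserved, eviction_threshold):
    """Return what is left for pods, in the same unit as the inputs."""
    value = capacity - kube_reserved - system_reserved - eviction_threshold
    if value <= 0:
        raise ValueError("reservations leave nothing for pods")
    return value


# Hypothetical 8 GiB node, expressed in MiB.
print(allocatable(8192, kube_reserved=512, system_reserved=512,
                  eviction_threshold=100))  # 7068
```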

  13. Resource Planning > CPU Throttling
    Problem
    ● Kubernetes uses CFS quotas to enforce CPU
    limits
    ● Detrimental to latency critical applications
    ● Some Linux Kernels have a bug; throttling
    kicks in before burning through the quota
    Possible Solutions
    ● Increase CPU limits that are too low
    ● Remove CPU limits entirely
    ● Disable CPU CFS quota enforcement for
    containers that specify CPU limits
    ● Adjust CPU CFS quota period value
    (smaller period better latency)
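
To see why CFS quotas hurt latency, here is a simplified model: a CPU limit becomes a quota of CPU time per period (100 ms by default), and a task that uses up its quota is paused until the next period starts. The sketch assumes a single-threaded task that begins at the start of a period with a full quota; it ignores the kernel bug mentioned above, and the function names are hypothetical.

```python
def cfs_quota_us(limit_millicores, period_us=100_000):
    """CPU time (in microseconds) the container may use per period."""
    return limit_millicores * period_us // 1000


def wall_clock_us(cpu_work_us, limit_millicores, period_us=100_000):
    """Elapsed time for a burst needing cpu_work_us of CPU time."""
    quota = cfs_quota_us(limit_millicores, period_us)
    if quota <= 0:
        raise ValueError("CPU limit too small for this period")
    # Every period except the last one is used up to the quota, then throttled.
    throttled_periods = max(-(-cpu_work_us // quota) - 1, 0)
    return throttled_periods * period_us + (cpu_work_us - throttled_periods * quota)


# A request needing 50 ms of CPU under a 200m limit takes 210 ms of wall time.
print(wall_clock_us(50_000, limit_millicores=200))  # 210000
```

In this model, the longest pause within a period is the period minus the quota, which is 80 ms for a 200m limit at the default period. That is one way to read the point that a smaller period gives better latency.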

  14. 3.
    Costs Will Likely
    Definitely Go Up

  15. Costs > Production Ready Clusters
    Considerations
    ● Cost of the control plane
    ○ Dependent on choice of distribution
    ● High availability clusters
    ○ Minimum of 3 masters for consensus
    ○ Multiple AZs for masters & nodes
    ● Number of clusters
    ○ One per environment or one per application?
    ● Resource usage of agents on the node
    ○ Extra agents on the node to add extra
    functionality e.g. logging, metrics & security
    agents
    ● Extra hardware required that wouldn’t be in a
    simple setup
    ○ NATs, ELBs, Bastions, EBS volumes
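
The "minimum of 3 masters" point follows from majority-based consensus, which is what etcd uses: the cluster needs more than half of its members to agree, so one or two members tolerate no failures. A short sketch of the arithmetic:

```python
def quorum(members):
    """Smallest majority of a cluster with the given member count."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members):
    """Members that can be lost while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
```

Going from 3 to 4 members does not raise the number of tolerated failures, which is why odd counts are the usual choice.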

  16. Costs > Supporting Services
    ● Logging infrastructure is necessary to quickly
    search through logs of all components & apps
    ● Metrics & alerting infrastructure can help
    anticipate problems & discover bottlenecks
    ● Tracing infrastructure is useful in debugging
    calls across multiple microservices

  17. Costs > Accounting & Allocation
    Complications
    ● It’s hard enough estimating usage, what
    about attributing cost to usage?
    ● There’s more to resources than CPU and mem
    e.g. GPU, data transfer costs, persistent
    volumes etc.
    ● Accounting for out-of-cluster resources e.g.
    databases (RDS) and object storage (S3) etc.
    ● Factoring in dynamic pricing/billing e.g.
    on-demand, reserved, savings-plans & spot
    instances
    ● What about on-premise Kubernetes
    clusters?
    Real-time cost allocation with a view of all native concepts!
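
To make the attribution problem concrete, here is one naive way to split a node's cost across workloads by their CPU and memory requests. It is an illustrative sketch only, not how any particular cost tool works; the 50/50 CPU-to-memory weighting, the prices and the workload names are all assumptions.

```python
def allocate_node_cost(node_cost, node_cpu, node_mem_gb, requests, cpu_weight=0.5):
    """Split node_cost by requests; requests maps name -> (cpu, mem_gb)."""
    cpu_rate = node_cost * cpu_weight / node_cpu          # cost per core
    mem_rate = node_cost * (1 - cpu_weight) / node_mem_gb  # cost per GB
    costs = {
        name: cpu * cpu_rate + mem * mem_rate
        for name, (cpu, mem) in requests.items()
    }
    # Whatever nobody requested still has to be paid for by someone.
    costs["__idle__"] = node_cost - sum(costs.values())
    return costs


# Hypothetical node: 0.40 per hour, 4 cores, 16 GB.
print(allocate_node_cost(0.40, 4, 16, {"web": (2, 4), "worker": (1, 8)}))
```

Even this toy model leaves open who pays for the idle share, and it ignores the GPU, data transfer, storage and out-of-cluster costs listed above.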

  18. 4.
    Complexity for the
    Operator Multiplies

  19. Complexity > Distributed Design
    Complexity is spread across: architecture (distributed systems are hard to
    design & debug), concepts (new ideas and abstractions) and configuration
    (bootstrapping & runtime).

  20. Complexity > So Many Addons
    Adding Functionality
    ● Autoscaler addons – cluster autoscaler,
    vertical pod autoscaler & addon-resizer
    ● Metrics addons – custom-metrics-adapter,
    kube-state-metrics & metrics-server
    ● Networking & DNS addons – external-dns &
    service meshes, CNIs
    ● Other addons – ingress-nginx controller,
    aws-load-balancer controller, node-problem-detector,
    draino & etcd-manager

  21. 5.
    Security Tools
    Need To Evolve

  22. Security > Insecure by Default
    Tightening Kubernetes Security
    ● Enable RBAC authorization method
    ● Enable encryption for secret data at rest
    ● Enable audit logging
    ● Disable public access of the control plane via
    the Internet and apiserver basic auth
    ● Disable kubelet read-only port & anonymous
    authentication
    ● Disable auto-mounting of the default service
    account token on service accounts
    ● Drop support for insecure TLS cipher suites on
    kubelet, kube-apiserver and
    kube-controller-manager
    In-Built Tools At Your Disposal
    ● Use RBAC permissions
    ● Use Network Policies
    ● Use PSPs and have these as defaults:
    ○ Disallow privileged containers & privilege
    escalation
    ○ Disallow usage of hostPath volume mounts
    ○ Disallow sharing of the host process ID
    namespace, IPC namespace & network stack
    (i.e. access to loopback, localhost, snooping
    on network traffic on the local node)
    ○ Disallow pods that run as root
    ○ Disallow use of NET_RAW or ALL capabilities
    ○ Whitelist allowable volume types
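
The defaults listed above can be thought of as checks against a pod spec. The sketch below applies a few of them to a pod spec given as a plain dictionary using the standard field names (`hostPID`, `securityContext.privileged`, and so on). It illustrates the rules only and is not how PSP admission is implemented; the function name is hypothetical. It also assumes the container runtime grants NET_RAW unless it is dropped.

```python
def psp_violations(pod_spec):
    """Return human-readable violations of the defaults listed above."""
    violations = []
    for flag in ("hostPID", "hostIPC", "hostNetwork"):
        if pod_spec.get(flag):
            violations.append(f"{flag} is enabled")
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            violations.append(f"volume {volume.get('name')} uses hostPath")
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name")
        sc = container.get("securityContext", {})
        if sc.get("privileged"):
            violations.append(f"container {name} is privileged")
        # Escalation is allowed unless it is explicitly switched off.
        if sc.get("allowPrivilegeEscalation", True):
            violations.append(f"container {name} allows privilege escalation")
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            violations.append(f"container {name} may run as root")
        caps = sc.get("capabilities", {})
        added = set(caps.get("add", []))
        dropped = set(caps.get("drop", []))
        # Assumption: NET_RAW is granted by default unless dropped.
        if added & {"NET_RAW", "ALL"} or not dropped & {"NET_RAW", "ALL"}:
            violations.append(f"container {name} may use NET_RAW")
    return violations
```

A pod that sets `allowPrivilegeEscalation: false`, `runAsNonRoot: true` and drops `ALL` capabilities passes these checks, while a bare container with no security context fails three of them.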

  23. Security > Tooling Paradigm Shift
    Example Challenges for Security Teams
    ● Larger attack surface area with containers is a
    challenge for compliance
    ● Network vulnerability scanning no longer as
    effective
    ● Pluggable authentication modules that allowed
    2FA are gone, and identifying users who get exec
    sessions into containers is harder
    ● Confusion on network events due to reliance on
    fixed IP addresses
    ● Carrying out forensic work on short-lived
    containers and nodes
    ● Expensive Intrusion Detection (IDS), Intrusion
    Prevention (IPS) & Anti-Malware systems
    Intrusion Detection Systems
    Vulnerability Static Analysis
    Intrusion Prevention Systems
    Anti-Malware

  24. Wrapping Up …
    ● Kubernetes is great, but is likely overkill for small teams and applications
    ● Consider containerisation as a first step, with simpler orchestrators e.g. AWS ECS
    ● Moving to Kubernetes will probably need a dedicated team to manage the clusters
    ● Continuously evaluate security; it shouldn’t be an afterthought
    ● Choose Kubernetes only if the benefits outweigh the costs
    ● Every choice you make has pros & cons, it all boils down to trade-offs, pick your poison
    ● Keep the main thing, the main thing … as an SRE, the goal is to find that balance between
    reliability, risk and cost

  25. Thank
    you!