Moving to Kubernetes: the Bad and the Ugly

Maxime
June 26, 2019

Over the past two years we've deployed a new Kubernetes-based platform at XING on which to develop, test, and deploy XING's applications.
Today most of our workloads run on this platform, but the road sometimes felt a bit bumpy.

From misconfigurations to software bugs in the Kubernetes ecosystem, and even kernel race conditions, we encountered a variety of problems.
In this talk we will discuss some of those issues and how we fixed, mitigated, or worked around them.

Transcript

  1. Some numbers • 11 clusters • 500 worker nodes • 30k Pods • 230 productive applications
  2. Bad results • Major bug in NGINX • One major and a few minor ones in NGINX Ingress Controller
  3. Ingress Controller shutdown: allow more than the default 30 seconds

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: nginx-ingress-controller
     spec:
       template:
         spec:
           terminationGracePeriodSeconds: 60
  4. [Diagram] Pod deletion: Client → API Server → Kubelet → Pods; in parallel, the Endpoint Controller updates the Endpoints and the NGINX Ingress Controller generates a new config & sends a "reload" signal to NGINX
  5. preStop hook: give ingress-nginx time to remove the upstream

     apiVersion: v1
     kind: Pod
     spec:
       containers:
       - name: lifecycle-demo-container
         lifecycle:
           preStop:
             exec:
               command: ["/bin/sh", "-c", "sleep 4"]
  6. A few additional words • Recent releases trigger fewer configuration reloads • At XING we still run 0.10 • Taking a step back to look for alternatives
  7. [Diagram] SNAT: Pod 172.16.0.2 on Node-1 (10.0.0.1) talks to an external server, 192.168.0.100. Before SNAT (1): SRC 172.16.0.2, DST 192.168.0.100; after SNAT (2), leaving the network interface: SRC 10.0.0.1, DST 192.168.0.100
  8. Collecting clues • Response time jumps of 1s or 3s • TCP SYN packets disappearing within the host • NAT statistics show insert_failed > 0
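     One possible way to spot this fleet-wide is a Prometheus alert on the conntrack statistics; this sketch assumes a node_exporter version whose conntrack collector exposes node_nf_conntrack_stat_insert_failed:

       groups:
       - name: conntrack
         rules:
         - alert: ConntrackInsertFailed
           expr: rate(node_nf_conntrack_stat_insert_failed[5m]) > 0   # this counter should stay flat
           for: 10m
           labels:
             severity: warning
           annotations:
             summary: "conntrack insertions failing on {{ $labels.instance }}"

     On a single node, the same counters can be read directly with conntrack -S.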
  9. Conntrack • The kernel tracks connections in a (conntrack) table • A race can cause some insertions to conflict
  10. Race condition • When multiple containers on the same host, at the same moment, connect to the same address outside the cluster, one connection might be delayed by 1s or more
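     For the DNS flavor of this race, a widely used mitigation is the glibc resolver option single-request-reopen, which can be set per Pod via dnsConfig; a minimal sketch (Pod name and image are illustrative):

       apiVersion: v1
       kind: Pod
       metadata:
         name: dns-race-workaround          # hypothetical name
       spec:
         dnsConfig:
           options:
           - name: single-request-reopen    # glibc retries the parallel A/AAAA lookup on a new socket (new source port)
         containers:
         - name: app
           image: busybox
           command: ["sleep", "3600"]

     This only sidesteps the DNS case; connections to other external services can still collide in the conntrack table.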
  11. Timezones • Not supported • SIG Apps + Architecture against it for now • "Write your own controller" (see #47202)
  12. Frozen CronJob • After a CronJob has missed 100 runs in a row, it is ignored by Kubernetes, forever
  13. Prevent out-of-schedule runs • "CronJobs might be executed twice or not at all" • Set startingDeadlineSeconds on your CronJobs
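     A minimal CronJob sketch with this setting (name, schedule, and deadline value are illustrative):

       apiVersion: batch/v1beta1        # batch/v1 on Kubernetes 1.21+
       kind: CronJob
       metadata:
         name: nightly-cleanup          # hypothetical name
       spec:
         schedule: "0 3 * * *"
         startingDeadlineSeconds: 300   # a run that cannot start within 5 minutes is skipped
         concurrencyPolicy: Forbid      # never start a run while the previous one is active
         jobTemplate:
           spec:
             template:
               spec:
                 restartPolicy: OnFailure
                 containers:
                 - name: cleanup
                   image: busybox
                   command: ["/bin/sh", "-c", "echo cleaning up"]

     As a side effect, startingDeadlineSeconds also bounds the window in which the controller counts missed schedules, so the 100-missed-runs freeze from slide 12 cannot accumulate.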
  14. Bugs • Daemon randomly freezing • Container inspection hanging • Containers stuck, can't be stopped or killed • Docker shims crashing
  15. Pod Lifecycle Event Generator • Kubelet lists the containers every second • When a container's state changes, it inspects the container and generates an event • Deemed not healthy if this operation takes > 3 minutes
  16. Origin of flappiness • Permanently broken containers are inspected twice • Inspection timing out = 2 × 2 min delay = NotReady • Successful container listing = Ready
  17. Origin of timeouts • Caused by stuck inspection commands on: • Containers with no processes • Containers with only zombies
  18. Container killer • Runs on every node • Searches for containers that can't be inspected • kill -9 the Docker shims • Provides metrics about suspicious Pods
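     A minimal sketch of how such a node-level watchdog could be deployed as a DaemonSet, so one copy runs on every node (image and namespace are hypothetical, not the actual XING tool):

       apiVersion: apps/v1
       kind: DaemonSet
       metadata:
         name: container-killer
         namespace: kube-system
       spec:
         selector:
           matchLabels:
             app: container-killer
         template:
           metadata:
             labels:
               app: container-killer
           spec:
             hostPID: true                                    # see the host's process tree to signal shims
             containers:
             - name: container-killer
               image: registry.example.com/container-killer   # hypothetical image
               securityContext:
                 privileged: true                             # required to kill -9 host processes
               volumeMounts:
               - name: docker-sock
                 mountPath: /var/run/docker.sock              # inspect containers via the Docker API
             volumes:
             - name: docker-sock
               hostPath:
                 path: /var/run/docker.sock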
  19. There is more to talk about • cgroup hierarchy and OOM killers • side effects of using IPVS • process trees with shared PIDs • some production horror stories
  20. We had this case where our own controller got stuck • QuotaController • DaemonSetController • CronJob garbage collection