[2018.12 Meetup] [TALK #2] Vitali Henne - How to shoot yourself in the foot using kubernetes

One of the six lightning talks at the December 2018 DevOps Lisbon Meetup.

DevOps Lisbon

December 17, 2018

Transcript

  1. How to shoot yourself in the foot using kubernetes
     We’ve just opened an office in Lisbon. Why, yes, we are hiring!
  2. k8s and freiheit. A love story?
     We’ve been using k8s since 2015 - just before it turned 1.0. Right now, we have 655 pods and 242 deployments. Of all the technological choices we made - this is the one I regret the least. :-)
  3. zero downtime deployments
     Readiness probes:
     - allow k8s to check the health of a service
     - are taken into account during rolling updates:
       - old services will only shut down once the freshly deployed services are healthy
       - new services will only receive traffic once they are healthy
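     A minimal sketch of how this looks in a manifest (the image, port and /health path are placeholders, not from the talk) - the rolling update only proceeds as fast as new pods pass their readiness probe:

       apiVersion: apps/v1
       kind: Deployment
       metadata:
         name: example-service
       spec:
         replicas: 3
         selector:
           matchLabels:
             app: example-service
         strategy:
           type: RollingUpdate
           rollingUpdate:
             maxUnavailable: 0   # old pods stay up until replacements are ready
             maxSurge: 1
         template:
           metadata:
             labels:
               app: example-service
           spec:
             containers:
             - name: app
               image: example/service:1.0
               ports:
               - containerPort: 8080
               readinessProbe:        # gates both traffic and the rollout
                 httpGet:
                   path: /health
                   port: 8080
                 periodSeconds: 5
                 failureThreshold: 3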
  4. zero downtime deployments
     If the probe fails for an instance, that instance won’t receive traffic from the load balancer.
     - Cool during the deployment.
     - Cool for temporary glitches (I can’t answer right now, but I’m sure I’ll be fine in a bit).
     - Not so cool for real problems - your probes might just have created a downtime, because none of your instances is receiving traffic any more.
  5. CPU limits
     - allow k8s to schedule pods on nodes so that they are not overcommitted
     - processes will be throttled if they use too much CPU
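     A sketch of what that looks like on a container (the numbers are placeholders): requests are what the scheduler reserves, limits are where throttling (CPU) or an OOM kill (memory) kicks in:

       apiVersion: v1
       kind: Pod
       metadata:
         name: limited-pod
       spec:
         containers:
         - name: app
           image: example/service:1.0
           resources:
             requests:
               cpu: "250m"       # used for scheduling decisions
               memory: "256Mi"
             limits:
               cpu: "500m"       # usage above this is throttled
               memory: "512Mi"   # usage above this gets the container killed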
  6. Liveness Probes + CPU limits
     CPU limits: what happens when they are violated? -> k8s/docker throttles your service.
     What do services often need when they start up? -> a lot of CPU.
     1. Throttled at startup
     2. Cannot answer the liveness probe
     3. Gets killed
     4. Go to step 1
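     One hedged way to break that loop (the numbers are illustrative, not from the talk) is to give the liveness probe enough slack for a throttled startup:

       # under spec.containers[0] of the Deployment sketched above
       livenessProbe:
         httpGet:
           path: /health
           port: 8080
         initialDelaySeconds: 120   # leave room for a slow, CPU-throttled startup
         periodSeconds: 10
         timeoutSeconds: 5
         failureThreshold: 6        # roughly another minute of grace before a restart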
  7. Shared Resources
     Problems:
     - Nodes are overcommitted
     - Services are stealing resources from each other
     Solution:
     - Just set limits!
     New problems:
     - How to set limits?
  8. Shared Resources
     Need:
     - Good monitoring and some load tests
     - For JVM services: these need limits as well. Either use the new “Container aware” flags (Java 1.8: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap, or Java 1.10: -XX:+UseContainerSupport) or set the heap size manually - and don’t forget to account for off-heap memory (classes, threads, buffers, Jetty).
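     A sketch of wiring that into a pod spec (JAVA_TOOL_OPTIONS and the memory numbers are assumptions, not from the talk):

       containers:
       - name: jvm-service
         image: example/jvm-service:1.0
         resources:
           limits:
             memory: "1Gi"
         env:
         - name: JAVA_TOOL_OPTIONS
           # Java 8u131+; on Java 10+ use -XX:+UseContainerSupport instead
           value: "-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap"
           # manual alternative: value: "-Xmx768m", leaving headroom below the 1Gi
           # limit for off-heap memory (classes, threads, buffers, Jetty)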
  9. Things fail/pods die
     Solution: readiness probes using health endpoints.
     More problems: services know best, right? Okay, I shouldn’t depend on other services, but at least the DB should be available?
     -> 3 replicas, 200 services, 1 check per second: 600 requests/sec just to answer health?
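     The arithmetic (3 replicas x 200 services x 1 check/s = 600 requests/s hitting the DB just for health) is driven largely by probe frequency; one hedged mitigation, with made-up numbers, is simply probing less often:

       readinessProbe:
         httpGet:
           path: /health
           port: 8080
         periodSeconds: 10   # 600 pods x 0.1 checks/s = 60 req/s instead of 600
         timeoutSeconds: 2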
  10. Things fail/pods die
      Scenario 2: your services need about 5 minutes to start up (no problem, right, you’ve got rolling updates). The health check only depends on the availability of your DB (ok, the driver is complicated... better restart the service). You have a weird network glitch and your DB is not available for 30 secs.
      -> All your services are being killed and it takes them 5 minutes to come back up.
      -> You’ve just earned 5 minutes of downtime. (Fail fast... but not too fast.)
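      One hedged way to encode “fail fast... but not too fast” (paths and numbers are placeholders): keep the DB out of the liveness check, so a 30-second DB glitch only takes pods out of the load balancer instead of restarting them:

        livenessProbe:             # process health only, no external dependencies
          httpGet:
            path: /health/live
            port: 8080
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:            # may include the DB; a failure only removes the
          httpGet:                 # pod from the load balancer, it is not restarted
            path: /health/ready
            port: 8080
          periodSeconds: 10
          failureThreshold: 3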
  11. Multiple processes per container
      Example: about a year ago alpine had bad DNS - the container needed to run its own DNS server. Now you have two processes and only one health probe, which probably only checks one of them. If the other fails, everything will fail.
      Can fix that! Just use something to supervise these processes. Turns out, now you have a third problem.
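      A hedged alternative to supervising two processes inside one container is to run them as two containers in one pod, each with its own probe (images and ports are placeholders):

        apiVersion: v1
        kind: Pod
        metadata:
          name: app-with-dns-sidecar
        spec:
          containers:
          - name: app
            image: example/service:1.0
            livenessProbe:
              httpGet:
                path: /health
                port: 8080
          - name: dns-cache            # second process as a sidecar, with its own probe
            image: example/dnsmasq:1.0
            livenessProbe:
              tcpSocket:
                port: 53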
  12. Multiple processes per container
      Example: scheduling tasks... I’ll just add cron to the image!
      - Needs to be monitored in principle... but it’s battle proven... guess it’s fine.
      - Logging? Yes, we write JSON to STDOUT. Cron? I’ll just truncate and prefix every line...
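      A hedged alternative to baking cron into the image is a Kubernetes CronJob, which gets its own pod, logs and monitoring (schedule, image and command are placeholders):

        apiVersion: batch/v1beta1    # batch/v1 on newer clusters
        kind: CronJob
        metadata:
          name: nightly-task
        spec:
          schedule: "0 3 * * *"
          jobTemplate:
            spec:
              template:
                spec:
                  restartPolicy: OnFailure
                  containers:
                  - name: task
                    image: example/service:1.0
                    command: ["/app/run-nightly-task"]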
  13. File systems
      Well... there is the host’s file system! But I might be scheduled on a different node when I die!
      - If that’s not a problem: who does the cleanup?
      - If that is a problem: Node Pinning to the rescue! And then the node dies...
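      A hedged alternative to the host’s file system is a PersistentVolumeClaim, so the data can follow the pod to whichever node it lands on (assuming a network-backed storage class; names and sizes are placeholders, and the two manifests are separated by ---):

        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: service-data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
        ---
        apiVersion: v1
        kind: Pod
        metadata:
          name: app-with-volume
        spec:
          containers:
          - name: app
            image: example/service:1.0
            volumeMounts:
            - name: data
              mountPath: /var/lib/app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: service-data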