since 2015 - just before it turned 1.0. Right now, we have 655 pods and 242 deployments. Of all the technological choices we made - this is the one I regret the least. :-)
the health of a service - are taken into account during rolling updates:
- old instances will only be shut down once the freshly deployed ones are healthy
- new instances will only receive traffic once they are healthy
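Roughly, those two guarantees come from the Deployment's update strategy plus a readiness probe on the pods. A minimal sketch - the name, image, endpoint and numbers are assumptions, not our actual config:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: my-service                  # hypothetical name
  spec:
    replicas: 3
    selector:
      matchLabels: { app: my-service }
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 0             # old pods are only removed once their replacements are Ready
        maxSurge: 1                   # bring up one extra pod at a time
    template:
      metadata:
        labels: { app: my-service }
      spec:
        containers:
        - name: app
          image: registry.example.com/my-service:1.2.3   # placeholder image
          readinessProbe:             # Ready = allowed to receive traffic
            httpGet: { path: /health, port: 8080 }        # assumed endpoint and port
            periodSeconds: 5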
instance won’t be served by the load balancer.
- Cool during the deployment.
- Cool for temporary glitches ("I can’t answer right now, but I’m sure I’ll be fine in a bit").
- Not so cool for real problems: your probes might just have created a downtime, because none of your instances is being served any more.
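How forgiving that removal is comes down to a few numbers on the readiness probe. A sketch of the container's probe block, with assumed values:

  readinessProbe:
    httpGet: { path: /health, port: 8080 }   # assumed endpoint
    periodSeconds: 5          # check every 5 seconds
    timeoutSeconds: 2
    failureThreshold: 3       # after 3 failures (~15s) the pod is pulled out of the Service
    successThreshold: 1       # one success puts it back in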
they are violated? -> k8s/docker throttles your service. What do services often need when they start up? -> a lot of CPU.
1. Throttled at startup
2. Cannot answer the liveness probe
3. Gets killed
4. Go to step 1.
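One way out of that loop, as a sketch with assumed numbers: give the container enough CPU headroom for startup and keep the liveness probe from firing while the service is still booting (newer Kubernetes versions also offer a startupProbe for exactly this).

  resources:
    requests:
      cpu: "500m"             # what the scheduler reserves
      memory: "512Mi"
    limits:
      cpu: "2"                # headroom for the startup spike
      memory: "512Mi"
  livenessProbe:
    httpGet: { path: /alive, port: 8080 }   # assumed endpoint
    initialDelaySeconds: 120  # don't kill the pod while it is still starting up
    periodSeconds: 10
    failureThreshold: 3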
- For JVM services: these need limits as well.
- Either use the new container-aware flags (Java 8: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap, Java 10: -XX:+UseContainerSupport) or set the limits manually (don’t forget to set the heap size, and take off-heap memory into account: classes, threads, buffers, Jetty).
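What that can look like in the pod spec - a sketch, where the image, the 1Gi limit and the heap/off-heap split are assumptions:

  containers:
  - name: app
    image: registry.example.com/jvm-service:1.0   # placeholder image
    resources:
      limits:
        memory: "1Gi"
    env:
    - name: JAVA_TOOL_OPTIONS
      # Java 10+: let the JVM read the cgroup limit; cap the heap well below 1Gi
      # to leave room for off-heap memory (classes, threads, buffers, Jetty).
      value: "-XX:+UseContainerSupport -Xmx640m"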
problems: Services know best, right? Okay, I shouldn’t depend on other services, but at least the DB should be available? -> 3 replicas x 200 services x 1 check per second = 600 requests/sec hitting the DB just to answer health checks.
minutes to start up (no problem, right? You’ve got rolling updates). The health check only depends on the availability of your DB (OK, the driver is complicated... better restart the service). Then you have a weird network glitch and your DB is unavailable for 30 seconds. -> All your services get killed, and it takes them 5 minutes to come back up. -> You’ve just earned 5 minutes of downtime. (Fail fast... but not too fast.)
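One way to get "fail fast, but not too fast" - a sketch, endpoints and numbers are assumptions: let the liveness probe check only the process itself, and let the readiness probe be the one that notices a missing DB, since a failing readiness probe only takes the pod out of traffic instead of killing it.

  livenessProbe:              # "should I be restarted?" - no DB involved
    httpGet: { path: /alive, port: 8080 }   # assumed endpoint
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:             # "should I get traffic?" - may look at dependencies
    httpGet: { path: /ready, port: 8080 }   # assumed endpoint
    periodSeconds: 10
    failureThreshold: 3       # a DB glitch takes the pod out of traffic, but never kills it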
had bad DNS - needed to have its own DNS server. Now you have two processes and only one health probe, which probably only checks one of them. If the other one fails, everything fails. Can fix that! Just use something to supervise these processes. Turns out, now you have a third problem.
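If each pod really needs its own DNS server, a sidecar container at least gives each process its own probe, so Kubernetes itself plays supervisor. A sketch - the images and probes are assumptions:

  spec:
    containers:
    - name: app
      image: registry.example.com/my-service:1.2.3   # placeholder image
      livenessProbe:
        httpGet: { path: /alive, port: 8080 }          # assumed endpoint
    - name: dns
      image: registry.example.com/dnsmasq:latest       # placeholder sidecar image
      livenessProbe:
        tcpSocket: { port: 53 }   # assumes the DNS server also listens on TCP 53
    # Kubernetes restarts whichever container fails its own probe -
    # no hand-rolled process supervisor needed.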
Cron to the image! Needs to be monitored in principle... but it’s battle proven... guess it’s fine. Logging? Yes, we write JSON to STDOUT. Cron? I’ll just truncate and prefix every line...
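The Kubernetes-native alternative is a CronJob, so the job writes to STDOUT like every other pod instead of fighting with cron's logging. A sketch - name, schedule, image and command are assumptions:

  apiVersion: batch/v1            # batch/v1beta1 on older clusters
  kind: CronJob
  metadata:
    name: nightly-cleanup         # hypothetical job
  spec:
    schedule: "0 3 * * *"
    jobTemplate:
      spec:
        template:
          spec:
            restartPolicy: OnFailure
            containers:
            - name: cleanup
              image: registry.example.com/my-service:1.2.3   # placeholder image
              command: ["/app/cleanup.sh"]                   # assumed entrypoint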
I might be scheduled on a different node when I die! If that’s not a problem - who does the cleanup? If that is a problem: node pinning to the rescue! And then the node dies...
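Node pinning itself is just a selector on the pod spec - a sketch, the node name is an assumption - and it buys you exactly that failure mode: if the pinned node dies, the pod has nowhere left to go.

  spec:
    nodeSelector:
      kubernetes.io/hostname: node-07   # assumed node name - the pod is unschedulable if node-07 is gone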