Lessons learned from 1 year of Kubernetes in production

Lessons learned from 1 year of Kubernetes in production at Crowdfire
30 Jul 2017
Amanpreet Singh
Software Engineer, Crowdfire

What Crowdfire really is

What the stack looks like
- 50+ microservices written in Java, Node.js, Python, Go
- Mostly stateless (except the Chat service)
- Heavy use of Amazon SQS to decouple some parts
- Data stores: DynamoDB, MongoDB, Elasticsearch, Aerospike, MySQL, Postgres, Redis, S3

Before Kubernetes

Before Kubernetes
State of things before Kubernetes came along
- Google App Engine ---> AWS
- Monolith ---> Microservices
- Lots of services == lots of AWS Elastic Beanstalk environments
- Slow, less-repeatable deploys
- Low self-healing ability
- Underutilization

Before Kubernetes
Why it made sense to move to Kubernetes
- Our architecture fits pretty well in the Kubernetes world
- Containers are good for packaging == repeatability++
- Unified pool of resources & bin-packing == utilization++
- Quick container restarts + rescheduling == self-healing++

How we migrated

How we migrated
- Have 12-factor apps!
- Containerize all the things!
- To move everything at once or one-by-one?

How we migrated
- Lots of benefits from moving the initial few services, not so much after that
- Move a relatively less important service first, to deal with the unknowns
- Move a complex service: if that works, everything else would work too
- Supporting services that hardly do anything now

The Lessons

Service discovery (internal DNS) pitfalls
- k8s does service discovery via pre-populated env vars or internal DNS
- Service IPs don't change unless we delete and recreate the service
- Use internal DNS only when we need the pod IPs directly (in DBs, for example)
- Protip: create a service of type ExternalName, an easy way to set an alias that can be resolved via KubeDNS
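The two discovery mechanisms can be sketched from the client's side. This is a minimal illustration, not the deck's own code: Kubernetes injects `<NAME>_SERVICE_HOST` and `<NAME>_SERVICE_PORT` variables (name upper-cased, dashes turned into underscores) for services that existed when the pod started, and the DNS name follows the `name.namespace.svc.<cluster domain>` pattern. The helper name `service_address`, the example service `chat-api`, and the `cluster.local` domain are assumptions.

```python
import os


def service_address(name, namespace="default", environ=None):
    """Return (host, port) for a service; port is None when only DNS is known."""
    env = os.environ if environ is None else environ
    # Env var prefix: service name upper-cased, dashes replaced by underscores.
    prefix = name.upper().replace("-", "_")
    host = env.get(f"{prefix}_SERVICE_HOST")
    port = env.get(f"{prefix}_SERVICE_PORT")
    if host and port:
        return host, int(port)
    # Fall back to the internal DNS name. "cluster.local" is the usual
    # default cluster domain, assumed here; the caller must know the port.
    return f"{name}.{namespace}.svc.cluster.local", None
```

The env vars only cover services created before the pod started, which is one reason a lookup like this keeps the DNS name as a fallback.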

Are your apps Kubernetes-ready?
- Running an app in k8s doesn't magically make it awesome
- Make sure our apps have good healthchecks: k8s won't roll out bad code if its healthchecks are failing!
- Gracefully handle shutdown
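A healthcheck endpoint and graceful shutdown can live in the same small piece of code. The sketch below uses only the Python standard library and assumes a `/healthz` path and port 8080, both hypothetical. On SIGTERM (the signal Kubernetes sends before killing a container) it starts answering 503 and stops accepting new requests once the current one finishes.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

shutting_down = threading.Event()


def health_status():
    """HTTP status for the healthcheck: 503 once shutdown has begun."""
    return 503 if shutting_down.is_set() else 200


def handle_sigterm(signum, frame):
    # Mark the app as going away; the serve loop exits after this is set.
    shutting_down.set()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # "/healthz" is an assumed path; any other path returns 404.
        status = health_status() if self.path == "/healthz" else 404
        self.send_response(status)
        self.end_headers()


def serve(port=8080):
    signal.signal(signal.SIGTERM, handle_sigterm)
    server = HTTPServer(("", port), HealthHandler)
    server.timeout = 1  # wake up every second to re-check the shutdown flag
    while not shutting_down.is_set():
        server.handle_request()
    server.server_close()
```

Calling `serve()` runs the loop; a real service would also drain in-flight work (queue consumers, open connections) before exiting.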

Resource constraints
- Constrain all the things!
- Keep all those leaky apps from wreaking havoc
- Choose an appropriate QoS class based on service type/priority: Guaranteed, Burstable or Best-Effort
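The QoS class is not set directly; Kubernetes derives it from the requests and limits on a pod's containers. The function below is a simplified sketch of that rule for CPU and memory only. The name `qos_class` and the dict layout are assumptions, and quantities are compared as plain strings, so `"500m"` and `"0.5"` would wrongly count as different. Kubernetes spells the last class `BestEffort`.

```python
def qos_class(containers):
    """containers: list of dicts with optional 'requests' and 'limits' dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is not set defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"  # nothing specified on any container
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, a container with only a memory request and a larger memory limit lands in Burstable, while one with CPU and memory limits and no explicit requests is Guaranteed.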

Surviving Failures
- Have enough pods to survive multiple node/pod failures
- Did you know? AWS provides a CMAAS (Chaos Monkey As a Service). It's called "running in US-EAST-1"
- Have at least one more node than required, since a new node takes a while to come up
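The arithmetic behind "enough pods" and "one extra node" can be sketched as follows. This is a rough back-of-the-envelope model, not a scheduler simulation: it assumes replicas are spread as evenly as possible across nodes, which Kubernetes does not guarantee by default. Both function names are hypothetical.

```python
import math


def nodes_needed(pod_count, pods_per_node, spare_nodes=1):
    """Nodes required to fit the pods, plus spare capacity for a node loss."""
    return math.ceil(pod_count / pods_per_node) + spare_nodes


def replicas_left(replicas, nodes, failed_nodes):
    """Worst-case surviving replicas, assuming an even spread across nodes."""
    base, extra = divmod(replicas, nodes)
    # 'extra' nodes hold base + 1 replicas; the worst case loses the fullest nodes.
    per_node = [base + 1] * extra + [base] * (nodes - extra)
    return replicas - sum(per_node[:failed_nodes])
```

With 5 replicas on 3 nodes, losing the two fullest nodes leaves a single replica, which is the kind of margin worth checking before a zone has a bad day.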

Logging & Monitoring
- Since containers and logs are ephemeral, ship logs quickly!
- K8s creates symlinks to the actual Docker logs, with useful info in the filenames: POD-NAME_NAMESPACE_CONTAINER-NAME_CONTAINER-ID.log
- Be sure to monitor pod restarts! Check if it was OOM Killed, App Error or Healthcheck failure
- To run logging & monitoring agents, use DaemonSets
- https://github.com/ApsOps/filebeat-kubernetes
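The metadata in those symlink names can be pulled out with a small parser, for example to tag log lines before shipping them. This sketch assumes the container ID is a 64-character hex string and accepts either `_` or `-` in front of it, since the exact separator is worth checking against the files on your own nodes. The function name and the sample filename are hypothetical.

```python
import re

LOG_NAME = re.compile(
    r"(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)"
    r"[_-](?P<container_id>[0-9a-f]{64})\.log"
)


def parse_log_filename(filename):
    """Split a container log symlink name into its metadata fields."""
    match = LOG_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected log filename: {filename!r}")
    return match.groupdict()
```

Usage, with a made-up container ID:

```python
info = parse_log_filename("chat-7d4b9c_prod_chat-" + "0123456789abcdef" * 4 + ".log")
print(info["pod"], info["namespace"], info["container"])
```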

Stateful applications
- K8s has StatefulSets (previously PetSets), which are pretty awesome
- This can get tricky though: attaching EBS volumes to nodes may not always work as quickly as we expect it to
- Members coming and going are generally costly operations for most data stores
- Bottom line: we don't have to go all in with k8s. Evaluate your use cases for persistent workloads, and have enough replicas

Kubernetes Alpha resources
- Be careful when using k8s alpha resources - they're alpha for a reason
- CronJobs (prev. ScheduledJobs) had lots of missed schedules for us

Sticky sessions
- Since k8s services are L3/L4 based, they can't see the headers
- k8s has sessionAffinity, but it can't see the actual client IP
- Solution that just works: ELB w/ ProxyProtocol enabled --> intermediary nginx --> websocket app
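What ProxyProtocol adds is a single text line ahead of the TCP payload that carries the original client address, which is how the intermediary nginx can see the real client IP. In the setup above nginx does the parsing; the sketch below only illustrates what a version 1 header contains. The function name and the returned keys are hypothetical.

```python
def parse_proxy_v1(line):
    """Parse a PROXY protocol v1 header (bytes ending in CRLF)."""
    if not line.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    parts = line[:-2].decode("ascii").split(" ")
    if len(parts) < 2 or parts[0] != "PROXY":
        raise ValueError("not a PROXY protocol v1 header")
    if parts[1] == "UNKNOWN":
        return None  # the proxy could not determine the client address
    if len(parts) != 6 or parts[1] not in ("TCP4", "TCP6"):
        raise ValueError("malformed PROXY protocol v1 header")
    _, protocol, src_ip, dst_ip, src_port, dst_port = parts
    return {
        "protocol": protocol,
        "client_ip": src_ip,  # the real client, as seen by the load balancer
        "client_port": int(src_port),
        "dest_ip": dst_ip,  # the address the client connected to
        "dest_port": int(dst_port),
    }
```

With the client IP recovered this way, the proxy layer can pin a client to one backend, which is the stickiness that plain L3/L4 services could not provide here.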

After Kubernetes

After Kubernetes
- Ease of managing lots of services
- Deploys are super fast
- Much better resource utilization
- Self-healing services

Thank you
30 Jul 2017
Amanpreet Singh
Software Engineer, Crowdfire
@ApsOps (http://twitter.com/ApsOps)
