Slide 1

Slide 1 text

A Year of Kubernetes at Wongnai
GDG Cloud Bangkok Meetup #2

Slide 2

Slide 2 text

Me
● Manatsawin Hanmongkolchai
● Junior Architect at Wongnai
● Follow me on Medium at life.wongnai.com and blog.whs.in.th

Slide 3

Slide 3 text

Why Kubernetes?
● A move to Docker is natural
  ○ Reproducible environments, who doesn't like them?
● What about Kubernetes?

Slide 4

Slide 4 text

Why Kubernetes?
● Our infrastructure, early 2016
[Diagram labels: L7 ELB; Autoscale Group; EC2: Java (×4); EC2: CMS (HungryFatGuy, WeKorat); EC2: Internal tools; RDS; EC2: Cassandra]

Slide 5

Slide 5 text

Why Kubernetes?
● We were moving to a microservice world
● And also working on many new services
  ○ Restaurant Management System (RMS) - used in LINE MAN
  ○ LINE Chatbot
● One autoscaled server group per service doesn't scale - each service consumes less than one server

Slide 6

Slide 6 text

kubeup
● Around August 2016, I experimented with kube-aws and kubeup
● kube-aws (by CoreOS) felt experimental
● kubeup seemed to be supported by mainline Kubernetes

Slide 7

Slide 7 text

kubeup
● The first service to reach production was LINE Chatbot, on 27 September 2016

Slide 8

Slide 8 text

Production day 1
● Resource allocation was a huge pain
● We had a limited budget for cluster size (3 machines) for 2 apps
● Very modest resource requests were used - nobody knew exactly how many resources we were using
● This resulted in infighting - whenever Jenkins started building, it would crash something

Slide 9

Slide 9 text

Production day 1
● An influx of traffic to LINE Chatbot crashed our RMS development server
● In the end we used node selectors: RMS and Chatbot must not be on the same server
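A minimal sketch of what pinning a workload with a node selector could look like, written as a Python helper that edits a Deployment manifest. The label key and value (workload=chatbot), the Deployment name and the image are hypothetical; the talk does not say which labels were used. It also assumes the target nodes have already been labelled, for example with kubectl label nodes <node-name> workload=chatbot.

```python
import copy
import json


def pin_to_nodes(manifest, node_labels):
    """Return a copy of a Deployment manifest whose pods only schedule
    on nodes carrying all of the given labels."""
    pinned = copy.deepcopy(manifest)
    pod_spec = pinned["spec"]["template"]["spec"]
    # nodeSelector is a plain label map on the pod spec.
    pod_spec.setdefault("nodeSelector", {}).update(node_labels)
    return pinned


# Hypothetical Deployment; names and image are placeholders.
chatbot = {
    "apiVersion": "extensions/v1beta1",
    "kind": "Deployment",
    "metadata": {"name": "chatbot"},
    "spec": {
        "template": {
            "metadata": {"labels": {"app": "chatbot"}},
            "spec": {
                "containers": [{"name": "chatbot", "image": "example/chatbot:1"}]
            },
        }
    },
}

pinned = pin_to_nodes(chatbot, {"workload": "chatbot"})
print(json.dumps(pinned["spec"]["template"]["spec"], indent=2))
```

Giving RMS a different label value than Chatbot keeps the two on separate machines, at the cost of the scheduler no longer being free to pack them together.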

Slide 10

Slide 10 text

We ❤ Kubernetes
● Docker containers start really fast - faster than firing up EC2 instances
● Simple deployment - edit the container tag and wait
● Readiness checks ensure basic stability (but don't rely on them much)
● The web interface lets team members skip learning kubectl (but not you)

Slide 11

Slide 11 text

Kubernetes woes
● There is no monitoring. If a pod goes into a crash loop, nobody knows
● I wrote kube-slack to send a message to Slack. https://github.com/wongnai/kube-slack
● It works, but the channel is so spammy
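The kind of check such a notifier needs can be sketched in a few lines of Python. This is not kube-slack's code; it is an illustrative filter over the JSON that kubectl get pods --all-namespaces -o json returns, looking for containers whose waiting reason is CrashLoopBackOff. The sample pod names are made up.

```python
def crash_looping(pod_list):
    """Return (namespace, pod, container) for every container that is
    currently waiting with reason CrashLoopBackOff."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            # "waiting" is only present while the container is not running.
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append(
                    (meta.get("namespace"), meta.get("name"), status.get("name"))
                )
    return found


# Hypothetical sample in the shape of a pod list.
sample = {
    "items": [
        {
            "metadata": {"namespace": "default", "name": "chatbot-1"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "chatbot",
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    }
                ]
            },
        },
        {
            "metadata": {"namespace": "default", "name": "rms-1"},
            "status": {
                "containerStatuses": [{"name": "rms", "state": {"running": {}}}]
            },
        },
    ]
}

alerts = crash_looping(sample)
print(alerts)
```

Sending one message per poll for every match is what makes a channel spammy; remembering which pods were already reported is the obvious next step.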

Slide 12

Slide 12 text

A hack we made in production
● Changing the scheduler policy https://life.wongnai.com/how-kubernetes-schedule-pods-352a7bb0eb10
● Sometimes all pods of the same application are scheduled on one node. If it goes down, the whole application goes down
● The most popular fix is inter-pod affinity, but that is only available from 1.4
● I modified the scheduler policy to prioritize spreading over utilization
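As a rough illustration of what "prioritize spreading" can mean, the sketch below builds a legacy kube-scheduler policy document in which the spreading priority outweighs the resource-based one. SelectorSpreadPriority and LeastRequestedPriority are built-in priority names from that scheduler generation; the weights are assumptions chosen for the example, not the values used at Wongnai, and the predicates section a real policy file needs is left out.

```python
import json


def spreading_policy(spread_weight=2, resource_weight=1):
    """Build a scheduler policy dict that scores spreading pods of the
    same service across nodes higher than resource-based placement."""
    return {
        "kind": "Policy",
        "apiVersion": "v1",
        # Predicates (hard filters) are intentionally omitted in this sketch.
        "priorities": [
            {"name": "SelectorSpreadPriority", "weight": spread_weight},
            {"name": "LeastRequestedPriority", "weight": resource_weight},
        ],
    }


policy = spreading_policy()
print(json.dumps(policy, indent=2))
```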

Slide 13

Slide 13 text

What we were missing out on
● 1.4: Scheduled Job, Dynamic PVC Provisioning, Init containers, Pod affinity (that's why we modified the scheduler), New interface
● 1.6: Node Affinity (now master is just another node)
● 1.7: Network Policy

Slide 14

Slide 14 text

It's time to upgrade
● kubeup has no upgrade path (but there were no other tools at that time)
● To upgrade, I manually edited the launch configuration to point to the new Kubernetes binary and rolled the cluster
  ○ Which is not easy because it is gzipped

Slide 15

Slide 15 text

Migrating to Kops
● Kubeup was replaced by Kops (Kubernetes Operations)
● Kops does have an upgrade path
● Expected time to migrate: 2 months
● Actual time taken: 3.5 months

Slide 16

Slide 16 text

What went wrong
● kubectl edit makes it easy to make changes, and so does the web interface
● … but the changes are not tracked! All YAML files were outdated!
  ○ I built a tool that runs kubectl get pod -o yaml, applies a sanity pass, waits for manual review, then pushes the result to the new cluster
● Release scheduling issues
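A sanity pass over an exported manifest might, for instance, strip the fields the API server fills in, so the object can be created cleanly on another cluster. The sketch below is an assumption about what such a pass could do, not the tool described in the talk; the list of fields is a commonly removed set and may not be complete.

```python
import copy

# Metadata fields set by the API server rather than by the author.
SERVER_SET_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "selfLink",
    "generation",
)


def sanitize(manifest):
    """Return a copy of an exported object without cluster-specific state."""
    clean = copy.deepcopy(manifest)
    clean.pop("status", None)  # runtime state, meaningless on a new cluster
    metadata = clean.get("metadata", {})
    for key in SERVER_SET_METADATA:
        metadata.pop(key, None)
    return clean


# Hypothetical exported object, already parsed from YAML into a dict.
exported = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "rms-1",
        "namespace": "default",
        "uid": "0000-1111",
        "resourceVersion": "12345",
        "creationTimestamp": "2017-01-01T00:00:00Z",
    },
    "spec": {"containers": [{"name": "rms", "image": "example/rms:1"}]},
    "status": {"phase": "Running"},
}

cleaned = sanitize(exported)
print(cleaned)
```

The manual review step still matters: a pass like this cannot tell an intentional hand edit from an accidental one.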

Slide 17

Slide 17 text

But we had (almost) no downtime
● We could move traffic using ALB host-based routing
● Broken deployment? Roll back to the old server in 30 seconds - faster than a DNS-based switch

Slide 18

Slide 18 text

Our deployment system
● We have our own deployment system - Project Eastern
● No plan to open source it yet - it's deeply integrated into our Jenkins instance (that's why we can't move to GitLab)

Slide 19

Slide 19 text

Project Eastern Architecture
[Diagram labels: Jenkins UI; Node lookup by environment; Jenkins Swarm (Kube1): K8S 1.3 API, Templating; Jenkins Swarm (Kube2): K8S 1.7 API, Templating; Jenkins Swarm (GKE): K8S 1.8 API, Templating]

Slide 20

Slide 20 text

Project Eastern Templating
● Logicless
● Load partials (the # load! line)
    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: wongnai-react
      labels:
        app: wongnai-react
    spec:
      targetCPUUtilizationPercentage: 100
      maxReplicas: 15
      minReplicas: 3
      scaleTargetRef:
        kind: Deployment
        name: wongnai-react
        apiVersion: extensions/v1beta1
      # load! overrides/hpa-${NAMESPACE}.yaml, overrides/hpa-default-val.yaml
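For readers unfamiliar with how these three numbers interact, here is a simplified sketch of the replica calculation a HorizontalPodAutoscaler performs: scale the current replica count by the ratio of observed to target CPU utilization, round up, then clamp to the min/max bounds. It ignores the controller's tolerance band and cooldown behaviour, and the utilization figures in the usage lines are made up. Utilization is measured relative to each pod's CPU request, which is why a target of 100 is meaningful.

```python
import math


def desired_replicas(
    current_replicas,
    current_cpu_percent,
    target_cpu_percent=100,
    min_replicas=3,
    max_replicas=15,
):
    """Simplified HPA arithmetic: proportional scaling, rounded up,
    clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical load levels against the manifest's bounds (3 to 15 replicas).
busy = desired_replicas(4, 150)    # pods use 150% of their CPU request
idle = desired_replicas(3, 40)     # pods use 40% of their CPU request
spike = desired_replicas(10, 200)  # would need 20 replicas, capped at 15
print(busy, idle, spike)
```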

Slide 21

Slide 21 text

Our templating system
● Simple to write, no {{ partial | indent:4 }} hack like Helm
  ○ Partials are automatically indented to match the load partial line
● Basic conditions by loading a file per namespace
  ○ We'll need complex conditions soon…
● Simple implementation:
  ○ Read one line
  ○ If the line begins with # load, recursively run this on the first file found
  ○ Indent the partial by the number of spaces found before the #
● I'm considering open sourcing it, but it is low priority >_<
  ○ Plus we are considering other solutions
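The implementation steps above can be sketched in Python roughly as follows. This is a reconstruction from the bullet points, not Project Eastern's code: the # load! marker and the ${NAMESPACE} placeholder are taken from the HPA example slide, while resolving partial paths against a single root directory, skipping the line when no candidate file exists, and the demo file contents are all assumptions.

```python
import os
import re
import tempfile

# Leading spaces, the marker, then a comma-separated list of candidate files.
LOAD_RE = re.compile(r"^( *)# load!\s+(.+)$")


def render(root, name, namespace):
    """Render a template into a list of lines, expanding # load! lines
    with the first candidate partial that exists under root."""
    with open(os.path.join(root, name)) as handle:
        lines = handle.read().splitlines()
    output = []
    for line in lines:
        match = LOAD_RE.match(line)
        if match is None:
            output.append(line)
            continue
        indent, candidates = match.group(1), match.group(2)
        for candidate in candidates.split(","):
            partial = candidate.strip().replace("${NAMESPACE}", namespace)
            if os.path.exists(os.path.join(root, partial)):
                # Recurse so partials may load partials, then indent the
                # result to the column where the # was found.
                for sub in render(root, partial, namespace):
                    output.append(indent + sub if sub else sub)
                break
        # If no candidate exists, the load line is simply dropped.
    return output


def write(root, name, text):
    with open(os.path.join(root, name), "w") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "overrides"))
    write(
        root,
        "hpa.yaml",
        "spec:\n"
        "  # load! overrides/hpa-${NAMESPACE}.yaml, overrides/hpa-default-val.yaml\n",
    )
    write(root, "overrides/hpa-default-val.yaml", "minReplicas: 3\nmaxReplicas: 15\n")
    write(root, "overrides/hpa-production.yaml", "minReplicas: 5\n")
    staging = render(root, "hpa.yaml", "staging")
    production = render(root, "hpa.yaml", "production")

print("\n".join(staging))
print("\n".join(production))
```

Because the only "condition" is which file exists, the templates stay logicless; anything more complex than per-namespace overrides would need a different mechanism.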

Slide 22

Slide 22 text

Our current architecture
[Diagram labels: Kubernetes; ALB; Autoscale Group; EC2: Java (×2); EC2: CMS (HungryFatGuy, WeKorat); EC2: Internal tools; RDS; EC2: Cassandra; Traefik Ingress Controller; nginx; React; NLB; api-gateway; Cooking; Media; Java (admin); Internal tools; Chatbot; RMS (LINEMAN); LINEMAN ALB; ElastiCache]

Slide 23

Slide 23 text

What we're working on
● Autoscaling cluster
● Proper resource allocation and deployment autoscaler settings
● EC2 <> K8S communication
● New deployment tool?

Slide 24

Slide 24 text

Summary
● Kubernetes is essential to our microservice architecture
● You can run Kubernetes without migrating everything
● A Kubernetes upgrade is still something that has to be planned for
● You will need to build (or find) some DevOps tools, as there is no established solution yet

Slide 25

Slide 25 text

Thank you