
A Year of Kubernetes at Wongnai

GDG Cloud Bangkok Meetup 2

Manatsawin Hanmongkolchai

November 21, 2017

Transcript

  1. Me • Manatsawin Hanmongkolchai • Junior Architect at Wongnai •

    Follow me on Medium at life.wongnai.com and blog.whs.in.th
  2. Why Kubernetes? • A move to Docker is natural ◦

    Reproducible environment, who doesn't like it? • What about Kubernetes?
  3. Why Kubernetes? • Our infrastructure, early 2016 (diagram): an L7 ELB

    in front of an autoscale group of four EC2: Java instances, plus EC2: CMS (HungryFatGuy, WeKorat), EC2: Internal tools, RDS and EC2: Cassandra
  4. Why Kubernetes? • We were transforming to a microservice world

    • And also working on many new services ◦ Restaurant Management System (RMS) - used in LINE MAN ◦ LINE Chatbot • An autoscaling group per service doesn't scale - each service consumes less than a full server
  5. kubeup • Around August 2016, I experimented with kube-aws and

    kubeup • kube-aws (by CoreOS) felt experimental • kubeup seemed to be supported by mainline Kubernetes
  6. Production day 1 • Resource allocation was a huge pain.

    • We had a limited budget for cluster size (3 machines) for 2 apps • Very modest resource requests were used - nobody knew exactly how much resource we were using • This resulted in infighting - whenever Jenkins started a build, it would crash something
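The resource fights above come down to setting `resources` on each container explicitly. A minimal sketch of what such a block looks like; the numbers are illustrative, not Wongnai's actual values:

```yaml
# Container spec excerpt. Requests drive scheduling decisions;
# limits cap what the container may actually consume.
# All values here are illustrative assumptions.
resources:
  requests:
    cpu: 250m       # a quarter of a core reserved at scheduling time
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```

Without requests, the scheduler packs pods by count rather than by real usage, which is what lets a Jenkins build starve a neighbouring app.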
  7. Production day 1 • Influx of traffic to LINE Chatbot

    crashed our RMS development server • In the end we used a node selector: RMS and Chatbot must not be on the same server
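The node-selector fix is a node label plus a `nodeSelector` in the pod spec. A sketch with hypothetical label names (the slide does not say what labels Wongnai used):

```yaml
# Pod spec excerpt: only schedule onto nodes labeled dedicated=rms.
# The label key/value are hypothetical; a node is labeled with:
#   kubectl label node <node-name> dedicated=rms
spec:
  nodeSelector:
    dedicated: rms
```

Giving Chatbot a different label value then guarantees the two workloads never share a machine.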
  8. We ❤ Kubernetes • Docker containers start really fast -

    faster than firing up EC2 instances • Simple deployment - edit the container tag and wait • Readiness checks ensure basic stability (but don't rely on them much) • The web interface lets team members skip learning kubectl (but not you)
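A readiness check of the kind mentioned here is declared per container; a minimal sketch, where the path, port, and timings are all assumptions:

```yaml
# Container spec excerpt: the pod only receives Service traffic
# once this probe starts passing. Endpoint and timings are assumed.
readinessProbe:
  httpGet:
    path: /healthz   # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
```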
  9. Kubernetes woes • There is no monitoring. If a pod

    goes into a crash loop, nobody knows • I wrote kube-slack to send messages to Slack: https://github.com/wongnai/kube-slack • It works, but the channel is so spammy
  10. Hack we made in production • Changing scheduler policy https://life.wongnai.com/how-kubernetes-schedule-pods-352a7bb0eb10

    • Sometimes pods of the same application are scheduled on the same node. If that node goes down, the whole application goes down. • The most popular fix is inter-pod affinity, but that is only available from 1.4 • I modified the scheduler policy to prioritize spreading over utilization
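kube-scheduler of that era accepted a JSON policy file via `--policy-config-file`. A sketch of weighting spreading above utilization - the priority names are real scheduler priorities of the 1.x line, but the weights are illustrative, not the values from the linked post:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "priorities": [
    {"name": "SelectorSpreadPriority", "weight": 10},
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1}
  ]
}
```

Raising `SelectorSpreadPriority` relative to the utilization-based priorities makes the scheduler prefer putting replicas of one service on different nodes, even when packing them together would balance resources better.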
  11. What we were missing out • 1.4: Scheduled Job, Dynamic

    PVC Provisioning, Init containers, Pod affinity (that's why we modified the scheduler), New interface • 1.6: Node Affinity (now the master is just another node) • 1.7: Network Policy
  12. It's time for upgrade • kubeup has no upgrade path

    (but there were no other tools at that time) • To upgrade, I manually edited the launch configuration to point to the new Kubernetes binary and rolled the cluster ◦ Which is not easy, because it is gzipped
  13. Migrating to kops • kubeup was replaced by kops (Kubernetes

    Operations) • kops does have an upgrade path • Expected time to migrate: 2 months • Actual time taken: 3.5 months
  14. What went wrong • kubectl edit makes it easy to

    make changes, and so does the web interface • … but the changes are not tracked! All our YAML files were outdated! ◦ I built a tool that does kubectl get -o yaml, runs a sanity pass and a manual review, then pushes to the new cluster • Release scheduling issues
  15. But we had (almost) no downtime • We can move

    traffic using ALB host-based routing • Broken deployment? Roll back to the old server in 30 seconds - faster than DNS-based switching
  16. Our deployment system • We have our own deployment system

    - Project Eastern • No plans to open source it yet - it's deeply integrated into our Jenkins instance (that's why we can't move to GitLab)
  17. Project Eastern architecture (diagram) • The Jenkins UI looks up a

    node by environment, then hands off to a Jenkins Swarm node per cluster - Kube1 (K8S 1.3 API), Kube2 (K8S 1.7 API) and GKE (K8S 1.8 API) - each running templating before calling the cluster API
  18. Project Eastern Templating • Logicless

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: wongnai-react
      labels:
        app: wongnai-react
    spec:
      targetCPUUtilizationPercentage: 100
      maxReplicas: 15
      minReplicas: 3
      scaleTargetRef:
        kind: Deployment
        name: wongnai-react
        apiVersion: extensions/v1beta1
      # load! overrides/hpa-${NAMESPACE}.yaml, overrides/hpa-default-val.yaml
    • Load partials
  19. Our templating system • Simple to write, no {{ partial

    | indent:4 }} hack like Helm ◦ Partials are automatically indented to the load partial line • Basic condition by loading file by namespace ◦ We'll need complex conditions soon…. • Simple implementation: ◦ Read one line ◦ If the line begins with # load then recursively run this with the first file found ◦ Indent the partials to the number of spaces found before # • I'm considering open sourcing it, but it is low priority >_< ◦ Plus we are considering other solutions
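The loader algorithm described on this slide (read a line; on `# load!`, recurse into the first existing file; indent the partial to match the marker) can be sketched in a few lines of Python. This is an illustrative reimplementation, not Project Eastern's code, and it omits the `${NAMESPACE}` variable expansion the real system applies to candidate filenames:

```python
import os
import re

def render(path):
    """Recursively expand `# load!` directives in a template file.

    Sketch of the slide's algorithm: each `# load! a.yaml, b.yaml` line
    is replaced by the first listed file that exists, with every line of
    the partial indented to match the `# load!` marker's indentation.
    """
    out = []
    with open(path) as f:
        for line in f:
            m = re.match(r"(\s*)# load! (.+)", line)
            if not m:
                out.append(line.rstrip("\n"))
                continue
            indent, candidates = m.group(1), m.group(2)
            for name in (c.strip() for c in candidates.split(",")):
                if os.path.exists(name):
                    # partials may themselves contain load directives
                    partial = render(name)
                    out.extend(indent + p for p in partial.splitlines())
                    break
            # if no candidate file exists, the directive expands to nothing
    return "\n".join(out)
```

Falling through the candidate list is what gives the "basic condition by namespace": `overrides/hpa-${NAMESPACE}.yaml` wins when present, otherwise the default-values file is used.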
  20. Our current architecture (diagram) • The legacy side keeps the ALB,

    an autoscale group of EC2: Java instances, EC2: CMS (HungryFatGuy, WeKorat), EC2: Internal tools, RDS and EC2: Cassandra • The Kubernetes side runs a Traefik ingress controller in front of nginx, React, api-gateway, Cooking, Media, Java (admin), Internal tools, Chatbot and RMS (LINE MAN), with an NLB, the LINE MAN ALB and ElastiCache
  21. What we're working on • Autoscaling cluster • Proper resource

    allocation and deployment autoscaler settings • EC2 <> K8S communication • New deployment tool?
  22. Summary • Kubernetes is essential to our microservice architecture •

    You can run Kubernetes without migrating everything • Kubernetes upgrades still have to be planned for • You will need to build (or find) some DevOps tools, as there are no established solutions yet