
A Year of Kubernetes at Wongnai

GDG Cloud Bangkok Meetup 2

Manatsawin Hanmongkolchai

November 21, 2017

Transcript

  1. Me • Manatsawin Hanmongkolchai • Junior Architect at Wongnai •

    Follow me on Medium at life.wongnai.com and blog.whs.in.th
  2. Why Kubernetes? • A move to Docker is natural ◦

    Reproducible environment, who doesn't like it? • What about Kubernetes?
  3. Why Kubernetes? • Our infrastructure, early 2016 (diagram): an L7 ELB

    in front of an autoscale group of four EC2: Java instances, plus EC2: CMS (HungryFatGuy, WeKorat), EC2: Internal tools, RDS and EC2: Cassandra
  4. Why Kubernetes? • We were transforming to a microservice world

    • And also working on many new services ◦ Restaurant Management System (RMS) - used in LINE MAN ◦ LINE Chatbot • An autoscaling group per service doesn't scale - each service consumes less than a full server
  5. kubeup • Around August 2016, I experimented with kube-aws and

    kubeup • kube-aws (by CoreOS) felt experimental • kubeup seemed to be supported by mainline Kubernetes
  6. Production day 1 • Resource allocation was a huge pain.

    • We had a limited budget for cluster size (3 machines) for 2 apps • Very modest resource requests were used - nobody knew exactly how much resource we were using • This resulted in infighting - whenever Jenkins started a build, it would crash something
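The resource fights above come down to setting `resources` on each container explicitly. A minimal sketch of what such a block looks like; the numbers are illustrative, not Wongnai's actual values:

```yaml
# Container spec excerpt. Requests drive scheduling decisions;
# limits cap what the container may actually consume.
# All values here are illustrative assumptions.
resources:
  requests:
    cpu: 250m       # a quarter of a core reserved at scheduling time
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```

Without requests, the scheduler packs pods by count rather than by real usage, which is what lets a Jenkins build starve a neighbouring app.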
  7. Production day 1 • Influx of traffic to LINE Chatbot

    crashed our RMS development server • In the end we used a node selector: RMS and Chatbot must not be on the same server
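The node-selector fix is a node label plus a `nodeSelector` in the pod spec. A sketch with hypothetical label names (the slide does not say what labels Wongnai used):

```yaml
# Pod spec excerpt: only schedule onto nodes labeled dedicated=rms.
# The label key/value are hypothetical; a node is labeled with:
#   kubectl label node <node-name> dedicated=rms
spec:
  nodeSelector:
    dedicated: rms
```

Giving Chatbot a different label value then guarantees the two workloads never share a machine.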
  8. We ❤ Kubernetes • Docker containers start really fast -

    faster than firing up EC2 instances • Simple deployment - edit the container tag and wait • Readiness checks ensure basic stability (but don't rely on them much) • The web interface lets team members skip learning kubectl (but not you)
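A readiness check of the kind mentioned here is declared per container; a minimal sketch, where the path, port, and timings are all assumptions:

```yaml
# Container spec excerpt: the pod only receives Service traffic
# once this probe starts passing. Endpoint and timings are assumed.
readinessProbe:
  httpGet:
    path: /healthz   # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
```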
  9. Kubernetes woes • There is no monitoring. If a pod

    goes into a crash loop, nobody knows • I wrote kube-slack to send messages to Slack: https://github.com/wongnai/kube-slack • It works, but the channel is so spammy
  10. Hack we made in production • Changing scheduler policy https://life.wongnai.com/how-kubernetes-schedule-pods-352a7bb0eb10

    • Sometimes pods of the same application are scheduled on the same node. If that node goes down, the whole application goes down. • The most popular fix is inter-pod affinity, but that is only available from 1.4 • I modified the scheduler policy to prioritize spreading over utilization
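kube-scheduler of that era accepted a JSON policy file via `--policy-config-file`. A sketch of weighting spreading above utilization - the priority names are real scheduler priorities of the 1.x line, but the weights are illustrative, not the values from the linked post:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "priorities": [
    {"name": "SelectorSpreadPriority", "weight": 10},
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1}
  ]
}
```

Raising `SelectorSpreadPriority` relative to the utilization-based priorities makes the scheduler prefer putting replicas of one service on different nodes, even when packing them together would balance resources better.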
  11. What we were missing out • 1.4: Scheduled Job, Dynamic

    PVC Provisioning, Init containers, Pod affinity (that's why we modified the scheduler), New interface • 1.6: Node Affinity (now the master is just another node) • 1.7: Network Policy
  12. It's time for upgrade • kubeup has no upgrade path

    (but there were no other tools at that time) • To upgrade, I manually edited the launch configuration to point to the new Kubernetes binary and rolled the cluster ◦ Which is not easy, because it is gzipped
  13. Migrating to kops • kubeup was replaced by kops (Kubernetes

    Operations) • kops does have an upgrade path • Expected time to migrate: 2 months • Actual time taken: 3.5 months
  14. What went wrong • kubectl edit makes it easy to

    make changes, and so does the web interface • … but the changes are not tracked! All our YAML files were outdated! ◦ I built a tool that does kubectl get -o yaml, runs a sanity pass and a manual review, then pushes to the new cluster • Release scheduling issues
  15. But we had (almost) no downtime • We can move

    traffic using ALB host-based routing • Broken deployment? Roll back to the old server in 30 seconds - faster than DNS-based switching
  16. Our deployment system • We have our own deployment system

    - Project Eastern • No plans to open source it yet - it's deeply integrated into our Jenkins instance (that's why we can't move to GitLab)
  17. Project Eastern architecture (diagram) • The Jenkins UI looks up a

    node by environment, then hands off to a Jenkins Swarm node per cluster - Kube1 (K8S 1.3 API), Kube2 (K8S 1.7 API) and GKE (K8S 1.8 API) - each running templating before calling the cluster API
  18. Project Eastern Templating • Logicless

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: wongnai-react
      labels:
        app: wongnai-react
    spec:
      targetCPUUtilizationPercentage: 100
      maxReplicas: 15
      minReplicas: 3
      scaleTargetRef:
        kind: Deployment
        name: wongnai-react
        apiVersion: extensions/v1beta1
      # load! overrides/hpa-${NAMESPACE}.yaml, overrides/hpa-default-val.yaml
    • Load partials
  19. Our templating system • Simple to write, no {{ partial

    | indent:4 }} hack like Helm ◦ Partials are automatically indented to the load partial line • Basic condition by loading file by namespace ◦ We'll need complex conditions soon…. • Simple implementation: ◦ Read one line ◦ If the line begins with # load then recursively run this with the first file found ◦ Indent the partials to the number of spaces found before # • I'm considering open sourcing it, but it is low priority >_< ◦ Plus we are considering other solutions
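The loader algorithm described on this slide (read a line; on `# load!`, recurse into the first existing file; indent the partial to match the marker) can be sketched in a few lines of Python. This is an illustrative reimplementation, not Project Eastern's code, and it omits the `${NAMESPACE}` variable expansion the real system applies to candidate filenames:

```python
import os
import re

def render(path):
    """Recursively expand `# load!` directives in a template file.

    Sketch of the slide's algorithm: each `# load! a.yaml, b.yaml` line
    is replaced by the first listed file that exists, with every line of
    the partial indented to match the `# load!` marker's indentation.
    """
    out = []
    with open(path) as f:
        for line in f:
            m = re.match(r"(\s*)# load! (.+)", line)
            if not m:
                out.append(line.rstrip("\n"))
                continue
            indent, candidates = m.group(1), m.group(2)
            for name in (c.strip() for c in candidates.split(",")):
                if os.path.exists(name):
                    # partials may themselves contain load directives
                    partial = render(name)
                    out.extend(indent + p for p in partial.splitlines())
                    break
            # if no candidate file exists, the directive expands to nothing
    return "\n".join(out)
```

Falling through the candidate list is what gives the "basic condition by namespace": `overrides/hpa-${NAMESPACE}.yaml` wins when present, otherwise the default-values file is used.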
  20. Our current architecture (diagram) • The legacy side keeps the ALB,

    an autoscale group of EC2: Java instances, EC2: CMS (HungryFatGuy, WeKorat), EC2: Internal tools, RDS and EC2: Cassandra • The Kubernetes side runs a Traefik ingress controller in front of nginx, React, api-gateway, Cooking, Media, Java (admin), Internal tools, Chatbot and RMS (LINE MAN), with an NLB, the LINE MAN ALB and ElastiCache
  21. What we're working on • Autoscaling cluster • Proper resource

    allocation and deployment autoscaler settings • EC2 <> K8S communication • New deployment tool?
  22. Summary • Kubernetes is essential to our microservice architecture •

    You can run Kubernetes without migrating everything • Kubernetes upgrades still have to be planned for • You will need to build (or find) some DevOps tools, as there are no established solutions yet