[2018.09 Meetup] [TALK #2] Jannis Rake-Revelant - Kubernetes at Zalando

DevOps Lisbon
September 17, 2018

Kubernetes was introduced at Zalando for several reasons, including infrastructure efficiency and service reliability. It was also an opportunity to change our build and deployment processes.
Our aim is for developers to focus on the needs of the business applications they are developing. However, compliance, security, and hard-to-use tooling can get in the way. With this in mind, we made a radical change in how we build and ship software to AWS, specifically to Kubernetes.

Jannis is an Engineering Lead at Zalando for the team that is responsible for providing an elastic, reliable and automated Platform-as-a-Service infrastructure for developers to build and run applications on. He has been working with Kubernetes since before its GA. Jannis is always eager to talk distributed systems and any challenges related to them.



  1. Kubernetes at Zalando DEVOPS MEETUP LISBON JANNIS RAKE-REVELANT @jannis_r 2018-09-17

  2. ZALANDO AT A GLANCE

    • ~ 4.5 billion EUR revenue 2017
    • > 200 million visits per month
    • > 15,000 employees in Europe
    • > 70% of visits via mobile devices
    • > 23 million active customers
    • > 300,000 product choices
    • ~ 2,000 brands
    • 15 countries
  3. AGENDA • Overview of Kubernetes at Zalando • Architecture • Open-Source

  4. ZALANDO TECH • ~ 2,000 Employees in Tech • > 200 Delivery teams
  5. SCALE • 99 Clusters • 380 Accounts

  6. INFRASTRUCTURE @ ZALANDO

    STUPS (toolset around AWS)
    • AWS accounts per team.
    • All instances must run the same AMI.
    • PowerUser access to Production.
    • You build it, you run EVERYTHING.

    Kubernetes
    • Clusters per product (multiple teams).
    • Instances are not managed by teams.
    • Hands off approach.
    • A lot of stuff out of the box.
  7. CURRENT STATUS

    • Kubernetes announced as "GA"
    • ADOPT on Zalando Tech Radar
    • Critical services on Kubernetes
    • Next: Kubernetes as default, sunset date for STUPS

  9. KUBERNETES DEPLOYMENTS [Chart: share of deployments, CDP/Kubernetes vs. AWS/STUPS; data labels 55% and 29%; time axis from Dec 2017 to July]


  11. “PHILOSOPHY”

    • No pet clusters: We don’t want to tweak custom settings for 90+ clusters.
    • Always provide the latest stable Kubernetes version: Oldest clusters were upgraded from v1.4 through v1.11.
    • Continuous and non-disruptive cluster updates: No maintenance windows.
    • “Fully” automated operations: Operators should only need to manually merge PRs.
  12. CLUSTER SETUP

    • Provisioned in AWS via CloudFormation.
    • Etcd stack outside Kubernetes.
    • Distribution: Container Linux.
    • Multi AZ worker nodes.
    • Highly available control plane setup behind ELB.
    • Cluster configuration stored in git.
    • e2e tests run via Jenkins.
    • Changes rolled out via ‘Cluster Lifecycle Manager’.
  13. CLUSTER LIFECYCLE MANAGER (CLM) github.com/zalando-incubator/cluster-lifecycle-manager


  15. CLUSTER CHANNELS github.com/zalando-incubator/kubernetes-on-aws

    Channel   Description                                                       Clusters
    dev       Development and playground clusters.                              3
    alpha     Main infrastructure cluster (important to us).                    1
    beta      Product clusters for the rest of the organization (prod/test).    90+
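
The channel table above can be read as a staged rollout. Below is a minimal Python sketch of that reading. It assumes, which the slide does not state outright, that changes move through the channels in the order dev, alpha, beta. The `Channel` type and the `next_channel` helper are hypothetical names for illustration only and are not part of kubernetes-on-aws or the Cluster Lifecycle Manager.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Channel:
    name: str
    description: str
    clusters: str  # count as printed on the slide; "90+" is not an exact number


# Channels as listed on the slide. The promotion order is an assumption.
CHANNELS = [
    Channel("dev", "Development and playground clusters", "3"),
    Channel("alpha", "Main infrastructure cluster", "1"),
    Channel("beta", "Product clusters for the rest of the organization (prod/test)", "90+"),
]


def next_channel(current: str) -> Optional[str]:
    """Return the channel that follows `current`, or None for the last channel."""
    names = [channel.name for channel in CHANNELS]
    index = names.index(current)  # raises ValueError for an unknown channel name
    return names[index + 1] if index + 1 < len(names) else None
```

Under this assumption a change would reach the 90+ product clusters only after it has run on the three dev clusters and on the single alpha cluster.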
  16. E2E TESTS ON EVERY PR github.com/zalando-incubator/kubernetes-on-aws

  17. E2E TESTS

    • Conformance Tests: Upstream Kubernetes e2e conformance tests (✓ 144)
    • Zalando Tests (custom): Custom tests for ingress, external-dns, PSP etc. (✓ 4)
    • StatefulSet Tests: Rolling update of stateful sets including volume mounting (✓ 2)
  19. E2E TESTS FOR OUR INFRASTRUCTURE

    [Diagram: testing dev to alpha upgrade. Two cluster states, each with a control plane and two nodes, labeled "branch: alpha (base)" and "branch: dev (head)". Steps: Create Cluster, Update Cluster, Run e2e tests, Delete Cluster.]
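
The four steps on this slide can be sketched as a small Python routine. This is an illustrative sketch, not the pipeline Zalando runs. It assumes the test cluster is created from the base branch (alpha) and then updated to the head branch (dev). The `ClusterBackend` interface, `check_upgrade`, and `RecordingBackend` are hypothetical names, and deleting the cluster in a `finally` block is a design choice of this sketch rather than something the slides state.

```python
from typing import List, Protocol


class ClusterBackend(Protocol):
    """Hypothetical interface; the real tooling behind these steps is not shown in the slides."""

    def create_cluster(self, branch: str) -> str: ...
    def update_cluster(self, cluster_id: str, branch: str) -> None: ...
    def run_e2e_tests(self, cluster_id: str) -> bool: ...
    def delete_cluster(self, cluster_id: str) -> None: ...


def check_upgrade(backend: ClusterBackend, base: str = "alpha", head: str = "dev") -> bool:
    """Create a cluster from `base`, update it to `head`, run e2e tests, then delete it."""
    cluster_id = backend.create_cluster(base)
    try:
        backend.update_cluster(cluster_id, head)
        return backend.run_e2e_tests(cluster_id)
    finally:
        # Assumption of this sketch: always clean up, even when a step fails.
        backend.delete_cluster(cluster_id)


class RecordingBackend:
    """In-memory stand-in that only records which steps were called."""

    def __init__(self, tests_pass: bool = True) -> None:
        self.tests_pass = tests_pass
        self.steps: List[str] = []

    def create_cluster(self, branch: str) -> str:
        self.steps.append(f"create:{branch}")
        return "e2e-test-cluster"

    def update_cluster(self, cluster_id: str, branch: str) -> None:
        self.steps.append(f"update:{branch}")

    def run_e2e_tests(self, cluster_id: str) -> bool:
        self.steps.append("e2e")
        return self.tests_pass

    def delete_cluster(self, cluster_id: str) -> None:
        self.steps.append("delete")
```

Running `check_upgrade(RecordingBackend())` exercises the steps in the order shown on the slide. Starting from the base branch means the test covers the upgrade path itself, not only a fresh installation of the new version.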
  20. ARCHITECTURE DECISIONS

    • One prod. cluster per AWS account / “product”
    • API server behind SSL ELB, OAuth webhook
    • Read only access to production
    • CI/CD for write access
    • etcd running separately on EC2
    • Multi AZ clusters
  21. LESSONS LEARNED

    TECHNICAL
    - Back up your etcd (and monitor your backups)
    - Be careful using t instances in production
    - Disable CPU throttling (CFS quota) to avoid latency issues

    NON-TECHNICAL
    - Avoid friction for the users
    - Try to create a system that is “compliant by design”
    - Create a community inside the company
  22. OPEN SOURCE

  23. INGRESS CONTROLLER https://github.com/zalando-incubator/kube-ingress-aws-controller

  24. POSTGRES OPERATOR github.com/zalando-incubator/postgres-operator

    [Diagram labels: one node with pg role=master, two nodes with pg role=replica, one node with the postgres operator; evict ✘; evict pg role=replica; promote; role=master; role=replica; drain ✓]
  25. KUBERNETES RESOURCE REPORT github.com/hjacobs/kube-resource-report

  26. KUBERNETES DOWNSCALER github.com/hjacobs/kube-downscaler [Chart label: Weekend]
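
The "Weekend" label suggests workloads being scaled down outside working days. The Python sketch below shows that idea under stated assumptions. It is not kube-downscaler's implementation: `is_weekend` and `desired_replicas` are hypothetical helpers, and scaling to zero replicas on Saturday and Sunday is an assumption made for illustration.

```python
from datetime import datetime


def is_weekend(now: datetime) -> bool:
    """True on Saturday and Sunday (datetime.weekday() counts Monday as 0)."""
    return now.weekday() >= 5


def desired_replicas(original_replicas: int, now: datetime) -> int:
    """Assumed policy: zero replicas on weekends, the original count otherwise."""
    return 0 if is_weekend(now) else original_replicas
```

For example, `desired_replicas(3, datetime(2018, 9, 15))` evaluates a Saturday, while `datetime(2018, 9, 17)` is a Monday.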

  27. OPEN SOURCE

    • Kubernetes on AWS: github.com/zalando-incubator/kubernetes-on-aws
    • AWS ALB Ingress controller: github.com/zalando-incubator/kube-ingress-aws-controller
    • Skipper HTTP Router & Ingress controller: github.com/zalando/skipper
    • External DNS: github.com/kubernetes-incubator/external-dns
    • Postgres Operator: github.com/zalando-incubator/postgres-operator
    • Kubernetes Resource Report: github.com/hjacobs/kube-resource-report
    • Kubernetes Downscaler: github.com/hjacobs/kube-downscaler
  28. QUESTIONS? JANNIS RAKE-REVELANT ENGINEERING LEAD TEAM TEAPOT (COMPUTE) jannis.rake-revelant@zalando.de @jannis_r