Slide 1

Slide 1 text

Kubernetes at Zalando
DEVOPS MEETUP LISBON
JANNIS RAKE-REVELANT
@jannis_r
2018-09-17

Slide 2

Slide 2 text

ZALANDO AT A GLANCE
• ~ 4.5 billion EUR revenue 2017
• > 200 million visits per month
• > 15.000 employees in Europe
• > 70% of visits via mobile devices
• > 23 million active customers
• > 300.000 product choices
• ~ 2.000 brands
• 15 countries

Slide 3

Slide 3 text

AGENDA
• Overview of Kubernetes at Zalando
• Architecture
• Open-Source

Slide 4

Slide 4 text

ZALANDO TECH
• ~ 2.000 Employees in Tech
• > 200 Delivery teams

Slide 5

Slide 5 text

SCALE
• 99 Clusters
• 380 Accounts

Slide 6

Slide 6 text

INFRASTRUCTURE @ ZALANDO
STUPS (toolset around AWS)
• AWS accounts per team.
• All instances must run the same AMI.
• PowerUser access to Production.
• You build it, you run EVERYTHING.
Kubernetes
• Clusters per product (multiple teams).
• Instances are not managed by teams.
• Hands off approach.
• A lot of stuff out of the box.

Slide 7

Slide 7 text

CURRENT STATUS
• Kubernetes announced as "GA"
• ADOPT on Zalando Tech Radar
• Critical services on Kubernetes
• Next: Kubernetes as default, sunset date for STUPS

Slide 8

Slide 8 text

DEVELOPERS USING KUBERNETES

Slide 9

Slide 9 text

KUBERNETES DEPLOYMENTS
[Chart: deployments via CDP/Kubernetes vs. AWS/STUPS, Dec 2017 to July 2018; labeled values: 55%, 29%]

Slide 10

Slide 10 text

ARCHITECTURE

Slide 11

Slide 11 text

“PHILOSOPHY”
• No pet clusters: We don’t want to tweak custom settings for 90+ clusters.
• Always provide the latest stable Kubernetes version: Oldest clusters were upgraded from v1.4 through v1.11.
• Continuous and non-disruptive cluster updates: No maintenance windows.
• “Fully” automated operations: Operators should only need to manually merge PRs.

Slide 12

Slide 12 text

CLUSTER SETUP
● Provisioned in AWS via CloudFormation.
● Etcd stack outside Kubernetes.
● Distribution: Container Linux.
● Multi AZ worker nodes.
● Highly available control plane setup behind ELB.
● Cluster configuration stored in git.
● e2e tests run via Jenkins.
● Changes rolled out via ‘Cluster Lifecycle Manager’.

Slide 13

Slide 13 text

CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager

Slide 14

Slide 14 text

CLUSTER UPGRADE FLOW

Slide 15

Slide 15 text

CLUSTER CHANNELS
github.com/zalando-incubator/kubernetes-on-aws
Channel | Description | Clusters
dev | Development and playground clusters. | 3
alpha | Main infrastructure cluster (important to us). | 1
beta | Product clusters for the rest of the organization (prod/test). | 90+
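
The channel list reads as a staged rollout: few low-risk clusters first, the bulk of product clusters last. A minimal Python sketch of that ordering follows. It assumes changes are promoted strictly dev, then alpha, then beta; the slide does not spell out the order, and `next_channel` is a hypothetical helper, not part of kubernetes-on-aws.

```python
# Channels in the ASSUMED promotion order (dev -> alpha -> beta).
CHANNELS = ["dev", "alpha", "beta"]


def next_channel(channel):
    """Return the channel a change is promoted to next, or None at the end."""
    index = CHANNELS.index(channel)  # raises ValueError for unknown channels
    if index + 1 < len(CHANNELS):
        return CHANNELS[index + 1]
    return None


print(next_channel("dev"))
print(next_channel("beta"))
```

Keeping the order in one list means adding a channel later only touches that list.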

Slide 16

Slide 16 text

E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws

Slide 17

Slide 17 text

E2E TESTS
• Conformance Tests: Upstream Kubernetes e2e conformance tests. 144 ✓
• Zalando Tests (custom): Custom tests for ingress, external-dns, PSP etc. 4 ✓
• StatefulSet Tests: Rolling update of stateful sets including volume mounting. 2 ✓

Slide 19

Slide 19 text

E2E TESTS FOR OUR INFRASTRUCTURE
Testing dev to alpha upgrade
1. Create Cluster (branch: alpha, base)
2. Update Cluster (branch: dev, head)
3. Run e2e tests
4. Delete Cluster
[Diagram: control plane and two nodes, shown before and after the update]
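
The upgrade test on this slide can be sketched in Python as a create, update, test, delete sequence. This is only an illustration under assumptions: the four callables are hypothetical stand-ins for whatever the real pipeline invokes, and deleting the cluster even after a failure is an assumption, not something the slide states.

```python
def check_upgrade(base, head, create, update, run_e2e, delete):
    """Create a cluster from `base`, update it to `head`, run e2e tests.

    The cluster is deleted even if the update or the tests raise.
    """
    cluster = create(base)
    try:
        update(cluster, head)
        return run_e2e(cluster)
    finally:
        delete(cluster)


# Hypothetical stand-ins that only record what would happen.
steps = []


def fake_create(branch):
    steps.append(f"create cluster from {branch}")
    return {"branch": branch}


def fake_update(cluster, branch):
    steps.append(f"update cluster to {branch}")
    cluster["branch"] = branch


def fake_run_e2e(cluster):
    steps.append(f"run e2e tests on {cluster['branch']}")
    return True


def fake_delete(cluster):
    steps.append("delete cluster")


passed = check_upgrade("alpha", "dev", fake_create, fake_update,
                       fake_run_e2e, fake_delete)
print(passed)
for step in steps:
    print(step)
```

Starting from the base branch rather than from a fresh head cluster is what makes this an upgrade test: it exercises the same transition that existing clusters will go through.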

Slide 20

Slide 20 text

ARCHITECTURE DECISIONS
• One prod. cluster per AWS account / “product”
• API server behind SSL ELB, OAuth webhook
• Read only access to production
• CI/CD for write access
• etcd running separately on EC2
• Multi AZ clusters

Slide 21

Slide 21 text

LESSONS LEARNED
TECHNICAL
- Back up your etcd (and monitor your backups)
- Be careful using t instances in production
- Disable CPU throttling (CFS quota) to avoid latency issues
NON-TECHNICAL
- Avoid friction for the users
- Try to create a system that is "compliant by design"
- Create a community inside the company
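
"Monitor your backups" can be as simple as alerting when the newest backup is too old. The Python sketch below shows that check in isolation. The one-hour threshold and the function name are assumptions for illustration; where the timestamp of the last backup comes from (for example object storage metadata) is left out.

```python
from datetime import datetime, timedelta, timezone


def backup_is_stale(last_backup, now, max_age=timedelta(hours=1)):
    """Return True if the newest backup is older than `max_age`.

    `max_age` defaults to one hour, an arbitrary value chosen for this sketch.
    """
    return now - last_backup > max_age


now = datetime(2018, 9, 17, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2018, 9, 17, 11, 30, tzinfo=timezone.utc)
old = datetime(2018, 9, 17, 9, 0, tzinfo=timezone.utc)

print(backup_is_stale(fresh, now))
print(backup_is_stale(old, now))
```

Checking the age of the result, rather than whether the backup job ran, also catches jobs that run but silently produce nothing.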

Slide 22

Slide 22 text

OPEN SOURCE

Slide 23

Slide 23 text

INGRESS CONTROLLER
https://github.com/zalando-incubator/kube-ingress-aws-controller

Slide 24

Slide 24 text

POSTGRES OPERATOR
github.com/zalando-incubator/postgres-operator
[Diagram: three nodes each running a pg pod (one role=master, two role=replica) and one node running the postgres operator. Labels: evict ✘; promote (role=replica to role=master, old master to role=replica); drain ✓]
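
The diagram labels suggest that a node holding the Postgres master is not drained until a replica elsewhere has been promoted. The Python sketch below models only that ordering. It is an illustration, not the operator's implementation; the pod and node names, the data layout and `plan_drain` are all hypothetical.

```python
def plan_drain(pods, node):
    """Return the ordered actions needed before `node` can be drained.

    `pods` maps a pod name to {"node": ..., "role": "master" | "replica"}.
    If the master runs on `node`, a replica on another node is promoted first.
    """
    actions = []
    on_node = [name for name, pod in pods.items() if pod["node"] == node]
    for name in on_node:
        if pods[name]["role"] == "master":
            candidates = [
                other for other, pod in pods.items()
                if pod["role"] == "replica" and pod["node"] != node
            ]
            if not candidates:
                raise RuntimeError("no replica available to promote")
            actions.append(("promote", candidates[0]))
    actions.extend(("evict", name) for name in on_node)
    return actions


pods = {
    "pg-0": {"node": "node-a", "role": "master"},
    "pg-1": {"node": "node-b", "role": "replica"},
    "pg-2": {"node": "node-c", "role": "replica"},
}
print(plan_drain(pods, "node-a"))
print(plan_drain(pods, "node-b"))
```

Draining a node that only holds a replica needs no promotion, so the plan for `node-b` contains a single evict action.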

Slide 25

Slide 25 text

KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report

Slide 26

Slide 26 text

KUBERNETES DOWNSCALER
github.com/hjacobs/kube-downscaler
[Chart with a "Weekend" annotation]
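
The "Weekend" label points at the idea behind the downscaler: workloads that nobody uses outside working hours can be scaled to zero. The Python sketch below shows that decision only. It is not kube-downscaler's code; the function name and the Monday to Friday, 08:00 to 20:00 window are assumptions chosen for illustration.

```python
from datetime import datetime, time


def desired_replicas(now, original_replicas, start=time(8, 0), end=time(20, 0)):
    """Return the replica count a workload should have at `now`.

    Outside the assumed Mon-Fri `start`..`end` window the answer is 0.
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Saturday is 5
    in_hours = start <= now.time() < end
    return original_replicas if (is_weekday and in_hours) else 0


print(desired_replicas(datetime(2018, 9, 17, 12, 0), 3))  # Monday noon
print(desired_replicas(datetime(2018, 9, 15, 12, 0), 3))  # Saturday noon
print(desired_replicas(datetime(2018, 9, 17, 23, 0), 3))  # Monday night
```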

Slide 27

Slide 27 text

OPEN SOURCE
• Kubernetes on AWS: github.com/zalando-incubator/kubernetes-on-aws
• AWS ALB Ingress controller: github.com/zalando-incubator/kube-ingress-aws-controller
• Skipper HTTP Router & Ingress controller: github.com/zalando/skipper
• External DNS: github.com/kubernetes-incubator/external-dns
• Postgres Operator: github.com/zalando-incubator/postgres-operator
• Kubernetes Resource Report: github.com/hjacobs/kube-resource-report
• Kubernetes Downscaler: github.com/hjacobs/kube-downscaler

Slide 28

Slide 28 text

QUESTIONS?
JANNIS RAKE-REVELANT
ENGINEERING LEAD TEAM TEAPOT (COMPUTE)
jannis.rake-revelant@zalando.de
@jannis_r