Slide 1

Slide 1 text

Preparing guardrails for Istio at scale

Slide 2

Slide 2 text

About me
Raphael Fraysse
SRE at Mercari, microservices platform team
Twitter: @la1nra
GitHub: @lainra

Slide 3

Slide 3 text

Today’s agenda
● Mercari in few numbers
● Refining the Istio maintenance experience
● Making Istio changes observable
● Protecting Istio and its users from themselves

Slide 4

Slide 4 text

Mercari in few numbers

Slide 5

Slide 5 text

Mercari in few numbers
● 150+ microservices (150+ namespaces)
● 100K RPS at peak on API Gateway
● 1 main production Google Kubernetes Engine (GKE) cluster
● 300+ developers
● 3k+ pods

Slide 6

Slide 6 text

Refining the Istio maintenance experience

Slide 7

Slide 7 text

How to install and configure Istio for production use?
[Figure: timeline of Istio versions 1.0 to 1.6 with the installation methods available over time: Helm-based, kubectl apply, istioctl, Istio Operator, istioctl upgrade]

Slide 8

Slide 8 text

Our approach to adopting Istio
[Figure: Microservices Migration and Istio Feature Adoption (enabling Istio in namespaces gradually)]

Slide 9

Slide 9 text

We started with Istio around 1.1
● Helm was the only installation option available
  ○ We were afraid of Istio instability + Helm magic
➔ Decided to apply manifests ourselves to have at least a sense of control
  ◆ Generate manifests from Helm template
  ◆ Manually review and merge the PR, then apply all manifests
  ◆ Only use minimal components (Pilot, Sidecar Injector)

Slide 10

Slide 10 text

It was very painful...
➔ Terrible review cost
➔ Error-prone
➔ Lead time for a minor version upgrade was too long

Slide 11

Slide 11 text

We had to wait until Istio 1.4 to reconsider it
● istioctl graduated
  ○ Uses a CustomResource to declare Istio state (IstioControlPlane)
● istioctl experimental upgrade
  ○ New tool to streamline changes in Istio state
  ○ Still experimental though (in 1.4)
● Istio Operator released (experimental), but
  ○ State reconciliation is scary
  ○ We are used to applying state (Terraform)
➔ We’d better use istioctl for changing/upgrading our installation

Slide 12

Slide 12 text

How to convert our process to the IstioControlPlane format?
● Migrating Helm’s values.yaml
  ○ istioctl manifest migrate values.yaml > istiocontrolplane.yaml
  ○ This is not straightforward! Some values are deprecated/modified, e.g.

      certManager:
        enabled: false
      ingressGateways:
      - name: istio-ingressgateway
        enabled: false
      egressGateways:
      - name: istio-egressgateway
        enabled: false

● Making sure to disable all unused components
● Moving some `values.global` parameters inside k8s for future upward compatibility (Helm values deprecation)

Slide 13

Slide 13 text

How to apply the new configuration?

➜ ./bin/istioctl experimental upgrade -f ./refined-istio-ICP.yaml
Client - istioctl version: 1.4.3
Upgrade - target version: 1.4.3
Control Plane - citadel pod - istio-citadel-6f4659b5d8-f7gh6 - version: 1.4.3
Control Plane - pilot pod - istio-pilot-768bc95fbd-26xd7 - version: 1.4.3
Control Plane - pilot pod - istio-pilot-768bc95fbd-6wshp - version: 1.4.3
Control Plane - pilot pod - istio-pilot-768bc95fbd-v6vmh - version: 1.4.3
Control Plane - sidecar-injector pod - istio-sidecar-injector-85c665d7-6xlqw - version: 1.4.3
Control Plane - sidecar-injector pod - istio-sidecar-injector-85c665d7-mzrsd - version: 1.4.3
Control Plane - sidecar-injector pod - istio-sidecar-injector-85c665d7-wmk6f - version: 1.4.3
Upgrade version check passed: 1.4.3 -> 1.4.3.
Upgrade check: Warning!!! The following values will be changed as part of upgrade. If you have not overridden these values, they will change in your cluster. Please double check they are correct:

Lots of diffs!

Slide 14

Slide 14 text

Same process for changes or upgrades
1. `--dry-run` the istioctl upgrade command
2. Attach the diff output to the PR along with the configuration changes
3. Get it reviewed, approved and merged
4. Apply the merged configuration using istioctl without `--dry-run`
5. Pray for things to not break unexpectedly

Slide 15

Slide 15 text

[Figure: sample PR body]

Slide 16

Slide 16 text

After moving to the new process, our lead time for upgrading versions was shortened by 300% and we gained much more confidence in changing/upgrading Istio more often.

Slide 17

Slide 17 text

Takeaways
● Helm is now deprecated; istioctl is much easier to use (though it still has some nits)
● kubectl apply is too costly to maintain, use istioctl instead
● Helm configuration needs to be converted to istioctl CRDs (IstioControlPlane, IstioOperator)
● Combine versioning with istioctl to get a safer environment

Slide 18

Slide 18 text

OK, so now we have applied our changes!
But... wait a minute, how can we know if everything’s going fine?

Slide 19

Slide 19 text

Making Istio changes observable

Slide 20

Slide 20 text

Understand what happens when Istio changes
1. Monitoring Istio
  ○ Prometheus / Grafana / Jaeger
  ○ Datadog, Lightstep, etc.

Slide 21

Slide 21 text

Monitoring Istio
● The Istio native stack (Prometheus/Grafana/Jaeger) is rich and its default dashboards are helpful
  ○ Fits most use cases
● However, YMMV, especially when you already use an observability solution

Slide 22

Slide 22 text

Do you want to expose new dashboards and a new UI to your users?
In many cases, integrating with an existing solution is the cheapest decision, especially in terms of UX.
➔ We integrated Istio monitoring with our existing Datadog.
https://docs.datadoghq.com/integrations/istio/

Slide 23

Slide 23 text


Slide 24

Slide 24 text

Understand what happens when Istio changes
1. Monitoring Istio
  ○ Prometheus / Grafana / Jaeger / Zipkin
  ○ Datadog, Lightstep, etc.
2. What do we need to observe?

Slide 25

Slide 25 text

What do we need to observe?
● Control plane (istiod, istio-pilot and co)
● Data plane (Envoy proxies)

Slide 26

Slide 26 text

Control plane (istiod, istio-pilot and co)
● USE (Utilization, Saturation, Errors)
  ○ CPU
  ○ Memory
  ○ Concurrency
● RED (Rate, Errors, Duration)
  ○ Config pushes count
  ○ Config pushes duration
  ○ Config pushes errors

Slide 27

Slide 27 text

Data plane (Envoy proxies)
● USE (Utilization, Saturation, Errors)
  ○ CPU
  ○ Memory
  ○ Concurrency
● RED (Rate, Errors, Duration)
  ○ Requests count
  ○ Requests error count
  ○ Requests duration
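
As an illustration of the RED side of this list, here is a minimal sketch that summarises rate, errors and duration for one observation window. It is not taken from the talk: the `Request` type, the `red_summary` helper, the choice of counting 5xx responses as errors and the nearest-rank p95 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def red_summary(requests: Sequence[Request], window_s: float) -> Dict[str, float]:
    """Summarise Rate, Errors and Duration for one observation window."""
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95 using integer arithmetic: index = ceil(0.95 * n) - 1.
    p95 = durations[(95 * total + 99) // 100 - 1] if total else 0.0
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }
```

In practice these numbers would come from the metrics backend (Prometheus, Datadog) rather than raw request samples; the sketch only shows what each of the three RED signals measures.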

Slide 28

Slide 28 text

It’s easier to observe the control plane
● One component vs many proxies in the data plane
➔ We can regroup all metrics into one dashboard for the control plane

Slide 29

Slide 29 text

How to observe the data plane
● Having an overview of sidecars in the same dashboard as the control plane helps
  ○ Number of sidecars in the cluster
  ○ Average CPU usage per sidecar
  ○ Average memory usage per sidecar
  ○ Heat maps to check outliers
But it only covers USE. What about RED?

Slide 30

Slide 30 text

How to observe the data plane
● Checking all traffic for all proxies is impossible
  ○ Most metrics are application-specific
  ○ Configuration may change per namespace
  ○ Regrouping into one dashboard is hard
How can we solve it? Let’s define basic acceptance tests!

Slide 31

Slide 31 text

Acceptance tests
● Need to be
  ○ Simple
  ○ Easy to observe
  ○ Covering main scenarios

Slide 32

Slide 32 text

[Figure: acceptance test setup with loadtester pods and gRPC, HTTP/1.1 and headless services, some with a sidecar proxy and some without]

Slide 33

Slide 33 text

Acceptance tests
● Ideally, based on existing services
● Or you can replicate a miniature of your main services and test against it
➔ We do a mix of both
Generate light traffic continuously against these and get your success rates
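
The "light traffic plus success rate" idea can be sketched as follows. This is only an illustration under assumptions: `success_rates`, `failing` and the 0.99 threshold are hypothetical names and values, and each probe is an injected callable standing in for a real gRPC, HTTP/1.1 or headless-service request.

```python
from typing import Callable, Dict, List


def success_rates(probes: Dict[str, Callable[[], bool]], attempts: int) -> Dict[str, float]:
    """Call each probe `attempts` times and return its success ratio."""
    rates = {}
    for name, probe in probes.items():
        ok = 0
        for _ in range(attempts):
            try:
                if probe():
                    ok += 1
            except Exception:
                # A raised error (timeout, connection reset) counts as a failure.
                pass
        rates[name] = ok / attempts
    return rates


def failing(rates: Dict[str, float], threshold: float = 0.99) -> List[str]:
    """Names of scenarios whose success ratio is below the threshold."""
    return sorted(name for name, rate in rates.items() if rate < threshold)
```

Keeping the probes as plain callables means the same loop covers every scenario in the test matrix, and the resulting ratios can be pushed to whichever dashboard already shows the control plane.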

Slide 34

Slide 34 text

All green!

Slide 35

Slide 35 text

Takeaways
● Choose the observability stack for Istio while keeping cost in mind
● The control plane is easier to observe than the data plane
● Simplify data plane observability with acceptance tests
● Keeping a centralized place to observe changes makes observing them easier

Slide 36

Slide 36 text

Protecting Istio and its users from themselves

Slide 37

Slide 37 text

Shared service mesh
● The service mesh is common to all users
● Any change to it spreads across the whole mesh
  ○ Any misconfiguration spreads too, be it intentional or not
Humans are error-prone, and both users and operators are humans, so:
Errors will happen, with a large blast radius!

Slide 38

Slide 38 text

Misconfiguration example
[Diagram: Client, Ingress Gateway and payment v1 pods with sidecar proxies; 100% of traffic]

VirtualService
  spec:
    hosts:
    - payment.payment-prod.svc.cluster.local
    http:
    - name: payment
      route:
      - destination:
          host: payment.payment-prod.svc.cluster.local
          subset: v1

Slide 39

Slide 39 text

Misconfiguration example
[Diagram: Client, Ingress Gateway and payment v1 pods with sidecar proxies; 100% of traffic]

VirtualService
  spec:
    hosts:
    - payment.payment-prod.svc.cluster.local
    http:
    - name: payment
      route:
      - destination:
          host: payment.payment-prod.svc.cluster.local
          subset: v2

Slide 40

Slide 40 text

Misconfiguration example
[Diagram: Client, Ingress Gateway and payment v1 pods with sidecar proxies; 100% of traffic; "payment v2???"]

VirtualService
  spec:
    hosts:
    - payment.payment-prod.svc.cluster.local
    http:
    - name: payment
      route:
      - destination:
          host: payment.payment-prod.svc.cluster.local
          subset: v2

Slide 41

Slide 41 text

Misconfiguration example
[Diagram: Client, Ingress Gateway and payment v1 pods with sidecar proxies; 0% of traffic; "payment v2???" also at 0%]

VirtualService
  spec:
    hosts:
    - payment.payment-prod.svc.cluster.local
    http:
    - name: payment
      route:
      - destination:
          host: payment.payment-prod.svc.cluster.local
          subset: v2

Slide 42

Slide 42 text

Oops, we just dropped all our production payment traffic!!!
Let’s calm down, it hasn’t happened to you yet (hopefully).
We will explain how to prevent it from happening in the next part.

Slide 43

Slide 43 text

How can we mitigate errors and their impact?
1. Understand how the resources are changed
2. Create appropriate safeguards

Slide 44

Slide 44 text


Slide 45

Slide 45 text


Slide 46

Slide 46 text


Slide 47

Slide 47 text


Slide 48

Slide 48 text

[Figure: Linters]

Slide 49

Slide 49 text

[Figure: Linters, Admission Controllers]

Slide 50

Slide 50 text

[Figure: Linters and Admission Controllers along a feedback loop axis]

Slide 51

Slide 51 text

[Figure: Linters and Admission Controllers along a feedback loop axis, with one end marked Best]

Slide 52

Slide 52 text

[Figure: Linters and Admission Controllers along a feedback loop axis ranging from Best to Worst]

Slide 53

Slide 53 text

Linters (conftest, stein)
● Early feedback in the PR if the user misconfigured/missed something
● Check the configuration with validation logic

Slide 54

Slide 54 text

Admission Controllers (gatekeeper)
● Late feedback after the configuration is merged (at apply time); a new PR is needed to fix
● Check the configuration with validation logic + reconcile with existing config

Slide 55

Slide 55 text

Open Policy Agent (OPA)
- Common language: Rego
- Linters: conftest
- Admission Controllers: gatekeeper
Rego policies can be used by both.
Gatekeeper can cache Kubernetes resources (inventory) to compare input with existing resources
➔ This is an important point for Istio policies
https://github.com/open-policy-agent/opa#-open-policy-agent

Slide 56

Slide 56 text

Rules
1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames

Slide 57

Slide 57 text

Rule 1: Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
Enforces ownership of defined paths under one namespace and prevents path hijacking

Allowed:
  namespace: namespace-a
  host: service-a.namespace-a.svc.cluster.local

Rejected:
  namespace: namespace-a
  host: service-a.namespace-b.svc.cluster.local

Slide 58

Slide 58 text

Rule 1: Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
Enforces ownership of defined paths under one namespace and prevents path hijacking

Allowed:
  namespace: namespace-a
  host: service-a.namespace-a.svc.cluster.local

Rejected:
  namespace: namespace-a
  host: service-a.namespace-b.svc.cluster.local

We enforce FQDN notation with a combined rule to support this.
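
A minimal sketch of the check behind Rule 1, written in plain Python rather than the Rego that conftest and gatekeeper use. The function names are hypothetical, and the `svc.cluster.local` suffix is assumed from the hostnames on the slide. A short name such as `service-a` is treated as a violation, mirroring the FQDN requirement.

```python
from typing import Optional


def host_namespace(host: str) -> Optional[str]:
    """Return the namespace of a `<service>.<namespace>.svc.cluster.local` host.

    Returns None when the host is not written in that FQDN form.
    """
    parts = host.split(".")
    if len(parts) == 5 and parts[2:] == ["svc", "cluster", "local"] and all(parts[:2]):
        return parts[1]
    return None


def violates_local_host(resource_namespace: str, host: str) -> bool:
    """True when the host is not an FQDN inside the resource's own namespace."""
    return host_namespace(host) != resource_namespace
```

Because the check only needs the resource itself, it can run at linter level in the PR as well as in the admission controller.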

Slide 59

Slide 59 text

Rules
1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
2. Restrict in-use DestinationRule subsets modification

Slide 60

Slide 60 text

Rule 2: Restrict in-use DestinationRule subsets modification
Prevents traffic disruption caused by modifying subsets that existing VirtualServices use

DestinationRule
  spec:
    host: test.default.svc.cluster.local
    subsets:
    - name: v1
      labels:
        version: v1

VirtualService
  spec:
    hosts:
    - test.default.svc.cluster.local
    http:
    - name: default
      route:
      - destination:
          host: test.default.svc.cluster.local
          subset: v1

Slide 61

Slide 61 text

Rule 2: Restrict in-use DestinationRule subsets modification
Prevents traffic disruption caused by modifying subsets that existing VirtualServices use

DestinationRule
  spec:
    host: test.default.svc.cluster.local
    subsets:
    - name: v1 -> v2
      labels:
        version: v1 -> v2

VirtualService
  spec:
    hosts:
    - test.default.svc.cluster.local
    http:
    - name: default
      route:
      - destination:
          host: test.default.svc.cluster.local
          subset: v1
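
The logic of Rule 2 can be sketched as a comparison between the old and new DestinationRule, restricted to the subsets that some VirtualService routes to. This is an illustrative Python version, not the Rego policy from the talk; the helper names are hypothetical and resources are assumed to be plain dictionaries shaped like the YAML on the slide.

```python
from typing import Dict, List, Set


def in_use_subsets(host: str, virtual_services: List[dict]) -> Set[str]:
    """Subset names that existing VirtualServices route to for this host."""
    used = set()
    for vs in virtual_services:
        for http in vs.get("spec", {}).get("http", []):
            for route in http.get("route", []):
                dest = route.get("destination", {})
                if dest.get("host") == host and "subset" in dest:
                    used.add(dest["subset"])
    return used


def subset_map(destination_rule: dict) -> Dict[str, dict]:
    """Map each subset name to its labels."""
    subsets = destination_rule.get("spec", {}).get("subsets", [])
    return {s["name"]: s.get("labels", {}) for s in subsets}


def blocked_subset_changes(old_dr: dict, new_dr: dict, virtual_services: List[dict]) -> List[str]:
    """In-use subsets that the new DestinationRule renames, removes or relabels."""
    old, new = subset_map(old_dr), subset_map(new_dr)
    used = in_use_subsets(old_dr["spec"]["host"], virtual_services)
    return sorted(name for name in used if name in old and new.get(name) != old[name])
```

The list of VirtualServices is the part a linter cannot see from a single PR diff, which is why this kind of rule depends on the inventory that gatekeeper caches.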

Slide 62

Slide 62 text

Rules
1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
2. Restrict in-use DestinationRule subsets modification
3. Restrict configurable fields in VirtualServices and DestinationRules

Slide 63

Slide 63 text

Rule 3: Restrict configurable fields in VirtualServices and DestinationRules
Prevents unsupported fields from being modified in VirtualServices and DestinationRules

DestinationRule
  spec:
    host: test.default.svc.cluster.local
    - name: v1
      labels:
        version: v1

Whitelist
- spec.host
- spec.host.name
- spec.host.labels
- spec.host.labels.version

Slide 64

Slide 64 text

Rule 3: Restrict configurable fields in VirtualServices and DestinationRules
Prevents unsupported fields from being used in VirtualServices and DestinationRules

DestinationRule
  spec:
    host: test.default.svc.cluster.local
    - name: v1
      labels:
        version: v1
    exportTo: ["*"]

Whitelist
- spec.host
- spec.host.name
- spec.host.labels
- spec.host.labels.version
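
One way to picture Rule 3 is to flatten a resource into dotted field paths and reject any path missing from an allowlist. The sketch below is an assumption-laden illustration in Python, not the talk's Rego policy: `field_paths` and `disallowed_fields` are hypothetical helpers, and the allowlist in the usage example is illustrative rather than copied from the slide.

```python
from typing import Iterable, Iterator, List


def field_paths(value, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every leaf field; list items share their parent's path."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from field_paths(child, prefix)
    else:
        yield prefix


def disallowed_fields(resource: dict, allowlist: Iterable[str]) -> List[str]:
    """Field paths present in the resource but absent from the allowlist."""
    return sorted(set(field_paths(resource)) - set(allowlist))


# Illustrative usage: `exportTo` is not in the allowlist, so it is reported.
example = {
    "spec": {
        "host": "test.default.svc.cluster.local",
        "subsets": [{"name": "v1", "labels": {"version": "v1"}}],
        "exportTo": ["*"],
    }
}
allowed = {"spec.host", "spec.subsets.name", "spec.subsets.labels.version"}
violations = disallowed_fields(example, allowed)
```

Note that this sketch only looks at leaf values, so an empty list or empty mapping under a disallowed key would pass unnoticed; a production policy would need to handle that case.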

Slide 65

Slide 65 text

Rules
1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
2. Restrict in-use DestinationRule subsets modification
3. Restrict configurable fields in VirtualServices and DestinationRules
4. Restrict VirtualServices from using non-existing subsets

Slide 66

Slide 66 text

Rule 4: Restrict VirtualServices from using non-existing subsets
Prevents traffic disruption caused by sinking traffic into a non-existing subset

DestinationRule
  spec:
    host: test.default.svc.cluster.local
    subsets:
    - name: v1
      labels:
        version: v1

VirtualService
  spec:
    hosts:
    - test.default.svc.cluster.local
    http:
    - name: default
      route:
      - destination:
          host: test.default.svc.cluster.local
          subset: v1

Slide 67

Slide 67 text

Rule 4: Restrict VirtualServices from using non-existing subsets
Prevents traffic disruption caused by sinking traffic into a non-existing subset

DestinationRule
  spec:
    host: test.default.svc.cluster.local
    subsets:
    - name: v1
      labels:
        version: v1

VirtualService
  spec:
    hosts:
    - test.default.svc.cluster.local
    http:
    - name: default
      route:
      - destination:
          host: test.default.svc.cluster.local
          subset: v2
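
Rule 4 amounts to a lookup: every subset a VirtualService routes to must be defined by a DestinationRule for the same host. Here is a minimal Python sketch of that lookup, assuming resources are dictionaries shaped like the YAML above; `missing_subsets` is a hypothetical name and the talk's actual policy is written in Rego.

```python
from typing import Dict, List, Set, Tuple


def missing_subsets(virtual_service: dict, destination_rules: List[dict]) -> List[Tuple[str, str]]:
    """Return (host, subset) pairs the VirtualService routes to but no DestinationRule defines."""
    defined: Dict[str, Set[str]] = {}
    for dr in destination_rules:
        spec = dr.get("spec", {})
        names = {s["name"] for s in spec.get("subsets", [])}
        defined.setdefault(spec.get("host"), set()).update(names)

    missing = []
    for http in virtual_service.get("spec", {}).get("http", []):
        for route in http.get("route", []):
            dest = route.get("destination", {})
            subset = dest.get("subset")
            # Routes without a subset target the whole service and are always fine.
            if subset is not None and subset not in defined.get(dest.get("host"), set()):
                missing.append((dest.get("host"), subset))
    return missing
```

This is the check that would have caught the payment example earlier in the deck, where the route pointed at `v2` while only `v1` existed.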

Slide 68

Slide 68 text

Rules
1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
2. Restrict in-use DestinationRule subsets modification
3. Restrict configurable fields in VirtualServices and DestinationRules
4. Restrict VirtualServices from using non-existing subsets
5. Restrict duplicate hosts in VirtualServices and DestinationRules

Slide 69

Slide 69 text

Rule 5: Restrict duplicate hosts in VirtualServices and DestinationRules
Prevents unexpected/unpredictable traffic hijacking caused by having multiple VirtualServices and DestinationRules for the same host

VirtualService
  spec:
    hosts:
    - test.default.svc.cluster.local
    http:
    - name: default
      route:
      - destination:
          host: test.default.svc.cluster.local
          subset: v1

VirtualService
  spec:
    hosts:
    - test-v2.default.svc.cluster.local
    http:
    - name: default
      route:
      - destination:
          host: test.default.svc.cluster.local
          subset: v1

Slide 70

Slide 70 text

Rule 5: Restrict duplicate hosts in VirtualServices and DestinationRules
Prevents unexpected traffic hijacking caused by multiple VirtualServices and DestinationRules having the same host.

VirtualService
  spec:
    hosts:
    - test.default.svc.cluster.local
    http:
    - name: default
      route:
      - destination:
          host: test.default.svc.cluster.local
          subset: v1

VirtualService
  spec:
    hosts:
    - test.default.svc.cluster.local
    http:
    - name: default
      route:
      - destination:
          host: test.default.svc.cluster.local
          subset: v1
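
Detecting the duplicate shown above comes down to grouping resources by kind and host and flagging any group with more than one owner. The following Python sketch illustrates that grouping under assumptions: `duplicate_hosts` is a hypothetical helper, resources are dictionaries with `kind`, `metadata.name` and `spec`, and the real policy in the talk is expressed in Rego against gatekeeper's inventory.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def duplicate_hosts(resources: List[dict]) -> Dict[Tuple[str, str], List[str]]:
    """Map (kind, host) to resource names when more than one resource claims the host."""
    owners = defaultdict(list)
    for res in resources:
        kind = res["kind"]
        spec = res.get("spec", {})
        if kind == "VirtualService":
            hosts = spec.get("hosts", [])
        elif kind == "DestinationRule":
            hosts = [spec["host"]] if "host" in spec else []
        else:
            continue  # other kinds are outside the scope of this rule
        for host in hosts:
            owners[(kind, host)].append(res["metadata"]["name"])
    return {key: names for key, names in owners.items() if len(names) > 1}
```

Grouping per kind matters here: one VirtualService and one DestinationRule for the same host is the normal pairing, while two resources of the same kind for one host is the conflict the rule rejects.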

Slide 71

Slide 71 text

Takeaways
● Make all Istio changes through versioning, with a PR process
● Enforce the use of Continuous Integration tools
● Leverage linters to catch issues at CI-level, keeping the feedback loop short
● Leverage admission webhooks to
  ○ protect the resources
  ○ check what cannot be checked at linter-level (inventory)

Slide 72

Slide 72 text

Summary

Slide 73

Slide 73 text

Takeaways
● Guardrails are for both operators and users
● Operators
  ○ Maintenance processes
  ○ Observability
  ○ Acceptance testing
● Users
  ○ Limit the exposure of resources
  ○ Make safe policies at both CI and cluster level
  ○ Keep a good user experience to retain your users with short feedback loops

Slide 74

Slide 74 text

Thank you for coming!

Slide 75

Slide 75 text

[Figure: Linters and Admission Controllers along a feedback loop axis ranging from Best to Worst]

Slide 76

Slide 76 text

How can we mitigate errors and their impact?
● Make all changes through versioning, with PR
  ○ Alone, it is not enough since reviewers are humans
● Enforce the use of Continuous Integration tools
  ○ Changes can only be made there
● Leverage linters to catch issues at CI-level
● Leverage admission webhooks to protect the resources

Slide 77

Slide 77 text

Misconfiguration example
[Diagram: Client, Ingress Gateway, payment v1 pods and a payment v2 pod, each with a sidecar proxy; 100% of traffic]