Preparing guardrails for Istio at scale

At Mercari, we started implementing Istio a year ago in our microservices environment of 100+ services. By adding a few features one step at a time, we managed to make it work, but only for a handful of services.
A few months ago, taking Istio to the next step required a lot of work on processes and guardrails, both to keep users from being left in the wild, potentially harming themselves and others, and to simplify service mesh maintenance.

This talk explains what we did to make a safe multi-tenant service mesh a reality, more specifically:

- How we migrated Istio maintenance from plain manifests to istioctl
- How we created continuous pseudo-acceptance tests to validate Istio after changes
- The rules we use to protect users and Istio, using Gatekeeper and GitOps

Raphael Fraysse

June 13, 2020

Transcript

  1. 2 About me

    Raphael Fraysse • SRE at Mercari microservices platform team • Twitter: @la1nra • GitHub: @lainra
  2. 3 Today’s agenda

    • Mercari in a few numbers • Refining the Istio maintenance experience • Making Istio changes observable • Protecting Istio and its users from themselves
  3. 5 Mercari in a few numbers

    • 150+ microservices (150+ namespaces) • 100K RPS at peak on API Gateway • 1 main production Google Kubernetes Engine (GKE) cluster • 300+ developers • 3k+ pods
  4. 7 How to install and configure Istio for production use? (Refining the Istio maintenance experience)

    [Timeline figure: Istio versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 against install methods: Helm-based, kubectl apply, istioctl, Istio Operator, Helm-based, istioctl upgrade]
  5. 8 Our approach to adopting Istio (Refining the Istio maintenance experience)

    [Figure labels: Microservices Migration, Istio Feature Adoption (Enabling Istio in namespaces gradually)]
  6. 9 We started Istio around 1.1 (Refining the Istio maintenance experience)

    • Only the Helm option was available ◦ Afraid of Istio instability + Helm magic ➔ Decided to apply manifests ourselves to have at least a sense of control ◆ Generate manifests from the Helm template ◆ Manually review and merge the PR, then apply all manifests ◆ Only use minimal components (Pilot, Sidecar Injector)
  7. 10 It was very painful... Refining the Istio maintenance experience

    ➔ Terrible review cost ➔ Error-prone ➔ Lead time for minor version upgrade is too long
  8. 11 Had to wait until Istio 1.4 to reconsider it

    Refining the Istio maintenance experience • istioctl graduated ◦ Using CustomResource to declare Istio state (IstioControlPlane) • istioctl experimental upgrade ◦ New tool to streamline changes in Istio state ◦ Still experimental though (in 1.4) • Istio Operator released (experimental) but ◦ State reconciliation is scary ◦ We are used to applying state (Terraform) ➔ We’d better use istioctl for changing/upgrading our installation
  9. 12 How to convert our process to the IstioControlPlane format? (Refining the Istio maintenance experience)

    • Migrating Helm’s values.yaml ◦ istioctl manifest migrate values.yaml > istiocontrolplane.yaml ◦ This is not straightforward! Some values are deprecated/modified, e.g. certManager: enabled: false ingressGateways: - name: istio-ingressgateway enabled: false egressGateways: - name: istio-egressgateway enabled: false • Making sure to disable all unused components • Moving some `values.global` parameters inside k8s for future upward compatibility (Helm values deprecation)
  10. 13 How to apply the new configuration? (Refining the Istio maintenance experience)

    ➜ ./bin/istioctl experimental upgrade -f ./refined-istio-ICP.yaml
    Client - istioctl version: 1.4.3
    Upgrade - target version: 1.4.3
    Control Plane - citadel pod - istio-citadel-6f4659b5d8-f7gh6 - version: 1.4.3
    Control Plane - pilot pod - istio-pilot-768bc95fbd-26xd7 - version: 1.4.3
    Control Plane - pilot pod - istio-pilot-768bc95fbd-6wshp - version: 1.4.3
    Control Plane - pilot pod - istio-pilot-768bc95fbd-v6vmh - version: 1.4.3
    Control Plane - sidecar-injector pod - istio-sidecar-injector-85c665d7-6xlqw - version: 1.4.3
    Control Plane - sidecar-injector pod - istio-sidecar-injector-85c665d7-mzrsd - version: 1.4.3
    Control Plane - sidecar-injector pod - istio-sidecar-injector-85c665d7-wmk6f - version: 1.4.3
    Upgrade version check passed: 1.4.3 -> 1.4.3.
    Upgrade check: Warning!!! The following values will be changed as part of upgrade. If you have not overridden these values, they will change in your cluster. Please double check they are correct:
    Lots of diffs!
  11. 14 Same process for changes or upgrades (Refining the Istio maintenance experience)

    1. `--dry-run` the istioctl upgrade command 2. Attach the diff output to the PR along with the configuration changes 3. Get it reviewed, approved and merged 4. Apply the merged configuration using istioctl without `--dry-run` 5. Pray that things do not break unexpectedly
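
The dry-run and apply steps above differ only by one flag. A minimal Python sketch of that, assuming the `./bin/istioctl` path and the `./refined-istio-ICP.yaml` file name from the previous slide; the helper `build_upgrade_command` is a hypothetical name, and the sketch only builds the command line without running it:

```python
from typing import List


def build_upgrade_command(config_path: str, dry_run: bool,
                          istioctl: str = "./bin/istioctl") -> List[str]:
    """Return the argv for `istioctl experimental upgrade`.

    With dry_run=True the command is meant to only print the diff (steps 1-2);
    with dry_run=False it applies the merged configuration (step 4).
    """
    argv = [istioctl, "experimental", "upgrade", "-f", config_path]
    if dry_run:
        argv.append("--dry-run")
    return argv


if __name__ == "__main__":
    config = "./refined-istio-ICP.yaml"  # file name taken from the slide above
    print(" ".join(build_upgrade_command(config, dry_run=True)))
    print(" ".join(build_upgrade_command(config, dry_run=False)))
```

Keeping both invocations behind one helper makes it harder for the reviewed dry-run and the applied command to drift apart.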
  12. 16 Refining the Istio maintenance experience

    After moving to the new process, our lead time for upgrading versions was shortened by 300% and we gained much more confidence in changing/upgrading Istio more often.
  13. 17 Takeaways Refining the Istio maintenance experience • Helm is

    now deprecated, istioctl is much easier to use (but still some nits) • kubectl apply is too costly to maintain, use istioctl instead • Need to convert Helm configuration to istioctl CRDs (IstioControlPlane, IstioOperator) • Combine versioning with istioctl to get a safer environment
  14. 18 Ok so now we applied our changes! Refining the

    Istio maintenance experience But... wait a minute, how can we know if everything’s going fine?
  15. 20 Understand what happens when Istio changes Making Istio changes

    observable 1. Monitoring Istio ◦ Prometheus / Grafana / Jaeger ◦ Datadog, Lightstep, etc...
  16. 21 Monitoring Istio Making Istio changes observable • Istio native

    stack (Prometheus/Grafana/Jaeger) is rich and default dashboards are helpful ◦ Fits most use cases • However, YMMV, especially when already using an observability solution
  17. 22 Making Istio changes observable

    Do you want to expose new dashboards and a new UI to your users? In many cases, integrating with an existing solution is the cheapest decision, especially for the UX. ➔ We integrated Istio monitoring with our existing Datadog. https://docs.datadoghq.com/integrations/istio/
  18. 24 Understand what happens when Istio changes Making Istio changes

    observable 1. Monitoring Istio ◦ Prometheus / Grafana / Jaeger / Zipkin ◦ Datadog, Lightstep, etc… 2. What do we need to observe?
  19. 25 What do we need to observe? Making Istio changes

    observable • Control plane (istiod, istio-pilot and co) • Data plane (Envoy proxies)
  20. 26 Control plane (istiod, istio-pilot and co) Making Istio changes

    observable • USE (Utilization, Saturation, Errors) ◦ CPU ◦ Memory ◦ Concurrency • RED (Rate, Errors, Duration) ◦ Config pushes count ◦ Config pushes duration ◦ Config pushes errors
  21. 27 Data plane (Envoy proxies) Making Istio changes observable •

    USE (Utilization, Saturation, Errors) ◦ CPU ◦ Memory ◦ Concurrency • RED (Rate, Errors, Duration) ◦ Requests count ◦ Requests error count ◦ Requests duration
  22. 28 It’s easier to observe the control plane Making Istio

    changes observable • One component vs many proxies in the data plane ➔ We can regroup all metrics into one dashboard for the control plane
  23. 29 How to observe the data plane Making Istio changes

    observable • Having an overview of sidecars in the same dashboard as control plane helps ◦ Number of sidecars in the cluster ◦ Average CPU usage per sidecar ◦ Average Memory usage per sidecar ◦ Heat maps to check outliers But it only covers USE. What about RED?
  24. 30 How to observe the data plane Making Istio changes

    observable • Checking all traffic for all proxies is impossible ◦ Most metrics are application-specific ◦ Configuration may change per namespace ◦ Regrouping into one dashboard is hard How can we solve it? Let’s define basic acceptance tests!
  25. 31 Acceptance tests Making Istio changes observable • Need to

    be ◦ Simple ◦ Easy to observe ◦ Covering main scenarios
  26. 32 Making Istio changes observable

    [Diagram labels: loadtester, gRPC service, HTTP/1.1 service, Headless service, Proxy]
  27. 33 Acceptance tests Making Istio changes observable • Ideally, based

    on existing services • Or you can replicate a miniature of your main services and test against it ➔ We do a mix of both Generate light traffic continuously against these and get your success rates
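
Generating light traffic and deriving a success rate per scenario can be sketched as below. This is a minimal Python illustration, not the loadtester used in the talk: `run_probes` and `success_rate` are hypothetical names, and a probe is assumed to be any zero-argument callable that returns True on success (for example, a wrapper around one gRPC or HTTP/1.1 request).

```python
from typing import Callable, Dict, List


def success_rate(results: List[bool]) -> float:
    """Fraction of successful probes in `results`."""
    if not results:
        raise ValueError("no probe results to aggregate")
    return sum(results) / len(results)


def run_probes(probes: Dict[str, Callable[[], bool]], rounds: int) -> Dict[str, float]:
    """Call each probe `rounds` times and return its success rate.

    A probe that raises an exception is counted as a failure.
    """
    outcomes: Dict[str, List[bool]] = {name: [] for name in probes}
    for _ in range(rounds):
        for name, probe in probes.items():
            try:
                outcomes[name].append(bool(probe()))
            except Exception:
                outcomes[name].append(False)
    return {name: success_rate(results) for name, results in outcomes.items()}
```

Reporting one number per scenario (gRPC, HTTP/1.1, headless) keeps the result easy to place on the same dashboard as the control plane metrics.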
  28. 35 Takeaways (Making Istio changes observable)

    • Choose the observability stack for Istio while keeping cost in mind • The control plane is easier to observe than the data plane • Simplify data plane observability with acceptance tests • Keeping a centralized place to observe changes makes them easier to follow
  29. 37 Shared service mesh (Protecting Istio and its users from themselves)

    • The service mesh is common to all users • Any change to it spreads across the whole mesh ◦ Any misconfiguration spreads too, be it intentional or not Humans are error-prone, and both users and operators are humans, so errors will happen, with a large blast radius!
  30. 38 Misconfiguration example (Protecting Istio and its users from themselves)

    [Diagram: Client, Ingress Gateway, payment v1 pods and their proxies; 100% of traffic goes to payment v1]

    VirtualService
      spec:
        hosts:
        - payment.payment-prod.svc.cluster.local
        http:
        - name: payment
          route:
          - destination:
              host: payment.payment-prod.svc.cluster.local
              subset: v1
  31. 39 Misconfiguration example (Protecting Istio and its users from themselves)

    [Diagram: Client, Ingress Gateway, payment v1 pods and their proxies; 100% of traffic goes to payment v1]

    VirtualService
      spec:
        hosts:
        - payment.payment-prod.svc.cluster.local
        http:
        - name: payment
          route:
          - destination:
              host: payment.payment-prod.svc.cluster.local
              subset: v2
  32. 40 Misconfiguration example (Protecting Istio and its users from themselves)

    [Diagram: Client, Ingress Gateway, payment v1 pods and their proxies; 100% of traffic goes to payment v1; payment v2???]

    VirtualService
      spec:
        hosts:
        - payment.payment-prod.svc.cluster.local
        http:
        - name: payment
          route:
          - destination:
              host: payment.payment-prod.svc.cluster.local
              subset: v2
  33. 41 Misconfiguration example (Protecting Istio and its users from themselves)

    [Diagram: Client, Ingress Gateway, payment v1 pods and their proxies; 0% of traffic goes to payment v1; payment v2??? 0%]

    VirtualService
      spec:
        hosts:
        - payment.payment-prod.svc.cluster.local
        http:
        - name: payment
          route:
          - destination:
              host: payment.payment-prod.svc.cluster.local
              subset: v2
  34. 42 Oops, we just dropped all our production payment traffic!!! (Protecting Istio and its users from themselves)

    Let’s calm down, it hasn’t happened to you yet (hopefully). We will explain how to prevent it from happening in the next part.
  35. 43 How can we mitigate errors and their impact? Protecting

    Istio and its users from themselves 1. Understand how the resources are changed 2. Create appropriate safeguards
  36. 53 Linters (conftest, stein) Protecting Istio and its users from

    themselves • Early feedback in the PR if the user misconfigured/missed something • Checks the configuration with validation logic
  37. 54 Admission Controllers (gatekeeper) Protecting Istio and its users from

    themselves • Late feedback after configuration merge (apply), need new PR to fix • Checks the configuration with validation logic + reconcile with existing config
  38. 55 Open Policy Agent (OPA) Protecting Istio and its users

    from themselves - Common language: Rego - Linters: conftest - Admission Controllers: gatekeeper Rego policies can be used by both. Gatekeeper can cache Kubernetes resources (inventory) to compare input with existing resources ➔ This is an important point for Istio policies https://github.com/open-policy-agent/opa#-open-policy-agent
  39. 56 Rules Protecting Istio and its users from themselves 1.

    Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
  40. 57 Rule 1: Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames (Protecting Istio and its users from themselves)

    Enforces ownership of defined paths under one namespace and prevents path hijacking

    namespace: namespace-a
    host: service-a.namespace-a.svc.cluster.local

    namespace: namespace-a
    host: service-a.namespace-b.svc.cluster.local
  41. 58 Rule 1: Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames (Protecting Istio and its users from themselves)

    Enforces ownership of defined paths under one namespace and prevents path hijacking

    namespace: namespace-a
    host: service-a.namespace-a.svc.cluster.local

    namespace: namespace-a
    host: service-a.namespace-b.svc.cluster.local

    We enforce FQDN notation with a combined rule to support this.
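
The talk expresses its rules as Rego policies shared by conftest and Gatekeeper. As a language-neutral illustration only, the check behind Rule 1 can be sketched in Python. The function name `host_violations` is hypothetical, and the sketch assumes the cluster domain is `svc.cluster.local`, as in the slide's examples:

```python
from typing import List

CLUSTER_SUFFIX = "svc.cluster.local"  # assumed cluster domain, as on the slide


def host_violations(namespace: str, hosts: List[str]) -> List[str]:
    """Return the hosts that are not FQDNs inside `namespace`.

    An accepted host looks like <service>.<namespace>.svc.cluster.local.
    """
    expected_tail = f".{namespace}.{CLUSTER_SUFFIX}"
    bad = []
    for host in hosts:
        service = host[: -len(expected_tail)] if host.endswith(expected_tail) else ""
        # The service part must be a single, non-empty DNS label.
        if not service or "." in service:
            bad.append(host)
    return bad
```

Requiring the full FQDN is what makes the namespace comparison reliable: a short name such as `service-a` carries no namespace to compare against, so it is reported as a violation too.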
  42. 59 Rules Protecting Istio and its users from themselves 1.

    Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames 2. Restrict in-use DestinationRule subsets modification
  43. 60 Rule 2: Restrict in-use DestinationRule subsets modification (Protecting Istio and its users from themselves)

    Prevents traffic disruption caused by subset modification when used by existing VirtualServices

    DestinationRule
      spec:
        host: test.default.svc.cluster.local
        subsets:
        - name: v1
          labels:
            version: v1

    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1
  44. 61 Rule 2: Restrict in-use DestinationRule subsets modification (Protecting Istio and its users from themselves)

    Prevents traffic disruption caused by subset modification when used by existing VirtualServices

    DestinationRule
      spec:
        host: test.default.svc.cluster.local
        subsets:
        - name: v1 -> v2
          labels:
            version: v1 -> v2

    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1
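
Rule 2 needs the existing VirtualServices (the Gatekeeper inventory) to know which subsets are in use. A minimal Python sketch of that comparison follows; it is not the Rego policy from the talk, the function names are hypothetical, resources are assumed to be plain dicts parsed from YAML, and only `http` routes are inspected for brevity:

```python
from typing import Any, Dict, List, Set

Resource = Dict[str, Any]


def referenced_subsets(host: str, virtual_services: List[Resource]) -> Set[str]:
    """Subsets of `host` that existing VirtualServices route to."""
    used: Set[str] = set()
    for vs in virtual_services:
        for http in vs.get("spec", {}).get("http", []):
            for route in http.get("route", []):
                dest = route.get("destination", {})
                if dest.get("host") == host and "subset" in dest:
                    used.add(dest["subset"])
    return used


def in_use_subset_changes(old_dr: Resource, new_dr: Resource,
                          virtual_services: List[Resource]) -> List[str]:
    """Names of in-use subsets that the update removes, renames or relabels."""
    host = old_dr["spec"]["host"]
    old = {s["name"]: s.get("labels", {}) for s in old_dr["spec"].get("subsets", [])}
    new = {s["name"]: s.get("labels", {}) for s in new_dr["spec"].get("subsets", [])}
    used = referenced_subsets(host, virtual_services)
    # A missing subset yields None from new.get(), which never equals the old labels.
    return sorted(name for name in used if name in old and new.get(name) != old[name])
```

An empty result means the update is safe with respect to this rule; any returned name is a subset that a VirtualService still depends on.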
  45. 62 Rules Protecting Istio and its users from themselves 1.

    Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames 2. Restrict in-use DestinationRule subsets modification 3. Restrict configurable fields in VirtualServices and DestinationRules
  46. 63 Rule 3: Restrict configurable fields in VirtualServices and DestinationRules (Protecting Istio and its users from themselves)

    Prevents unsupported fields from being modified in VirtualServices and DestinationRules DestinationRule spec: host: test.default.svc.cluster.local - name: v1 labels: version: v1 Whitelist - spec.host - spec.host.name - spec.host.labels - spec.host.labels.version
  47. 64 Rule 3: Restrict configurable fields in VirtualServices and DestinationRules (Protecting Istio and its users from themselves)

    Prevents unsupported fields from being used in VirtualServices and DestinationRules DestinationRule spec: host: test.default.svc.cluster.local - name: v1 labels: version: v1 exportTo: ["*"] Whitelist - spec.host - spec.host.name - spec.host.labels - spec.host.labels.version
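
A field whitelist like the one above can be checked by flattening the resource into dotted paths and subtracting the allowed set. The Python sketch below illustrates the idea only. `field_paths` and `disallowed_fields` are hypothetical names, and the allowlist used in the usage note is illustrative and does not reproduce the list on the slide:

```python
from typing import Any, Dict, Iterable, List, Set


def field_paths(value: Any, prefix: str) -> Set[str]:
    """Collect the dotted path of every leaf under `value`.

    List indices are skipped, so all items of a list share one path.
    Empty dicts and lists count as leaves, so they are still reported.
    """
    if isinstance(value, dict) and value:
        children = [(f"{prefix}.{key}", child) for key, child in value.items()]
    elif isinstance(value, list) and value:
        children = [(prefix, child) for child in value]
    else:
        return {prefix}
    paths: Set[str] = set()
    for path, child in children:
        paths |= field_paths(child, path)
    return paths


def disallowed_fields(resource: Dict[str, Any], allowed: Iterable[str]) -> List[str]:
    """Paths under `spec` that are not in the allowlist, sorted for stable output."""
    return sorted(field_paths(resource.get("spec", {}), "spec") - set(allowed))
```

With an allowlist of `spec.host`, `spec.subsets.name` and `spec.subsets.labels.version`, a DestinationRule that also sets `exportTo` would be reported with the single path `spec.exportTo`.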
  48. 65 Rules Protecting Istio and its users from themselves 1.

    Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames 2. Restrict in-use DestinationRule subsets modification 3. Restrict configurable fields in VirtualServices and DestinationRules 4. Restrict VirtualServices from using non-existing subsets
  49. 66 Rule 4: Restrict VirtualServices from using non-existing subsets (Protecting Istio and its users from themselves)

    Prevents traffic disruption caused by sinking traffic into a non-existing subset

    DestinationRule
      spec:
        host: test.default.svc.cluster.local
        subsets:
        - name: v1
          labels:
            version: v1

    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1
  50. 67 Rule 4: Restrict VirtualServices from using non-existing subsets (Protecting Istio and its users from themselves)

    Prevents traffic disruption caused by sinking traffic into a non-existing subset

    DestinationRule
      spec:
        host: test.default.svc.cluster.local
        subsets:
        - name: v1
          labels:
            version: v1

    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v2
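
Rule 4 is the guardrail against the payment misconfiguration shown earlier: every subset a VirtualService routes to must be defined by a DestinationRule for the same host. A minimal Python sketch of that lookup, again not the Rego policy itself; `missing_subsets` is a hypothetical name, resources are assumed to be dicts parsed from YAML, and only `http` routes are inspected:

```python
from typing import Any, Dict, List

Resource = Dict[str, Any]


def missing_subsets(virtual_service: Resource,
                    destination_rules: List[Resource]) -> List[str]:
    """Return "host/subset" for every routed subset no DestinationRule defines."""
    defined = {
        (dr["spec"]["host"], subset["name"])
        for dr in destination_rules
        for subset in dr["spec"].get("subsets", [])
    }
    missing = []
    for http in virtual_service.get("spec", {}).get("http", []):
        for route in http.get("route", []):
            dest = route.get("destination", {})
            subset = dest.get("subset")
            if subset is not None and (dest.get("host"), subset) not in defined:
                missing.append(f"{dest.get('host')}/{subset}")
    return missing
```

Because the set of defined subsets comes from resources already in the cluster, this is the kind of check that belongs in the admission controller rather than in a linter that only sees the changed file.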
  51. 68 Rules Protecting Istio and its users from themselves 1.

    Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames 2. Restrict in-use DestinationRule subsets modification 3. Restrict configurable fields in VirtualServices and DestinationRules 4. Restrict VirtualServices from using non-existing subsets 5. Restrict duplicate hosts in VirtualServices and DestinationRules
  52. 69 Rule 5: Restrict duplicate hosts in VirtualServices and DestinationRules (Protecting Istio and its users from themselves)

    Prevents unexpected/unpredictable traffic hijacking caused by having multiple VirtualServices and DestinationRules for the same host

    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1

    VirtualService
      spec:
        hosts:
        - test-v2.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1
  53. 70 Rule 5: Restrict duplicate hosts in VirtualServices and DestinationRules (Protecting Istio and its users from themselves)

    Prevents unexpected traffic hijacking caused by multiple VirtualServices and DestinationRules having the same host.

    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1

    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1
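
Detecting duplicate hosts amounts to grouping resources by host and flagging any host with more than one owner. The Python sketch below shows this for VirtualServices; it is an illustration rather than the talk's Rego policy, `duplicate_hosts` is a hypothetical name, and the `metadata.name` values in any usage are assumed since the slides only show `spec`. The same grouping would apply to the single `spec.host` of DestinationRules.

```python
from collections import defaultdict
from typing import Any, Dict, List

Resource = Dict[str, Any]


def duplicate_hosts(virtual_services: List[Resource]) -> Dict[str, List[str]]:
    """Map each host claimed by more than one VirtualService to its claimants."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for vs in virtual_services:
        name = vs["metadata"]["name"]
        for host in vs["spec"].get("hosts", []):
            owners[host].append(name)
    return {host: names for host, names in owners.items() if len(names) > 1}
```

Like Rule 4, this check compares the incoming resource with what already exists in the cluster, so it relies on the cached inventory at admission time.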
  54. 71 Takeaways Protecting Istio and its users from themselves •

    Make all Istio changes through versioning, with PR process • Enforce the use of Continuous Integration tools • Leverage linters to catch issues at CI-level, keeping the feedback loop short • Leverage admission webhooks to ◦ protect the resources ◦ check what cannot be checked at linter-level (inventory)
  55. 73 Takeaways Summary • Guardrails are for both operators and

    users • Operators ◦ Maintenance processes ◦ Observability ◦ Acceptance testing • Users ◦ Limit the exposure of resources ◦ Make safe policies at both CI and cluster level ◦ Keep a good user experience to retain your users with short feedback loops
  56. 76 How can we mitigate errors and their impact? Protecting

    Istio and its users from themselves • Make all changes through versioning, with PR ◦ Alone, it is not enough since reviewers are humans • Enforce the use of Continuous Integration tools ◦ Changes can only be made there • Leverage linters to catch issues at CI-level • Leverage admission webhooks to protect the resources
  57. 77 Misconfiguration example (Protecting Istio and its users from themselves)

    [Diagram: Client, Ingress Gateway, payment v1 and payment v2 pods and their proxies; 100% traffic label]