Preparing guardrails for Istio at scale

At Mercari, we started implementing Istio a year ago in our microservices environment of 100+ services. By adding a few features one step at a time, we managed to make it work, but only for a handful of services.
A few months ago, taking Istio to the next step required a lot of work on processes and guardrails, both to keep users from being left in the wild, where they could harm themselves and others, and to simplify service mesh maintenance.

This talk explains what we did to make a safe multi-tenant service mesh a reality, more specifically:

- How we migrated Istio maintenance from plain manifests to istioctl
- How we created continuous pseudo-acceptance tests to validate Istio after changes
- Which rules we use to protect users and Istio with Gatekeeper and GitOps

Raphael Fraysse

June 13, 2020

Transcript

  1. Preparing guardrails for Istio at scale

  2. About me
    Raphael Fraysse
    SRE at Mercari, microservices platform team
    Twitter: @la1nra
    GitHub: @lainra

  3. Today’s agenda
    ● Mercari in a few numbers
    ● Refining the Istio maintenance experience
    ● Making Istio changes observable
    ● Protecting Istio and its users from themselves

  4. Mercari in a few numbers

  5. Mercari in a few numbers
    ● 150+ microservices (150+ namespaces)
    ● 100K RPS at peak on the API Gateway
    ● 1 main production Google Kubernetes Engine (GKE) cluster
    ● 300+ developers
    ● 3k+ pods

  6. Refining the Istio maintenance experience

  7. How to install and configure Istio for production use?
    [Figure: timeline of Istio installation methods across versions 1.0 to 1.6, labeled Helm-based, kubectl apply, istioctl, Istio Operator, and istioctl upgrade]

  8. Our approach to adopting Istio
    [Figure: Microservices Migration and Istio Feature Adoption (enabling Istio in namespaces gradually)]

  9. We started Istio around 1.1
    ● Helm was the only installation option available
    ○ We were afraid of Istio instability combined with Helm magic
    ➔ We decided to apply manifests ourselves to keep at least a sense of control
    ◆ Generate manifests from the Helm template
    ◆ Manually review and merge the PR, then apply all manifests
    ◆ Use only the minimal components (Pilot, Sidecar Injector)

  10. It was very painful...
    ➔ Terrible review cost
    ➔ Error-prone
    ➔ Lead time for minor version upgrades was too long

  11. We had to wait until Istio 1.4 to reconsider it
    ● istioctl graduated
    ○ Uses a CustomResource to declare the Istio state (IstioControlPlane)
    ● istioctl experimental upgrade
    ○ New tool to streamline changes to the Istio state
    ○ Still experimental though (in 1.4)
    ● Istio Operator released (experimental), but
    ○ State reconciliation is scary
    ○ We are used to applying state (Terraform)
    ➔ We decided to use istioctl for changing/upgrading our installation

  12. How to convert our process to the IstioControlPlane format?
    ● Migrating Helm’s values.yaml
    ○ istioctl manifest migrate values.yaml > istiocontrolplane.yaml
    ○ This is not straightforward! Some values are deprecated/modified, e.g.
      certManager:
        enabled: false
      ingressGateways:
      - name: istio-ingressgateway
        enabled: false
      egressGateways:
      - name: istio-egressgateway
        enabled: false
    ● Making sure to disable all unused components
    ● Moving some `values.global` parameters inside k8s for future upward compatibility (Helm values deprecation)

  13. How to apply the new configuration?
    ➜ ./bin/istioctl experimental upgrade -f ./refined-istio-ICP.yaml
    Client - istioctl version: 1.4.3
    Upgrade - target version: 1.4.3
    Control Plane - citadel pod - istio-citadel-6f4659b5d8-f7gh6 - version: 1.4.3
    Control Plane - pilot pod - istio-pilot-768bc95fbd-26xd7 - version: 1.4.3
    Control Plane - pilot pod - istio-pilot-768bc95fbd-6wshp - version: 1.4.3
    Control Plane - pilot pod - istio-pilot-768bc95fbd-v6vmh - version: 1.4.3
    Control Plane - sidecar-injector pod - istio-sidecar-injector-85c665d7-6xlqw - version: 1.4.3
    Control Plane - sidecar-injector pod - istio-sidecar-injector-85c665d7-mzrsd - version: 1.4.3
    Control Plane - sidecar-injector pod - istio-sidecar-injector-85c665d7-wmk6f - version: 1.4.3
    Upgrade version check passed: 1.4.3 -> 1.4.3.
    Upgrade check: Warning!!! The following values will be changed as part of upgrade. If you have not overridden these values, they will change in your cluster. Please double check they are correct:
    Lots of diffs!

  14. Same process for changes or upgrades
    1. Run the istioctl upgrade command with `--dry-run`
    2. Attach the diff output to the PR along with the configuration changes
    3. Get it reviewed, approved, and merged
    4. Apply the merged configuration using istioctl without `--dry-run`
    5. Pray for things to not break unexpectedly
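
The dry-run-then-apply flow above could be scripted roughly as follows. This is a minimal sketch, not the tooling used at Mercari: `build_upgrade_command` and `pr_body` are hypothetical helpers, and only the `istioctl experimental upgrade -f ... --dry-run` invocation and the file name come from the slides.

```python
from typing import List


def build_upgrade_command(config_path: str, dry_run: bool = True,
                          istioctl: str = "./bin/istioctl") -> List[str]:
    """Build the argv for the upgrade command shown in the slides (Istio 1.4)."""
    cmd = [istioctl, "experimental", "upgrade", "-f", config_path]
    if dry_run:
        # Step 1: preview only; the output is attached to the PR for review.
        cmd.append("--dry-run")
    return cmd


def pr_body(diff_output: str) -> str:
    """Wrap the dry-run output for the PR description (hypothetical format)."""
    return "## istioctl upgrade --dry-run output\n\n" + diff_output.strip() + "\n"


# Step 1 (before review) and step 4 (after merge) differ only by the flag.
preview = build_upgrade_command("./refined-istio-ICP.yaml")
apply = build_upgrade_command("./refined-istio-ICP.yaml", dry_run=False)
print(" ".join(preview))
print(" ".join(apply))
```

Keeping both invocations behind one helper makes it harder for the reviewed command and the applied command to drift apart.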

  15. Sample PR body

  16. After moving to the new process, our lead time for upgrading versions
    was shortened by 300% and we gained much more confidence in
    changing/upgrading Istio more often.

  17. Takeaways
    ● Helm-based installation is now deprecated; istioctl is much easier to use (but still has some nits)
    ● kubectl apply is too costly to maintain, use istioctl instead
    ● Helm configuration needs to be converted to the istioctl custom resources (IstioControlPlane, IstioOperator)
    ● Combine version control with istioctl to get a safer environment

  18. Ok so now we applied our changes!
    But... wait a minute, how can we know if everything is going fine?

  19. Making Istio changes observable

  20. Understand what happens when Istio changes
    1. Monitoring Istio
    ○ Prometheus / Grafana / Jaeger
    ○ Datadog, Lightstep, etc.

  21. Monitoring Istio
    ● The Istio native stack (Prometheus/Grafana/Jaeger) is rich, and its default dashboards are helpful
    ○ Fits most use cases
    ● However, YMMV, especially when you already use an observability solution

  22. Do you want to expose new dashboards and a new UI to your users?
    In many cases, integrating with an existing solution is the cheapest decision, especially for the UX.
    ➔ We integrated Istio monitoring with our existing Datadog setup.
    https://docs.datadoghq.com/integrations/istio/

  24. Understand what happens when Istio changes
    1. Monitoring Istio
    ○ Prometheus / Grafana / Jaeger / Zipkin
    ○ Datadog, Lightstep, etc.
    2. What do we need to observe?

  25. What do we need to observe?
    ● Control plane (istiod, istio-pilot and co)
    ● Data plane (Envoy proxies)

  26. Control plane (istiod, istio-pilot and co)
    ● USE (Utilization, Saturation, Errors)
    ○ CPU
    ○ Memory
    ○ Concurrency
    ● RED (Rate, Errors, Duration)
    ○ Config pushes count
    ○ Config pushes duration
    ○ Config pushes errors

  27. Data plane (Envoy proxies)
    ● USE (Utilization, Saturation, Errors)
    ○ CPU
    ○ Memory
    ○ Concurrency
    ● RED (Rate, Errors, Duration)
    ○ Requests count
    ○ Requests error count
    ○ Requests duration
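
To make the RED signals for the data plane concrete, here is a minimal sketch that derives them from a window of request samples. The `Request` record, the "5xx counts as an error" convention, and the choice of a p95 duration are assumptions for illustration; in practice these numbers would come from the proxies' metrics rather than from raw samples.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int         # HTTP status code returned to the caller
    duration_ms: float  # time spent serving the request


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Compute Rate, Errors, and Duration over one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumed error rule
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1] if total else 0.0
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }


samples = [Request(200, 10.0), Request(200, 20.0), Request(503, 30.0), Request(200, 40.0)]
summary = red_summary(samples, window_s=2.0)
print(summary)
```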

  28. It’s easier to observe the control plane
    ● One component vs. many proxies in the data plane
    ➔ We can group all metrics into one dashboard for the control plane

  29. How to observe the data plane
    ● Having an overview of sidecars in the same dashboard as the control plane helps
    ○ Number of sidecars in the cluster
    ○ Average CPU usage per sidecar
    ○ Average memory usage per sidecar
    ○ Heat maps to check outliers
    But it only covers USE.
    What about RED?

  30. How to observe the data plane
    ● Checking all traffic for all proxies is impossible
    ○ Most metrics are application-specific
    ○ Configuration may change per namespace
    ○ Grouping everything into one dashboard is hard
    How can we solve it?
    Let’s define basic acceptance tests!

  31. Acceptance tests
    ● Need to be
    ○ Simple
    ○ Easy to observe
    ○ Covering the main scenarios

  32. [Figure: acceptance test topology, with loadtester pods calling gRPC, HTTP/1.1, and headless services, with a proxy on both sides or on one side only]

  33. Acceptance tests
    ● Ideally, based on existing services
    ● Or you can replicate a miniature of your main services and test against it
    ➔ We do a mix of both
    Generate light traffic continuously against these and get your success rates
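
The idea of generating light traffic and tracking success rates can be sketched as below. This is an illustrative outline only: `run_probes` and the scenario names are hypothetical, and the probes are stand-in functions rather than real gRPC or HTTP clients.

```python
from typing import Callable, Dict


def run_probes(probes: Dict[str, Callable[[], bool]], attempts: int) -> Dict[str, float]:
    """Call each probe `attempts` times and return the success rate per scenario."""
    rates: Dict[str, float] = {}
    for name, probe in probes.items():
        ok = 0
        for _ in range(attempts):
            try:
                if probe():
                    ok += 1
            except Exception:
                pass  # a raised error counts as a failed request
        rates[name] = ok / attempts if attempts else 0.0
    return rates


# Stand-in probes: one always succeeds, one fails every fourth call.
calls = {"n": 0}


def flaky() -> bool:
    calls["n"] += 1
    return calls["n"] % 4 != 0


rates = run_probes(
    {"grpc-proxy-to-proxy": lambda: True, "http1-proxy-to-proxy": flaky},
    attempts=8,
)
print(rates)
```

Running such a loop continuously and plotting one success rate per scenario gives a small, fixed set of signals to watch during an Istio change.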

  34. All green!

  35. Takeaways
    ● Choose the observability stack for Istio while keeping cost in mind
    ● The control plane is easier to observe than the data plane
    ● Simplify data plane observability with acceptance tests
    ● Keeping a centralized place to observe changes makes them easier to follow

  36. Protecting Istio and its users from themselves

  37. Shared service mesh
    ● The service mesh is common to all users
    ● Any change to it spreads across the whole mesh
    ○ Any misconfiguration spreads too, be it intentional or not
    Humans are error-prone, and both users and operators are humans, so:
    Errors will happen, with a large blast radius!

  38. Misconfiguration example
    [Figure: Client -> Ingress Gateway -> payment v1 pods with proxies, receiving 100% of the traffic]
    VirtualService
      spec:
        hosts:
        - payment.payment-prod.svc.cluster.local
        http:
        - name: payment
          route:
          - destination:
              host: payment.payment-prod.svc.cluster.local
              subset: v1

  39. Misconfiguration example
    [Figure: same topology, payment v1 still receiving 100% of the traffic]
    VirtualService
      spec:
        hosts:
        - payment.payment-prod.svc.cluster.local
        http:
        - name: payment
          route:
          - destination:
              host: payment.payment-prod.svc.cluster.local
              subset: v2

  40. Misconfiguration example
    [Figure: same topology, with a "payment v2???" destination added]
    VirtualService: unchanged from the previous slide (subset: v2)

  41. Misconfiguration example
    [Figure: same topology, payment v1 now receiving 0% of the traffic and "payment v2???" receiving 0%]
    VirtualService: unchanged from the previous slide (subset: v2)

  42. Oops, we just dropped all our production payment traffic!!!
    Let’s calm down, it hasn’t happened to you yet (hopefully).
    We will explain how to prevent it from happening in the next part.

  43. How can we mitigate errors and their impact?
    1. Understand how the resources are changed
    2. Create appropriate safeguards

  52. [Figure: feedback loop, from best (linters) to worst (admission controllers)]

  53. Linters (conftest, stein)
    ● Early feedback in the PR if the user misconfigured or missed something
    ● Check the configuration with validation logic

  54. Admission Controllers (gatekeeper)
    ● Late feedback, after the configuration is merged (at apply time); a new PR is needed to fix it
    ● Check the configuration with validation logic, plus comparison against the existing configuration

  55. Open Policy Agent (OPA)
    - Common language: Rego
    - Linters: conftest
    - Admission Controllers: gatekeeper
    Rego policies can be used by both.
    Gatekeeper can cache Kubernetes resources (inventory) to compare the input with existing resources
    ➔ This is an important point for Istio policies
    https://github.com/open-policy-agent/opa#-open-policy-agent

  56. Rules
    1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames

  58. Rule 1: Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
    Enforces ownership of defined paths within one namespace and prevents path hijacking
    Allowed:
      namespace: namespace-a
      host: service-a.namespace-a.svc.cluster.local
    Rejected:
      namespace: namespace-a
      host: service-a.namespace-b.svc.cluster.local
    We enforce FQDN notation with a combined rule to support this.
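
The logic behind Rule 1 can be illustrated with a short sketch. The talk implements its policies in Rego for conftest and Gatekeeper; the Python below is only a stand-in to show the check itself, and `host_violation` and the exact FQDN pattern are assumptions.

```python
import re
from typing import Optional

# Assumed FQDN shape: <service>.<namespace>.svc.cluster.local
FQDN = re.compile(
    r"^(?P<service>[a-z0-9-]+)\.(?P<namespace>[a-z0-9-]+)\.svc\.cluster\.local$"
)


def host_violation(resource_namespace: str, host: str) -> Optional[str]:
    """Return a violation message, or None when the host is local and fully qualified."""
    match = FQDN.match(host)
    if match is None:
        # Short names are rejected so that the namespace can always be compared.
        return f"host {host!r} is not a fully qualified service name"
    if match.group("namespace") != resource_namespace:
        return (f"host {host!r} belongs to namespace "
                f"{match.group('namespace')!r}, not {resource_namespace!r}")
    return None


print(host_violation("namespace-a", "service-a.namespace-a.svc.cluster.local"))
print(host_violation("namespace-a", "service-a.namespace-b.svc.cluster.local"))
```

Requiring the FQDN form first is what makes the namespace comparison reliable: a short name such as `service-a` carries no namespace to compare against.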

  59. Rules
    1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
    2. Restrict in-use DestinationRule subsets modification

  60. Rule 2: Restrict in-use DestinationRule subsets modification
    Prevents traffic disruption caused by subset modification when used by existing VirtualServices
    DestinationRule
      spec:
        host: test.default.svc.cluster.local
        subsets:
        - name: v1
          labels:
            version: v1
    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1

  61. Rule 2: Restrict in-use DestinationRule subsets modification
    Prevents traffic disruption caused by subset modification when used by existing VirtualServices
    DestinationRule
      spec:
        host: test.default.svc.cluster.local
        subsets:
        - name: v1 -> v2
          labels:
            version: v1 -> v2
    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1
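
A rough sketch of the check behind Rule 2, written in Python for illustration rather than as the Rego policy from the talk. It assumes the existing VirtualServices are available to the check (the role played by the Gatekeeper inventory); the function names are hypothetical.

```python
from typing import Dict, List, Set


def subset_labels(destination_rule: dict) -> Dict[str, dict]:
    """Map each subset name to its labels."""
    subsets = destination_rule["spec"].get("subsets", [])
    return {s["name"]: s.get("labels", {}) for s in subsets}


def referenced_subsets(host: str, virtual_services: List[dict]) -> Set[str]:
    """Collect the subsets of `host` that existing VirtualServices route to."""
    used: Set[str] = set()
    for vs in virtual_services:
        for http in vs["spec"].get("http", []):
            for route in http.get("route", []):
                dest = route["destination"]
                if dest.get("host") == host and "subset" in dest:
                    used.add(dest["subset"])
    return used


def in_use_subsets_changed(old_dr: dict, new_dr: dict,
                           virtual_services: List[dict]) -> List[str]:
    """Return in-use subsets that the update would remove, rename, or relabel."""
    old, new = subset_labels(old_dr), subset_labels(new_dr)
    used = referenced_subsets(old_dr["spec"]["host"], virtual_services)
    return sorted(n for n in used if n in old and new.get(n) != old[n])


HOST = "test.default.svc.cluster.local"
old_dr = {"spec": {"host": HOST, "subsets": [{"name": "v1", "labels": {"version": "v1"}}]}}
new_dr = {"spec": {"host": HOST, "subsets": [{"name": "v2", "labels": {"version": "v2"}}]}}
vs = {"spec": {"hosts": [HOST], "http": [
    {"name": "default", "route": [{"destination": {"host": HOST, "subset": "v1"}}]}]}}

print(in_use_subsets_changed(old_dr, new_dr, [vs]))
```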

  62. Rules
    1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
    2. Restrict in-use DestinationRule subsets modification
    3. Restrict configurable fields in VirtualServices and DestinationRules

  63. Rule 3: Restrict configurable fields in VirtualServices and DestinationRules
    Prevents unsupported fields from being modified in VirtualServices and DestinationRules
    DestinationRule
      spec:
        host: test.default.svc.cluster.local
        subsets:
        - name: v1
          labels:
            version: v1
    Whitelist
    - spec.host
    - spec.host.name
    - spec.host.labels
    - spec.host.labels.version

  64. Rule 3: Restrict configurable fields in VirtualServices and DestinationRules
    Prevents unsupported fields from being used in VirtualServices and DestinationRules
    DestinationRule
      spec:
        host: test.default.svc.cluster.local
        subsets:
        - name: v1
          labels:
            version: v1
        exportTo: ["*"]
    Whitelist
    - spec.host
    - spec.host.name
    - spec.host.labels
    - spec.host.labels.version
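
One way to picture Rule 3 is as a comparison between the field paths present in a resource and an allowlist. The sketch below is an assumption-laden illustration in Python, not the policy from the talk: the `ALLOWED` set uses hypothetical paths based on the DestinationRule schema (`spec.subsets...`) rather than the list shown on the slide.

```python
from typing import Any, Iterator, List, Set

# Hypothetical allowlist; which fields to permit is a per-team decision.
ALLOWED: Set[str] = {
    "spec.host",
    "spec.subsets.name",
    "spec.subsets.labels.version",
}


def field_paths(value: Any, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every leaf field; list indices are dropped."""
    if isinstance(value, dict) and value:
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list) and value:
        for item in value:
            yield from field_paths(item, prefix)
    else:
        yield prefix


def disallowed_fields(resource: dict, allowed: Set[str] = ALLOWED) -> List[str]:
    """Return the fields under `spec` that are not on the allowlist."""
    paths = field_paths(resource.get("spec", {}), "spec")
    return sorted({p for p in paths if p not in allowed})


dr = {"spec": {
    "host": "test.default.svc.cluster.local",
    "subsets": [{"name": "v1", "labels": {"version": "v1"}}],
    "exportTo": ["*"],
}}
print(disallowed_fields(dr))
```

An allowlist fails closed: a field the platform team has never reviewed is rejected by default instead of slipping through.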

  65. Rules
    1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
    2. Restrict in-use DestinationRule subsets modification
    3. Restrict configurable fields in VirtualServices and DestinationRules
    4. Restrict VirtualServices from using non-existing subsets

  66. Rule 4: Restrict VirtualServices from using non-existing subsets
    Prevents traffic disruption caused by sinking traffic into a non-existing subset
    DestinationRule
      spec:
        host: test.default.svc.cluster.local
        subsets:
        - name: v1
          labels:
            version: v1
    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1

  67. Rule 4: Restrict VirtualServices from using non-existing subsets
    Prevents traffic disruption caused by sinking traffic into a non-existing subset
    DestinationRule: unchanged from the previous slide (only subset v1 exists)
    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v2
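
Rule 4 is the guardrail that would have caught the payment misconfiguration shown earlier. A minimal sketch of the check, in Python for illustration only (the talk's policies are written in Rego), with `missing_subsets` as a hypothetical helper that receives the known DestinationRules:

```python
from typing import List, Tuple


def missing_subsets(virtual_service: dict,
                    destination_rules: List[dict]) -> List[Tuple[str, str]]:
    """Return (host, subset) pairs the VirtualService routes to but no DestinationRule defines."""
    known = {
        (dr["spec"]["host"], subset["name"])
        for dr in destination_rules
        for subset in dr["spec"].get("subsets", [])
    }
    missing = []
    for http in virtual_service["spec"].get("http", []):
        for route in http.get("route", []):
            dest = route["destination"]
            subset = dest.get("subset")
            if subset is not None and (dest["host"], subset) not in known:
                missing.append((dest["host"], subset))
    return missing


HOST = "test.default.svc.cluster.local"
dr = {"spec": {"host": HOST, "subsets": [{"name": "v1", "labels": {"version": "v1"}}]}}


def vs_for(subset: str) -> dict:
    """Build the VirtualService from the slides for a given subset."""
    return {"spec": {"hosts": [HOST], "http": [
        {"name": "default", "route": [{"destination": {"host": HOST, "subset": subset}}]}]}}


print(missing_subsets(vs_for("v1"), [dr]))
print(missing_subsets(vs_for("v2"), [dr]))
```

Because the check needs the DestinationRules that already exist in the cluster, it fits the admission-controller stage better than a linter that sees only the files in one PR.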

  68. Rules
    1. Restrict Istio VirtualServices and DestinationRules to use local namespace hostnames
    2. Restrict in-use DestinationRule subsets modification
    3. Restrict configurable fields in VirtualServices and DestinationRules
    4. Restrict VirtualServices from using non-existing subsets
    5. Restrict duplicate hosts in VirtualServices and DestinationRules

  69. Rule 5: Restrict duplicate hosts in VirtualServices and DestinationRules
    Prevents unexpected/unpredictable traffic hijacking caused by having multiple VirtualServices and DestinationRules for the same host
    VirtualService
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1
    VirtualService
      spec:
        hosts:
        - test-v2.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1

  70. Rule 5: Restrict duplicate hosts in VirtualServices and DestinationRules
    Prevents unexpected traffic hijacking caused by multiple VirtualServices and DestinationRules having the same host.
    Two VirtualServices, both declaring:
      spec:
        hosts:
        - test.default.svc.cluster.local
        http:
        - name: default
          route:
          - destination:
              host: test.default.svc.cluster.local
              subset: v1
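
Detecting duplicate hosts amounts to grouping resources by the hosts they declare. The sketch below shows this for VirtualServices; it is an illustration in Python rather than the Rego policy from the talk, and the resource names `test-a` and `test-b` are made up.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_hosts(virtual_services: List[dict]) -> Dict[str, List[str]]:
    """Map each host claimed by more than one VirtualService to the claiming resources."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for vs in virtual_services:
        for host in vs["spec"].get("hosts", []):
            owners[host].append(vs["metadata"]["name"])
    return {host: names for host, names in owners.items() if len(names) > 1}


def make_vs(name: str, host: str) -> dict:
    """Build a minimal VirtualService with a single declared host."""
    return {"metadata": {"name": name}, "spec": {"hosts": [host]}}


distinct = [make_vs("test-a", "test.default.svc.cluster.local"),
            make_vs("test-b", "test-v2.default.svc.cluster.local")]
clashing = [make_vs("test-a", "test.default.svc.cluster.local"),
            make_vs("test-b", "test.default.svc.cluster.local")]

print(duplicate_hosts(distinct))
print(duplicate_hosts(clashing))
```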

  71. Takeaways
    ● Make all Istio changes through version control, with a PR process
    ● Enforce the use of Continuous Integration tools
    ● Leverage linters to catch issues at CI level, keeping the feedback loop short
    ● Leverage admission webhooks to
    ○ protect the resources
    ○ check what cannot be checked at linter level (inventory)

  72. Summary
    ● Guardrails are for both operators and users
    ● Operators
    ○ Maintenance processes
    ○ Observability
    ○ Acceptance testing
    ● Users
    ○ Limit the exposure of resources
    ○ Make safe policies at both CI and cluster level
    ○ Keep a good user experience with short feedback loops to retain your users

  73. Thank you for coming!

  75. How can we mitigate errors and their impact?
    ● Make all changes through version control, with PRs
    ○ Alone, it is not enough since reviewers are humans
    ● Enforce the use of Continuous Integration tools
    ○ Changes can only be made there
    ● Leverage linters to catch issues at CI level
    ● Leverage admission webhooks to protect the resources

  76. Misconfiguration example
    [Figure: Client -> Ingress Gateway -> payment v1 pods and a payment v2 pod, each with a proxy, 100% of the traffic]