
Tales of deploying Istio Ingress

This talk presents Workday’s journey towards deploying Istio Ingress to our Public Cloud environments. As we transferred our services from our legacy ingress to our new Istio ingress solution, a number of platform and application-layer issues surfaced. This talk presents how browser policy, HSTS, cookie stickiness, and headers can break applications, how we debugged those issues, and how we resolved them. Attendees can expect to learn some common and less common pitfalls of updating platform and infrastructure, the tools and techniques to triage them, and how they can impact the underlying applications.

Pauline Lallinec

May 11, 2021

Transcript

  1. Tales of deploying Istio Ingress

    1. Intro
    2. What is Istio? What is Envoy?
    3. Istio resources overview
    4. Workday infrastructure change overview
    5. Rollout plans
    6. Lessons learnt
  2. Data centers in Asia, Canada, Europe, USA

    • 95.14% of transactions had a response time of less than 1 second
    • 195 billion transactions in FY2020
    • Service uptime > 99.98%
    • Workday community = 45 million workers
  3. Software Engineer - DevOps, non-stop karaoke machine, @plallin

    Workday + k8s platform = Scylla
    • Public + private cloud (Asia, NA, Europe)
    • 5 teams, 2 continents
  4. Workday + Public Infrastructure = Pi (aka Infrastructure Public Cloud)

    • 2 teams, 2 continents
    • Software Engineer - DevOps, non-stop karaoke machine, @plallin
  6. What is Istio?

    Istio is an open-source, platform-independent service mesh that provides traffic management, policy enforcement, and telemetry collection.
    • Developed by Google + IBM + Lyft
    • Uses the Envoy proxy
    • It lets you manage your ingress resources as a set of Kubernetes resources
    • It runs inside the Kubernetes cluster
    • Istio Ingress =/= Istio Mesh
      ◦ Istio Ingress: manages external traffic to pods
      ◦ Istio Mesh: manages pod-to-pod traffic
  7. What is Envoy?

    Envoy is a high-performance proxy designed for cloud-native applications.
    • Originally developed by Lyft in C++
    • Open source
    • Cloud-agnostic
    • Manages all inbound and outbound traffic
    • In Istio, Envoy proxies are deployed as sidecar containers
  8. What is Envoy?

    Envoy features:
    • Dynamic service discovery
    • Load balancing
    • TLS termination
    • HTTP/2 and gRPC proxies
    • Circuit breakers
    • Health checks
    • Staged rollouts with %-based traffic split
    • Fault injection
    • & more
  10. Istio Resources

    • High-level routing rules are configured via custom resources
    • We will focus on 3 of them:
      ◦ Istio gateway: describes a load balancer operating at the edge of the mesh, receiving incoming or outgoing HTTP/TCP connections
      ◦ Virtual services: a set of traffic routing rules to apply when a host is addressed
      ◦ Destination rules: define policies that apply to traffic after routing has occurred (such as load balancing, connection pool size, etc.)
  11. Istio Resources

    • Istio is logically split into a control plane and a data plane
    • Istiod: the control plane
      ◦ Uses the CRDs to convert high-level routing rules into Envoy-specific configuration
      ◦ Pushes Envoy configuration to the Envoy proxy sidecars
      ◦ Discovery + certificate management (out of scope for this talk)
    • Envoy sidecars: the data plane
      ◦ Receive and route traffic as per the configuration received from the control plane
    • Istio ingress gateway (part of the data plane)
      ◦ Runs an Envoy proxy at the edge of the mesh
      ◦ Routes ingress traffic according to the ingress configuration
  12. Istio Resources: gateway

        apiVersion: networking.istio.io/v1alpha3
        kind: Gateway
        metadata:
          name: my-gateway
          namespace: some-config-namespace
        spec:
          selector:
            app: my-gateway-controller
          servers:
          - port:
              number: 443
              name: https-443
              protocol: HTTPS
            hosts:
            - uk.bookinfo.com
            - eu.bookinfo.com
            tls:
              mode: SIMPLE # enables HTTPS on this port
              serverCertificate: /etc/certs/servercert.pem
              privateKey: /etc/certs/privatekey.pem
  13. Istio Resources: virtual service

        apiVersion: networking.istio.io/v1alpha3
        kind: VirtualService
        metadata:
          name: reviews-route
        spec:
          gateways: # the field name is plural
          - my-gateway
          hosts:
          - uk.bookinfo.com
          http:
          - name: "booking-routes"
            match: # a request matches if either prefix matches
            - uri:
                prefix: "/booking"
            - uri:
                prefix: "/user-profile"
            route:
            - destination:
                host: bookings.my-namespace.svc.cluster.local
                port:
                  number: 8080
  14. Istio Resources: destination rule

        apiVersion: networking.istio.io/v1alpha3
        kind: DestinationRule
        metadata:
          name: bookinfo
        spec:
          host: bookings.my-namespace.svc.cluster.local
          trafficPolicy:
            loadBalancer:
              simple: LEAST_CONN
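
    The percentage-based traffic split listed among the Envoy features can be expressed by combining a virtual service with a destination rule. The following is a minimal sketch and is not taken from the talk: the `bookings-split` and `bookings-subsets` names, the `v1`/`v2` subsets, the `version` pod labels, and the 90/10 weights are all assumptions chosen for illustration.

    ```yaml
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: bookings-split            # hypothetical name
    spec:
      gateways:
      - my-gateway
      hosts:
      - uk.bookinfo.com
      http:
      - route:
        - destination:
            host: bookings.my-namespace.svc.cluster.local
            subset: v1                # assumed subset, defined below
          weight: 90                  # 90% of requests
        - destination:
            host: bookings.my-namespace.svc.cluster.local
            subset: v2                # assumed subset, defined below
          weight: 10                  # 10% of requests
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: bookings-subsets          # hypothetical name
    spec:
      host: bookings.my-namespace.svc.cluster.local
      subsets:
      - name: v1
        labels:
          version: v1                 # assumes pods carry a "version" label
      - name: v2
        labels:
          version: v2
    ```

    Shifting traffic then amounts to editing the two weights, which should add up to 100.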
  18. What is Istio Replacing?

    • Past ingress solution: Internet → Legacy load balancer → EC2 instances
    • Current ingress solution (this talk): Internet → Legacy load balancer → Istio
    • Future ingress solution (future): Internet → AWS Load Balancer → Istio
  21. Why are we implementing Istio Ingress?

    Dev and prod parity
    • Legacy load balancer too costly to run in dev
    • Dev load balancer = AWS Application Load Balancer
    • This represents a delta between dev and prod

    Reduce coupling with infrastructure team
    • The legacy load balancer is managed by the infrastructure team
    • Applications run on platforms managed by the platform team
    • Istio is also managed by the platform team

    Istio is under Kubernetes control
    • Easier for the platform team to manage
    • Stepping stone for the implementation of Istio Mesh
  23. Environments overview

    Staging, preprod, prod
    • Legacy: Internet → Legacy LB (including routing) → EC2
    • New ingress infrastructure: Internet → Legacy LB (no routing) → Istio
  24. Rollout plan in development

    Step 1: “Dark launch”
    “Dark launching allows development teams to test the efficacy of new, production-ready features without releasing them to an entire user base.” (launchdarkly.com)
    • Deploy Istio Ingress with no traffic
    • Add a network load balancer to route traffic to Istio
    • Implement a test ingress service to ensure traffic flows as expected
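
    A test ingress service of the kind described in the dark-launch step could be wired up as sketched here. This is an illustration under stated assumptions rather than the configuration used in the talk: the `ingress-test` names, the `ingress-test.example.com` host, and the backing service on port 8080 are hypothetical, and the selector reuses the `app: my-gateway-controller` label from the gateway slide.

    ```yaml
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: ingress-test-gateway      # hypothetical name
    spec:
      selector:
        app: my-gateway-controller    # label of the ingress gateway pods
      servers:
      - port:
          number: 80
          name: http-80
          protocol: HTTP
        hosts:
        - ingress-test.example.com    # hypothetical test host
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: ingress-test              # hypothetical name
    spec:
      gateways:
      - ingress-test-gateway
      hosts:
      - ingress-test.example.com
      http:
      - route:
        - destination:
            # hypothetical test service that only answers health-style requests
            host: ingress-test.my-namespace.svc.cluster.local
            port:
              number: 8080
    ```

    Because only the test host is routed, no production traffic reaches Istio until real virtual services are added.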
  25. Rollout plan in development

    Step 2: Pre-flight checks
    • Review the current configuration
    • Translate it into Istio configuration
    • Ask SMEs to review the configuration
  26. Rollout plan in development

    Step 3: Progressive rollout
    • Retire AWS application load balancers one by one
    • Route traffic to Istio instead
    • Ask SMEs to run regression tests
  27. Rollout plan in production

    Phased release
    • Regression tests: pair with SMEs to test the change and ensure there are no regressions
    • Move a subset of traffic in staging: forward a subset of the traffic to Istio
    • Roll out in further environments: repeat the process until all subsets of the network are forwarded to Istio
  29. Lesson #1: Make allies & educate them

    • Identify your key stakeholders: if possible, make a list of key allies so you have escalation points straight away when needed
    • Educate them about the change: the more they understand about the change, the better they will be able to test it
    • Make allies out of them: this lets you roll out faster by reducing friction and resistance to change, and makes sure they understand the scope of the change and have tested it
  30. Lesson #2: Double-check the configuration with your SMEs

    • Service SMEs should double-check the configuration: send a copy of the new configuration to the relevant team and ask them to verify it
    • Verify the permissions: as well as the routes, ask them to verify which HTTP methods are permitted on each route (e.g. HEAD, GET, POST, DELETE)
    • Prepare for the future: depending on the company size, teams might eventually need to own their own configuration, so it’s a good idea to make them familiar with it
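
    One way to make per-route method permissions reviewable is to encode them in the virtual service itself. The sketch below is an assumption about how this could look, not the configuration from the talk: it reuses the `/booking` route from the earlier slide, and the `bookings-read-only` name and the choice of GET and HEAD are hypothetical.

    ```yaml
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: bookings-read-only        # hypothetical name
    spec:
      gateways:
      - my-gateway
      hosts:
      - uk.bookinfo.com
      http:
      - name: "booking-read-routes"
        match:
        # Each entry requires BOTH the prefix and the method to match.
        - uri:
            prefix: "/booking"
          method:
            exact: GET
        - uri:
            prefix: "/booking"
          method:
            exact: HEAD
        route:
        - destination:
            host: bookings.my-namespace.svc.cluster.local
            port:
              number: 8080
    ```

    With only this route defined, a POST or DELETE to `/booking` matches nothing and is not forwarded to the service, which gives the service team a concrete list of methods to confirm or correct.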
  31. Lesson #3: Accept there will be “unknown unknowns”

    • Legacy is everywhere: it’s likely the legacy piece of infrastructure has some feature that too many teams depend on and that needs to be ported, replaced, or mitigated in your new infrastructure
    • Cutting edge is not everywhere! Give yourself & your team the time to learn, find issues, and mitigate them
    • Example: the “pinback feature”
  32. Istio Resources: gateway

        apiVersion: networking.istio.io/v1alpha3
        kind: Gateway
        metadata:
          name: my-gateway
          namespace: some-config-namespace
        spec:
          selector:
            app: my-gateway-controller
          servers:
          - port:
              number: 443
              name: https-443
              protocol: HTTPS
            hosts:
            - example.com
            tls:
              mode: SIMPLE # enables HTTPS on this port
              serverCertificate: /etc/certs/servercert.pem
              privateKey: /etc/certs/privatekey.pem
  33. Istio Resources: virtual service

        apiVersion: networking.istio.io/v1alpha3
        kind: VirtualService
        metadata:
          name: example
        spec:
          gateways: # the field name is plural
          - my-gateway
          hosts:
          - example.com
          http:
          - name: "booking-routes"
            match: # only /foo and /bar are exposed through the gateway
            - uri:
                prefix: "/foo"
            - uri:
                prefix: "/bar"
            route:
            - destination:
                host: example.my-namespace.svc.cluster.local
                port:
                  number: 8080
  34. Example: “pinback feature”

    • Can I get example.com/internalURI? No!
    • Can I get example.my-namespace.svc.cluster.local/internalURI? OK!
  38. Lesson #4: Verify your assumptions

    The behavior of your new infrastructure might not be the one you expect it to have. Check your assumptions to avoid disappointments.

    Example: sticky cookies in Istio

    Assumptions
    • Path will be defaulted to “/”
    • This will make my sticky cookie available on the entire website, ensuring stickiness

    Reality
    • Path is defaulted to the URI being hit
    • If you go to example.com/foo, the cookie is set at “/foo”
    • Which means that if you go to example.com/bar, stickiness is not guaranteed
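
    One way to avoid relying on the default is to set the cookie path explicitly in the destination rule. This is a minimal sketch, and the talk does not state that this is the fix that was applied: the `example-sticky` name, the `sticky-session` cookie name, and the one-hour TTL are assumptions.

    ```yaml
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: example-sticky            # hypothetical name
    spec:
      host: example.my-namespace.svc.cluster.local
      trafficPolicy:
        loadBalancer:
          consistentHash:
            httpCookie:
              name: sticky-session    # hypothetical cookie name
              path: /                 # explicit path, so /foo and /bar share the cookie
              ttl: 3600s              # assumed lifetime of one hour
    ```

    With `path: /` set, a cookie issued while visiting example.com/foo is also sent on requests to example.com/bar, so both are hashed to the same backend.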
  40. Lesson #5: Be ready to develop new solutions

    • (Again) Legacy is everywhere: when the legacy infrastructure you are replacing is well established in your company, it might have features or custom tooling that your new infrastructure can’t replace
    • Cutting edge is not everywhere :( Be ready for the possibility that you might have to develop new software, or get new pieces of infrastructure, in order to maintain some functionality
    • Example: development of a PrivateLink operator for cluster-to-cluster communication, to ensure isolation of some traffic from the Internet
    • Diagram: AWS DC 1 ↔ PrivateLink ↔ AWS DC 2
  41. Lesson #6: Don’t carry over legacy

    • (And again) Legacy is everywhere: if the infrastructure you are replacing is old, it’s likely there are pieces of legacy involved. Don’t carry them over to your new infra!
    • Take this opportunity to address legacy: liaise with the team whose infrastructure you’re replacing and take note of the pieces of legacy to address
    • Similarly, liaise with the service teams to ensure their configuration is up to date and that you’re not porting legacy to your new app
    • Example: HTTP headers overflow
  42. Lesson #7: Test at scale

    • Test at scale, with the same service redundancy and performance as you would expect in your current production environment
    • Test the performance of your current infrastructure as a baseline
    • Record the performance of the new infrastructure to ensure no regression, or mitigate prior to rollout
  43. Lesson #8: Be ready for failure

    • Defensive approach: assume you will have to revert the change and make it as easy as possible to do that
    • Prepare a rollback plan in advance and make all stakeholders familiar with it
    • Ultimate goal: minimize disruption!
  44. Lesson #9: Take all opportunities to learn about your new tool

    One last time: legacy is everywhere; cutting edge is not! If you are replacing legacy infrastructure with newer infrastructure, it’s likely there isn’t as much knowledge about it yet. Take all the opportunities you can to learn about it!
    • Experiment with it, participate in debugging issues, read the documentation, etc.
    • Document the bugs so that team members who could not participate in debugging can familiarize themselves with them
    • Share what you can with the wider community!
  45. Tales of deploying Istio Ingress: Summary

    Before developing a new solution
    • Identify stakeholders to make this change successful and make them your allies
    • Double-check the current configuration to make sure you start with a correct one
    • Accept there will be unknown unknowns

    While rolling out the solution
    • Verify your assumptions
    • Be willing to develop new solutions
    • Don’t carry over legacy

    Get ready for production rollout
    • Test at scale
    • Be prepared for failures

    At all times, take all the opportunities to learn!
  46. (Slide not included in the presentation)

    This presentation features not only my work, but my entire team’s work, and therefore I would like to recognize their contribution :-)
    Thank you, Scylla + Fabrication Team and INF + Pi Team
  47. Tales of deploying Istio Ingress: Thank you!

    • Learn more about engineering at Workday! medium.com/workday-engineering
    • Learn more about opportunities at Workday! workday.com/careers
    • Learn more about me! @plallin plallin.dev