Tales of deploying Istio Ingress

This talk presents Workday’s journey towards deploying Istio Ingress to our Public Cloud environments. As we transferred our services from our legacy ingress to our new Istio ingress solution, a number of platform and application-layer issues surfaced. This talk presents how browser policy, HSTS, cookie stickiness, and headers can break applications, how we debugged those issues, and how we resolved them. Attendees can expect to learn some common and less common pitfalls of updating platform and infrastructure, the tools and techniques to triage them, and how they can impact the underlying applications.

Pauline Lallinec

May 11, 2021

Transcript

  2. Tales of deploying Istio Ingress
    ---
    Continuous Lifecycle London 2021
    Pauline Lallinec

  3. Tales of deploying Istio Ingress
    ---
    1. Intro
    2. What is Istio? What is Envoy?
    3. Istio resources overview
    4. Workday infrastructure change overview
    5. Rollout plans
    6. Lessons learnt

  4. Data centers in Asia, Canada, Europe, USA
    95.14% of transactions had a response time of less than 1 second
    195 billion transactions in FY2020
    Service uptime > 99.98%
    Workday community = 45 million workers

  5. Software Engineer - DevOps
    Non-stop karaoke machine
    @plallin
    Workday + k8s platform = Scylla
    Public + Private cloud (Asia, NA, Europe)
    5 teams, 2 continents

  6. Workday + Public Infrastructure = Pi
    A.k.a. Infrastructure Public Cloud
    2 teams, 2 continents
    Software Engineer - DevOps
    Non-stop karaoke machine
    @plallin

  7. Tales of deploying Istio Ingress
    ---
    1. Intro
    2. What is Istio? What is Envoy?
    3. Istio resources overview
    4. Workday infrastructure change overview
    5. Rollout plans
    6. Lessons learnt

  8. What is Istio?
    Istio is an open-source platform-independent service
    mesh that provides traffic management, policy
    enforcement, and telemetry collection.
    ● Developed by Google + IBM + Lyft
    ● Uses the Envoy proxy
    ● It lets you manage your ingress resources as a set of
    Kubernetes resources
    ● It runs inside the Kubernetes cluster
    ● Istio ingress =/= Istio Mesh
    ○ Istio Ingress: manages external traffic to pods
    ○ Istio Mesh: manages pod-to-pod traffic

  9. What is Envoy?
    Envoy is a high-performance proxy designed for
    cloud-native applications
    ● Originally developed by Lyft in C++
    ● Open source
    ● Cloud-agnostic
    ● Manages all inbound and outbound traffic
    ● In Istio, Envoy proxies are deployed as sidecar containers

  10. What is Envoy?
    Envoy Features:
    ● Dynamic service discovery
    ● Load balancing
    ● TLS termination
    ● HTTP/2 and gRPC proxies
    ● Circuit breakers
    ● Health checks
    ● Staged rollouts with %-based traffic split
    ● Fault injection
    ● & more
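
    As an illustration of the %-based traffic split listed above, here is a minimal sketch of how it can be expressed with Istio's DestinationRule and VirtualService resources. The host, the subset names, the `version` labels, and the 90/10 weights are assumptions made up for this example; they are not taken from the talk.

    ```yaml
    # Hypothetical example: send 90% of requests to v1 and 10% to v2.
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: reviews-subsets          # hypothetical name
    spec:
      host: reviews.my-namespace.svc.cluster.local
      subsets:
      - name: v1
        labels:
          version: v1                # assumes pods carry a "version" label
      - name: v2
        labels:
          version: v2
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: reviews-split            # hypothetical name
    spec:
      hosts:
      - reviews.my-namespace.svc.cluster.local
      http:
      - route:
        - destination:
            host: reviews.my-namespace.svc.cluster.local
            subset: v1
          weight: 90
        - destination:
            host: reviews.my-namespace.svc.cluster.local
            subset: v2
          weight: 10
    ```

    Shifting the weights step by step (for example 90/10, then 50/50, then 0/100) gives a staged rollout without redeploying the workloads.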

  12. Tales of deploying Istio Ingress
    ---
    1. Intro
    2. What is Istio? What is Envoy?
    3. Istio resources overview
    4. Workday infrastructure change overview
    5. Rollout plans
    6. Lessons learnt

  13. Istio Resources
    ● High-level routing rules are configured via custom
    resources
    ● We will focus on 3 of them:
    ○ Istio gateway
    Describes a load balancer operating at the edge of
    the mesh receiving incoming or outgoing HTTP/TCP
    connections
    ○ Virtual services
    A set of traffic routing rules to apply when a host is
    addressed
    ○ Destination rules
    Define policies that apply to traffic after routing has
    occurred (such as load balancing, connection pool
    size, etc.)

  14. Istio Resources
    ● Istio is logically split into a control plane and a data plane
    ● Istiod: the control plane
    ○ Uses the CRDs to convert high-level routing rules into
    Envoy-specific configurations
    ○ Pushes Envoy configuration to Envoy proxy sidecars
    ○ Discovery + certificate management (out of scope for
    this talk)
    ● Envoy sidecars: the data plane
    ○ Receive and route traffic as per the configuration
    received from the control plane
    ● Istio ingress gateway (part of the data plane)
    ○ Runs an Envoy proxy
    ○ Routes ingress traffic according to the ingress
    configuration

  15. Istio Resources

  16. Istio Resources: gateway
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: my-gateway
      namespace: some-config-namespace
    spec:
      selector:
        app: my-gateway-controller
      servers:
      - port:
          number: 443
          name: https-443
          protocol: HTTPS
        hosts:
        - uk.bookinfo.com
        - eu.bookinfo.com
        tls:
          mode: SIMPLE # enables HTTPS on this port
          serverCertificate: /etc/certs/servercert.pem
          privateKey: /etc/certs/privatekey.pem

  17. Istio Resources: virtual service
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: reviews-route
    spec:
      gateways: # the field is plural: a list of gateway names
      - my-gateway
      hosts:
      - uk.bookinfo.com
      http:
      - name: "booking-routes"
        match:
        - uri:
            prefix: "/booking"
        - uri:
            prefix: "/user-profile"
        route:
        - destination:
            host: bookings.my-namespace.svc.cluster.local
            port:
              number: 8080

  18. Istio Resources: destination rule
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: bookinfo
    spec:
      host: bookings.my-namespace.svc.cluster.local
      trafficPolicy:
        loadBalancer:
          simple: LEAST_CONN
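
    Besides load balancing, a destination rule can also carry the connection pool settings mentioned in the resources overview, along with outlier detection (Istio's take on circuit breaking). The following is a sketch only: the resource name and all numeric limits are illustrative values chosen for this example, not settings from the talk.

    ```yaml
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: bookinfo-limits              # hypothetical name
    spec:
      host: bookings.my-namespace.svc.cluster.local
      trafficPolicy:
        loadBalancer:
          simple: LEAST_CONN
        connectionPool:
          tcp:
            maxConnections: 100          # illustrative value
          http:
            http1MaxPendingRequests: 10  # illustrative value
            maxRequestsPerConnection: 10 # illustrative value
        outlierDetection:                # temporarily eject hosts that keep failing
          consecutive5xxErrors: 5
          interval: 30s
          baseEjectionTime: 30s
    ```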

  19. Tales of deploying Istio Ingress
    ---
    1. Intro
    2. What is Istio? What is Envoy?
    3. Istio resources overview
    4. Workday infrastructure change overview
    5. Rollout plans
    6. Lessons learnt

  20. What is Istio Replacing?
    Past Ingress solution: Internet → Legacy load balancer → EC2 instances
    Current Ingress solution: Internet → Legacy load balancer → Istio
    Future Ingress solution: Internet → AWS Load Balancer → Istio

  21. What is Istio Replacing?
    Past Ingress solution: Internet → Legacy load balancer → EC2 instances
    Current Ingress solution: Internet → Legacy load balancer → Istio
    Future Ingress solution: Internet → AWS Load Balancer → Istio
    This talk

  22. What is Istio Replacing?
    Past Ingress solution: Internet → Legacy load balancer → EC2 instances
    Current Ingress solution: Internet → Legacy load balancer → Istio
    Future Ingress solution: Internet → AWS Load Balancer → Istio
    This talk
    Future

  23. Why are we implementing Istio Ingress?
    Dev and Prod parity
    ● Legacy Load balancer too costly to run in dev
    ● Dev load balancer = AWS Application Load balancer
    ● This represents a delta between dev and prod

  24. Why are we implementing Istio Ingress?
    Dev and Prod parity
    ● Legacy Load balancer too costly to run in dev
    ● Dev load balancer = AWS Application Load balancer
    ● This represents a delta between dev and prod
    Reduce coupling with infrastructure team
    ● The legacy load balancer: managed by the Infrastructure
    team.
    ● Applications run on platforms managed by the platform
    team
    ● Istio: also managed by the platform team

  25. Why are we implementing Istio Ingress?
    Dev and Prod parity
    ● Legacy Load balancer too costly to run in dev
    ● Dev load balancer = AWS Application Load balancer
    ● This represents a delta between dev and prod
    Reduce coupling with infrastructure team
    ● The legacy load balancer: managed by the Infrastructure
    team.
    ● Applications run on platforms managed by the platform
    team
    ● Istio: also managed by the platform team
    Istio is under Kubernetes control
    ● Easier for platform team to manage
    ● Stepping stone for the implementation of Istio Mesh.

  26. Tales of deploying Istio Ingress
    ---
    1. Intro
    2. What is Istio? What is Envoy?
    3. Istio resources overview
    4. Workday infrastructure change overview
    5. Rollout plans
    6. Lessons learnt

  27. Environments overview
    Development
    Legacy: Internet → AWS ALB → EC2
    New Ingress infrastructure: Internet → AWS NLB → Istio

  28. Environments overview
    Staging, preprod, prod
    Legacy: Internet → Legacy LB (including routing) → EC2
    New ingress infrastructure: Internet → Legacy LB (no routing) → Istio

  29. Rollout plan in development
    Step 1: “Dark launch”
    “Dark launching allows development teams to test the efficacy of
    new, production-ready features without releasing them to an entire
    user base.” (launchdarkly.com)
    ● Deploy Istio Ingress with no traffic
    ● Add network load balancer to route traffic to Istio
    ● Implement a test ingress service to ensure traffic flows as
    expected
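
    One common way to put an AWS Network Load Balancer in front of the Istio ingress gateway is to annotate the gateway's Kubernetes Service. The sketch below assumes the default `istio-ingressgateway` Service in the `istio-system` namespace, its default labels and ports, and the AWS cloud provider's load balancer annotation; the talk does not say how Workday provisioned its NLB.

    ```yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: istio-ingressgateway       # assumed: default Istio gateway Service name
      namespace: istio-system          # assumed: default Istio namespace
      annotations:
        # Ask the AWS cloud provider for a Network Load Balancer
        service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    spec:
      type: LoadBalancer
      selector:
        istio: ingressgateway          # assumed: default gateway pod label
      ports:
      - name: https
        port: 443
        targetPort: 8443               # assumed: default gateway HTTPS container port
    ```

    Because no VirtualService is bound to a production host at this point, the load balancer can exist and be tested without real user traffic reaching Istio.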

  30. Rollout plan in development
    Step 2: pre-flight checks
    ● Review the current configuration
    ● Translate it into Istio configuration
    ● Ask SMEs to review the configuration

  31. Rollout plan in development
    Step 3: Progressive rollout
    ● Retire AWS application load balancers one by one
    ● Route traffic to Istio instead
    ● Ask SMEs to run regression tests

  32. Rollout plan in production
    Move subset of traffic in staging: forward a subset of the traffic to Istio
    Regression tests: pair with SMEs to test the change and ensure there are no regressions
    Phased release: repeat the process until all subsets of the network are forwarded to Istio
    Roll out in further environments

  33. Tales of deploying Istio Ingress
    ---
    1. Intro
    2. What is Istio? What is Envoy?
    3. Istio resources overview
    4. Workday infrastructure change overview
    5. Rollout plans
    6. Lessons learnt

  34. Lesson #1: Make allies & educate them
    Identify your key stakeholders
    If possible, make a list of key allies so you have escalation points
    straight away when needed
    Educate them about the change
    The more they understand about the change, the more they will
    be able to test the change
    Make allies out of them
    This will allow you to roll out faster by reducing friction /
    resistance to change, and it makes sure they understand the
    scope of the change and have tested it

  35. Lesson #2: Double-check the
    configuration with your SMEs
    Service SMEs should double-check the configuration
    Send a copy of the new configuration to the relevant team and
    ask them to verify it
    Verify the permissions
    In addition to the routes, ask them to verify the permissions on
    each route (e.g. HEAD, GET, POST, DELETE)
    Prepare for the future
    Depending on the company size, eventually you might need
    teams to own their own configuration, so it’s a good idea to make
    them familiar with it
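
    To make the "permissions against the route" check concrete, here is a sketch of a virtual service route that only matches read-only methods. It reuses the hosts and gateway from the earlier examples; the resource name, the route name, and the choice of GET and HEAD are assumptions for illustration, not Workday's actual rules.

    ```yaml
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: bookings-read-only         # hypothetical name
    spec:
      gateways:
      - my-gateway
      hosts:
      - uk.bookinfo.com
      http:
      - name: "booking-read"           # hypothetical route name
        match:
        - uri:
            prefix: "/booking"
          method:
            regex: "GET|HEAD"          # only these methods match this route
        route:
        - destination:
            host: bookings.my-namespace.svc.cluster.local
            port:
              number: 8080
    ```

    With no other route defined for `/booking`, a POST or DELETE would not match any route, so an SME reviewing this file can see at a glance which methods are expected to reach the service.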

  36. Lesson #3: Accept there will be “unknown
    unknowns”
    Legacy is everywhere
    It’s likely the legacy piece of infrastructure has some feature that
    too many teams depend on and that needs to be ported,
    replaced, or mitigated in your new infrastructure
    Cutting edge is not everywhere!
    Give yourself & your team the time to learn, find issues, and
    mitigate them
    Example: the “pinback feature”

  37. Example: “pinback feature”
    Can I get example.com/internalURI?
    No!

  38. Example: “pinback feature”
    Can I get example.com/internalURI?
    No!
    Can I get example.com/internalURI?
    OK!

  39. Example: “pinback feature”
    Can I get example.com/internalURI?
    No!
    Can I get example.com/internalURI?
    Also no!

  40. Istio Resources: gateway
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: my-gateway
      namespace: some-config-namespace
    spec:
      selector:
        app: my-gateway-controller
      servers:
      - port:
          number: 443
          name: https-443
          protocol: HTTPS
        hosts:
        - example.com
        tls:
          mode: SIMPLE # enables HTTPS on this port
          serverCertificate: /etc/certs/servercert.pem
          privateKey: /etc/certs/privatekey.pem

  41. Istio Resources: virtual service
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: example
    spec:
      gateways: # the field is plural: a list of gateway names
      - my-gateway
      hosts:
      - example.com
      http:
      - name: "booking-routes"
        match:
        - uri:
            prefix: "/foo"
        - uri:
            prefix: "/bar"
        route:
        - destination:
            host: example.my-namespace.svc.cluster.local
            port:
              number: 8080

  42. Example: “pinback feature”
    Can I get example.com/internalURI?
    No!
    Can I get example.my-namespace.svc.cluster.local/internalURI?
    OK!

  43. Lesson #4: Verify your assumptions
    The behavior of your new infrastructure might not be the one you
    expect it to have. Check your assumptions to avoid surprises.

  44. Lesson #4: Verify your assumptions
    The behavior of your new infrastructure might not be the one you
    expect it to have. Check your assumptions to avoid
    disappointments.
    Example: sticky cookies in Istio

  45. Lesson #4: Verify your assumptions
    The behavior of your new infrastructure might not be the one you
    expect it to have. Check your assumptions to avoid
    disappointments.
    Example: sticky cookies in Istio
    Assumptions:
    ● Path will be defaulted to “/”
    ● This will make my sticky cookie available on the entire
    website, ensuring stickiness

  46. Lesson #4: Verify your assumptions
    The behavior of your new infrastructure might not be the one you
    expect it to have. Check your assumptions to avoid
    disappointments.
    Example: sticky cookies in Istio
    Assumptions
    ● Path will be defaulted to “/”
    ● This will make my sticky cookie available on the entire
    website, ensuring stickiness
    Reality
    ● Path is defaulted to the URI being hit
    ● If you go to example.com/foo, the cookie is set at “/foo”
    ● Which means, if you go to “example.com/bar”, stickiness is
    not guaranteed.
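
    If the cookie should cover the whole site, one option is to set the path explicitly rather than rely on a default. The sketch below is an assumption about how that could look with a destination rule using cookie-based consistent hashing; the resource name, cookie name, and TTL are hypothetical, and the talk does not show the configuration Workday ended up with.

    ```yaml
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: example-sticky             # hypothetical name
    spec:
      host: example.my-namespace.svc.cluster.local
      trafficPolicy:
        loadBalancer:
          consistentHash:
            httpCookie:
              name: sticky-session     # hypothetical cookie name
              path: /                  # set explicitly so the cookie applies site-wide
              ttl: 3600s               # illustrative lifetime
    ```

    With `path: /`, a cookie issued while visiting example.com/foo is also sent on requests to example.com/bar, so both hash to the same backend.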

  47. Lesson #5: Be ready to develop new
    solutions
    (Again) Legacy is everywhere
    When the legacy infrastructure you are replacing is well established in your
    company, it might have features / custom tooling that your new infrastructure
    can’t replace
    Cutting edge is not everywhere :(
    Be ready for the possibility that you might have to develop new software,
    or get new pieces of infrastructure, in order to maintain some functionality
    Example: Development of a PrivateLink operator for cluster-to-cluster
    comm, to ensure isolation of some traffic from the Internet
    AWS DC 1 AWS DC 2
    Internet

  48. Lesson #5: Be ready to develop new
    solutions
    (Again) Legacy is everywhere
    When the legacy infrastructure you are replacing is well established in your
    company, it might have features / custom tooling that your new infrastructure
    can’t replace
    Cutting edge is not everywhere :(
    Be ready for the possibility that you might have to develop new software,
    or get new pieces of infrastructure, in order to maintain some functionality
    Example: Development of a PrivateLink operator for cluster-to-cluster
    comm, to ensure isolation of some traffic from the Internet
    AWS DC 1 AWS DC 2
    PrivateLink

  49. Lesson #6: Don’t carry over legacy
    (And again) Legacy is everywhere
    If the infrastructure you are replacing is old, it’s likely there are
    pieces of legacy involved! Don’t carry them over to your new infra!
    Take this opportunity to address legacy
    Liaise with the team whose infrastructure you’re replacing and
    take notes of the pieces of legacy to address
    Similarly, liaise with the service teams to ensure their
    configuration is up to date and that you’re not porting legacy to
    your new app
    Example: HTTP header overflow

  50. Example: HTTP header overflow

  51. Example: HTTP header overflow

  52. Example: HTTP header overflow

  53. Example: HTTP header overflow
    Why am I writing so many headers?
    I should fix that...

  54. Lesson #7: Test at scale
    ● Test at scale, with the same service redundancy and
    performance as you would expect in your current
    production environment
    ● Test the performance of your current infrastructure as a
    baseline
    ● Record the performance of the new infrastructure to ensure
    no regression, or mitigate prior to rollout

  55. Lesson #8: Be ready for failure
    ● Defensive approach: Assume you will have to revert the
    change and make it as easy as possible to do that
    ● Prepare a rollback plan in advance and make all
    stakeholders familiar with it
    ● Ultimate goal: minimize disruption!

  56. Lesson #9: Take all opportunities to learn
    about your new tool
    One last time: legacy is everywhere; cutting edge is not!
    If you are replacing legacy infrastructure with new age
    infrastructure, it’s likely there isn’t as much knowledge about it
    yet. Take all opportunities you can to learn about it!
    ● Experiment with it, participate in debugging issues, read
    the documentation, etc.
    ● Document the bugs so that team members who could not
    participate in debugging can familiarize themselves with it
    ● Share what you can with the wider community!

  57. Tales of deploying Istio Ingress
    ---
    Summary
    Before developing a new solution
    ● Identify stakeholders to make this change successful and make them your allies
    ● Double check the current configuration to make sure you start with a correct one
    ● Accept there will be unknown unknowns
    While rolling out the solution
    ● Verify your assumptions
    ● Be willing to develop new solutions
    ● Don’t carry over legacy
    Get ready for production rollout
    ● Test at scale
    ● Be prepared for failures
    At all times, take all the opportunities to learn!

  58. This presentation features not only my work,
    but my entire team’s work, and therefore I
    would like to recognize their contribution :-)
    Thank you
    Scylla + Fabrication Team
    INF + Pi Team
    Slide not included in the presentation

  59. Farouq
    Cathal
    Adrian
    Sathish
    David
    Rob
    Lucas
    Declan
    Dave
    John
    Jamie

  60. Tales of deploying Istio Ingress
    ---
    Thank you!
    Learn more about engineering at Workday!
    medium.com/workday-engineering
    Learn more about opportunities at Workday!
    workday.com/careers
    Learn more about me!
    @plallin
    plallin.dev
