Secure your cluster-to-cluster traffic, the agnostic way - Pauline Lallinec & Dave Kerr, Workday

Workday is shifting to a multi-cloud approach whereby its Kubernetes platform, known as Scylla, can be deployed to public cloud providers as well as Workday's own data centers. To achieve this, we needed to route tenant data across existing AWS clusters in different regions, to Workday's own data centers, and potentially, in the future, to other public cloud providers. While cloud providers usually offer solutions for migrating data to their own cloud, Workday aims to be cloud-agnostic and therefore needs a solution for migrating data across clouds. The infrastructure, platform, and application development teams cooperated to develop a solution relying on Kubernetes operators, Istio, Consul, and Helm-delivered application configuration. This talk gives an overview of the tools and technologies used to migrate tenant data to other clusters, wherever they are deployed. We also review the lessons learned from this experience and give an overview of future work.

Pauline Lallinec

October 15, 2021

Transcript

  1. (title slide)

  2. Secure your cluster-to-cluster traffic, the agnostic way
    ---
    Dave Kerr
    Pauline Lallinec
    Workday

  3. Data centers in Canada, Europe, USA
    95.18% of transactions had a response time of less than 1 second
    265 billion transactions (FY2021)
    Service uptime > 99.97%
    Workday community = 55 million workers

  4. Software Engineer - DevOps
    Gaming, Tabletop and CTF stuff
    @davekerr95
    Workday + K8s platform = Scylla
    Public + Private Cloud (NA, EMEA)
    6 teams, 2 continents

  5. Workday + Public Infrastructure = Pi
    a.k.a. Infrastructure Public Cloud
    2 teams, 2 continents
    Software Engineer - DevOps
    (powered off) non-stop karaoke machine
    @plallin

  6. What to expect from this presentation
    ● A use case for provider-agnostic cluster communication
    ● An overview of a solution to secure cluster-to-cluster communication
    ● A story of how Workday incrementally implemented that solution
    ● An overview of custom Kubernetes resources & operators developed by Workday
    ● "Food for thought" at the end of this slide deck to go further / drive your own research

  7. (image slide)

  8. Cluster-to-cluster comm: why do we need it?

  9. Hey! My name is OX. I'm a service living in Cluster-1 in DC1. Sometimes, I want to send messages to my family in DC2.

  10. In the old days, I used the public-facing load balancer to reach my family. But securing communication over the public internet was very hard and required my carers to maintain complex configuration.

  11. You know what would make an old ox happy? Secure, easy-to-configure cluster-to-cluster communication!

  12. Cluster-to-Cluster Communications
    Service Teams' Use Cases
    The need to communicate with services living in another cluster
    ● Services may be deployed in specific clusters based on their workload type
    ● Services need to transfer tenanted data between services running in different clusters
    The need to easily configure these communication routes
    ● Should be easy to configure and deploy as a service team
    ● Should be low touch for Infrastructure & Platform Engineering teams

  13. Moving Tenanted Data Between Workday Instances
    The Problem
    As a service, we need to move tenanted data between instances of Workday living in different clusters.
    Communication to Internal APIs
    To move tenanted data between clusters, we rely on internal APIs that should not be available externally on the public gateway.
    "Hey! How do I get access to https://dc2.workday.com/internalapi?"

  14. Cluster-to-cluster comm: Legacy solution

  15. Legacy solution
    Service in Cluster 1 in DC 1: "Hey, can I get access to https://dc2.workday.com/internalapi?"
    Public Load Balancer of Cluster 2 in DC 2: "Sure, I can do that for you."
    (over private DC comm)

  16. Legacy solution
    Service in Cluster 1 in DC 1: "Hey, can I get access to https://dc2.workday.com/internalapi?"
    Public Load Balancer of Cluster 2 in DC 2: "Sure, I can do that for you."
    (over private DC comm)
    This is a legacy load balancer.

  17. Legacy solution
    Service in Cluster 1 in DC 1: "Hey, can I get access to https://dc2.workday.com/internalapi?"
    Public Load Balancer of Cluster 2 in DC 2: "Sure, I can do that for you."
    (over private DC comm)
    This is a legacy load balancer. We want to retire these load balancers and replace them with Istio.

  18. Cluster-to-cluster comm: Introducing Istio

  19. What is Istio?
    Istio is an open-source, platform-independent service mesh that provides traffic management, policy enforcement, and telemetry collection.
    ● Developed by Google + IBM + Lyft
    ● Uses the Envoy proxy
    ● Lets you manage your ingress resources as a set of Kubernetes resources
    ● Runs inside the Kubernetes cluster

  20. What is Envoy?
    Envoy is a high-performance proxy designed for cloud-native applications.
    ● Originally developed by Lyft, in C++
    ● Open source
    ● Cloud-agnostic
    ● Manages all inbound and outbound traffic
    ● In Istio, Envoy proxies are deployed as sidecar containers

  21. Source: CNCF survey 2020 https://www.cncf.io/wp-content/uploads/2020/11/CNCF_Survey_Report_2020.pdf

  22. Source: CNCF survey 2020 https://www.cncf.io/wp-content/uploads/2020/11/CNCF_Survey_Report_2020.pdf

  23. This sounds great! So how can I use it to send messages to my family located in Cluster-2 in DC2?

  24. Istio: public ingress gateway
    "Can I get example.com/internalURI?" "No!"

  25. Istio: public ingress gateway
    "Can I get example.com/internalURI?" "No!"
    "Can I get example.com/internalURI?" "OK!"

  26. Istio: public ingress gateway
    "Can I get example.com/internalURI?" "No!"
    "Can I get example.com/internalURI?" "Also no!"
    Istio public gateway

  27. Solution: private Istio gateway
    "Can I get example.com/internalURI?" "OK!"
    (diagram: AWS DC1 to AWS DC2 over PrivateLink, via the Istio private gateway)
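    In Istio, a private gateway like this is declared with a standard Gateway resource bound to a dedicated ingress deployment that is reachable only over PrivateLink. A minimal sketch; the names, host pattern, and TLS secret are illustrative assumptions, not Workday's actual configuration:

    apiVersion: networking.istio.io/v1beta1
    kind: Gateway
    metadata:
      name: private-gateway              # hypothetical name
      namespace: istio-system
    spec:
      # Selects the dedicated private ingress gateway pods, which sit
      # behind the PrivateLink endpoint rather than a public load balancer.
      selector:
        istio: private-ingressgateway    # hypothetical pod label
      servers:
        - port:
            number: 443
            name: https
            protocol: HTTPS
          tls:
            mode: SIMPLE
            credentialName: internal-gateway-cert   # hypothetical TLS secret
          hosts:
            - "*.internal.dc2.example.com"          # illustrative host pattern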

  28. Finally, I can securely talk to my family in AWS cluster2!

  29. Wait a minute! You promised us a provider-agnostic talk and now you're talking about AWS PrivateLink. Therefore, isn't your talk basically...

  30. (image slide)

  31. (image slide)

  32. Solution, V1: AWS PrivateLink Operator
    ● We needed to solve the issue for the AWS use case
    ● Developing a new solution for entirely agnostic cluster-to-cluster comm takes time, and we needed a solution for AWS sooner than this
    Decision
    ● Develop a solution for the AWS use case
    ● Then develop a solution that works across cloud providers (agnostic)

  33. Solution, V1: AWS PrivateLink Operator
    (diagram: the privatelink-operator watches PrivateLinkEgress and PrivateLinkIngress resources and creates the corresponding AWS resources)

  34. Delivering Cluster-to-Cluster Communications for Service Teams
    Service onboarding should be easy and as automated as possible
    ● Service teams deliver configuration via a mixture of per-cluster configurations as well as Helm-templated configurations
    ● Service teams should be able to deliver the majority of the configuration without requiring the Platform Engineering or Infrastructure teams.

  35. Delivering PrivateLinkIngress Configuration
    A Kubernetes CRD for delivering and building AWS PrivateLink connectivity
    ● Service teams deliver configuration as raw or templated Kubernetes resources, specifying just the Amazon Resource Names (ARNs) of the clusters they need to allow ingress from.
    ● Service teams can deploy and test these easily using standard Kubernetes toolchains (kubectl, helm, etc.)

  36. Example - PrivateLinkIngress
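    The manifest shown on this slide is not preserved in the transcript. Below is a minimal sketch of what such a custom resource might look like; the API group and every field name are hypothetical, since Workday's CRD schema is not public. Only the idea of specifying just the ARNs allowed to connect comes from the slides.

    apiVersion: workday.example.com/v1alpha1   # hypothetical API group
    kind: PrivateLinkIngress
    metadata:
      name: internal-api-ingress               # hypothetical name
      namespace: my-service
    spec:
      # ARNs of the clusters (AWS principals) that are allowed to open
      # PrivateLink connections to this cluster's endpoint service.
      allowedPrincipals:
        - arn:aws:iam::111111111111:root       # illustrative ARN
        - arn:aws:iam::222222222222:root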

  37. Ok, so now I know how to send messages to my family, but how can I be sure they receive them?

  38. Delivering Istio Internal Gateway Resources
    Service onboarding should be easy and as automated as possible
    ● Service teams deliver configuration to the target cluster via Helm-templated VirtualServices
    ● Service teams have the flexibility to add and remove routable paths, request headers, etc. with a pull request
    ● Service teams follow an existing naming convention for PrivateLink fully qualified domain names (FQDNs) used by their application and by Istio!
    ● Easy to test with the standard Kubernetes toolchain!

  39. Example: Istio VirtualService
    In Istio, the VirtualService resource describes how network traffic should be routed around the cluster.
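    The manifest from this slide is likewise lost in the transcript. A minimal sketch of a VirtualService that routes an internal path arriving on the private gateway to an in-cluster service; the host, gateway, and service names are illustrative assumptions:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: internal-api                 # hypothetical name
      namespace: my-service
    spec:
      hosts:
        - internal.dc2.example.com       # illustrative PrivateLink FQDN
      gateways:
        - istio-system/private-gateway   # bind to the private gateway only
      http:
        - match:
            - uri:
                prefix: /internalapi
          route:
            - destination:
                host: my-service.my-service.svc.cluster.local
                port:
                  number: 8080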

  40. Did I mention I have a very large family spread across many different datacenters? Sometimes, I don't even know if they live in an AWS datacenter or not...

  41. Cluster-to-cluster comm: Introducing Consul

  42. What is Consul?
    Consul automates networking for simple and secure application delivery.
    ● Developed by HashiCorp (joined CNCF in 2020)
    ● Uses the Envoy proxy (just like Istio!)
    ● Offers a solution for a multi-platform service mesh
    ● Runs inside or outside the Kubernetes cluster

  43. Solution, V2: Implementation of Consul Service Mesh
    From our V1 solution, we keep:
    ● Routing of traffic within the cluster via Istio
    ● Helm for packaging and delivering configuration via operators
    We introduce:
    ● A Consul cluster running outside the Kubernetes cluster

  44. Why did we implement Consul?
    Cloud independent
    Consul is a service mesh for any runtime and cloud; as part of Workday's cloud strategy, we needed a solution that allowed us to link different clouds together.

  45. Why did we implement Consul?
    Cloud independent
    Consul is a service mesh for any runtime and cloud; as part of Workday's cloud strategy, we needed a solution that allowed us to link different clouds together.
    Kubernetes independent
    We needed to link to services that are not (yet?) deployed on Kubernetes. This allows us to have applications running on Kubernetes linked to applications running outside Kubernetes.

  46. Why did we implement Consul?
    Cloud independent
    Consul is a service mesh for any runtime and cloud; as part of Workday's cloud strategy, we needed a solution that allowed us to link different clouds together.
    Kubernetes independent
    We needed to link to services that are not (yet?) deployed on Kubernetes. This allows us to have applications running on Kubernetes linked to applications running outside Kubernetes.
    Fully-featured with the power of Envoy
    Istio and Consul both use Envoy, which means we aren't running different proxies, just different control planes.

  47. Why is it interesting to adopt a platform/provider-agnostic solution?
    ● Services may be exposed from any datacenter
    ● May link services offered by different cloud providers
    ● Easier to deploy or move services across datacenters / providers

  48. Consul & Istio
    ● It is entirely possible to use Istio to achieve what we're doing with Consul, and to use Consul to achieve what we're doing in Istio

  49. Consul & Istio
    ● It is entirely possible to use Istio to achieve what we're doing with Consul, and to use Consul to achieve what we're doing in Istio
    ● In our setup, Istio is a lot closer to the application level than Consul is, and we wanted some of Istio's features that aren't available (or not as easy to set up) in Consul

  50. Consul & Istio
    ● It is entirely possible to use Istio to achieve what we're doing with Consul, and to use Consul to achieve what we're doing in Istio
    ● In our setup, Istio is a lot closer to the application level than Consul is, and we wanted some of Istio's features that aren't available (or not as easy to set up) in Consul
    ● Likewise, reviewing the documentation for Istio and Consul, we found that Consul adoption outside Kubernetes was better documented than Istio's

  51. Consul & Istio
    ● It is entirely possible to use Istio to achieve what we're doing with Consul, and to use Consul to achieve what we're doing in Istio
    ● In our setup, Istio is a lot closer to the application level than Consul is, and we wanted some of Istio's features that aren't available (or not as easy to set up) in Consul
    ● Likewise, reviewing the documentation for Istio and Consul, we found that Consul adoption outside Kubernetes was better documented than Istio's
    ● Both solutions use Envoy as their proxy; we're only using the control plane that best fit our needs at different layers of the stack

  52. Consul & Istio
    In summary:
    ● Istio: our choice for our Kubernetes platform; easy to implement in K8s & comes with a lot of interesting features for the service teams deploying on Kubernetes
    ● Consul: securing inter-DC connections is very well documented; it's a core use case of Consul (cf. Consul's Mesh Gateway)
    ● This represents Workday's analysis at the time we were reviewing solutions; both Istio and Consul are evolving, and it's possible that in the future we will review our architecture to uptake new/other solutions

  53. How does Consul work?
    Components of the Consul mesh
    ● Service catalogue, offered via a cluster of Consul servers
    ● Ingress gateway for incoming traffic to the Consul mesh
    ● Terminating gateway for outgoing traffic from the Consul mesh (e.g. to send traffic to Istio)
    ● Mesh gateway for datacenter-to-datacenter comm
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway
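    Where Consul's Kubernetes components are delivered via Helm, these gateways map onto options of the public consul-helm chart. A sketch of chart values, assuming (as in our setup) that the Consul servers run outside Kubernetes; the key names come from the consul-helm chart, all values are illustrative, and TLS/ACL wiring is omitted:

    # values.yaml for the consul-helm chart (illustrative sketch only)
    global:
      name: consul
      datacenter: dc1
    server:
      enabled: false                 # servers run outside Kubernetes here
    externalServers:
      enabled: true
      hosts: ["consul.dc1.example.internal"]   # hypothetical server address
    connectInject:
      enabled: true                  # sidecar injection for mesh services
    meshGateway:
      enabled: true                  # datacenter-to-datacenter traffic
    ingressGateways:
      enabled: true                  # traffic entering the Consul mesh
    terminatingGateways:
      enabled: true                  # traffic leaving the mesh (e.g. to Istio)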

  54. How does Consul work?
    What happens when service1 in DC1 needs to talk to service2 in DC2?
    ● The Consul agent queries the Consul server to find out where the service is
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway

  55. How does Consul work?
    What happens when service1 in DC1 needs to talk to service2 in DC2?
    ● The Consul agent queries the Consul server to find out where the service is
    ● If found within the DC: the query is routed to the relevant service via the ingress gateway
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway

  56. How does Consul work?
    What happens when service1 in DC1 needs to talk to service2 in DC2?
    ● The Consul agent queries the Consul server to find out where the service is
    ● If found within the DC: the query is routed to the relevant service via the ingress gateway
    ● If not found within the DC: the query is routed to the mesh gateway
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway

  57. How does Consul work?
    What happens when service1 in DC1 needs to talk to service2 in DC2?
    ● The Consul agent queries the Consul server to find out where the service is
    ● If found within the DC: the query is routed to the relevant service via the ingress gateway
    ● If not found within the DC: the query is routed to the mesh gateway
    ● The mesh gateways know where services are located (routing based on server name via SNI sniffing)
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway

  58. How does Consul work?
    What happens when service1 in DC1 needs to talk to service2 in DC2?
    ● The Consul agent queries the Consul server to find out where the service is
    ● If found within the DC: the query is routed to the relevant service via the ingress gateway
    ● If not found within the DC: the query is routed to the mesh gateway
    ● The mesh gateways know where services are located (routing based on server name via SNI sniffing)
    ● Traffic is routed to another mesh gateway. Mesh gateways use mTLS to establish trust, and they don't decrypt the traffic
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway
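    On Kubernetes, sending a service's cross-DC traffic through the local mesh gateway is typically switched on with a proxy-defaults config entry. A sketch using the consul-k8s ProxyDefaults CRD, assuming the CRD controller is installed (Consul requires proxy defaults to be named "global"):

    apiVersion: consul.hashicorp.com/v1alpha1
    kind: ProxyDefaults
    metadata:
      name: global        # proxy-defaults config entries must be named "global"
    spec:
      meshGateway:
        # "local": upstream cross-datacenter traffic leaves through this
        # DC's mesh gateway, which forwards it, still mTLS-encrypted,
        # to the remote DC's mesh gateway.
        mode: local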

  59. Comms-operator overview
    (diagram: the comms-operator watches private-ingress/egress resources and creates Istio resources and a service definition)

  60. Cluster-to-cluster comm: overview
    (diagram: end-to-end traffic flow between DC1 and DC2)

  61. Public cloud, private cloud, it's all the same to me!

  62. I’m so happy I could cry

  63. Summary
    ● Implementation of a global addressing scheme
    ● Implementation of a private Istio gateway for routing traffic between clusters
    ● Implementation of Consul mesh for securing DC-to-DC communication
    ● Teams deploying on Kubernetes deliver their configuration via Helm
    ● Use of operators to dynamically create, tear down, and update ingress/egress as per configuration

  64. This presentation features not only our work, and not only our teams, but many, many teams across different organizations!
    Thank you: CDT, Pi, Scylla & friends
    (Slide not included in the presentation)

  65. If you want to get to know me better, visit workday.com/careers
    Thank you for attending this talk!
    You can follow my carers on Twitter at @davekerr95 and @plallin

  66. Resources to go further
    More info on why Workday decided to adopt Istio:
    https://speakerdeck.com/plallin/tales-of-deploying-istio-ingress
    Consul Deep Dive (Cody de Arkland, Luke Kysow, Erik Veld, HashiCorp, KubeCon EU 2020):
    https://www.youtube.com/watch?v=RhYujICfNoA
    Istio on Kubernetes deep dive (Daneyon Hansen, Cisco, KubeCon NA 2017):
    https://www.youtube.com/watch?v=Dd6xveYONgU
    & many more KubeCon talks :)
