Secure your cluster-to-cluster traffic, the agnostic way - Pauline Lallinec & Dave Kerr, Workday

Workday is shifting to a multi-cloud approach whereby its Kubernetes platform, known as Scylla, can be deployed to public cloud providers as well as Workday's own data centers. To achieve this, we needed to route tenant data across existing AWS clusters in different regions, to Workday's own data centers, and potentially, in the future, to other public cloud providers. While cloud providers usually have solutions for migrating data to their own cloud, Workday aims to be cloud-agnostic and therefore needs a solution for migrating data across clouds. The infrastructure, platform, and application development teams cooperated to develop a solution relying on Kubernetes operators, Istio, Consul, and Helm-delivered application configuration. This talk gives an overview of the tools and technologies used to migrate tenant data to other clusters, wherever they are deployed. We also review the lessons learned from this experience and give an overview of future work.

Pauline Lallinec

October 15, 2021

Transcript

  1. Data centers in Canada, Europe, USA. 95.18% of transactions had a response time of less than 1 second. 265 billion transactions in FY2021. Service uptime > 99.97%. Workday community = 55 million workers.
  2. Software Engineer - DevOps. Gaming, tabletop and CTF stuff. @davekerr95. Workday + K8s platform = Scylla. Public + private cloud (NA, EMEA). 6 teams, 2 continents.
  3. Workday + Public Infrastructure = Pi, aka Infrastructure Public Cloud. 2 teams, 2 continents. Software Engineer - DevOps. (Powered off) non-stop karaoke machine. @plallin.
  4. What to expect from this presentation
    • A use case for provider-agnostic cluster communication
    • An overview of a solution to secure cluster-to-cluster communication
    • A story of how Workday incrementally implemented that solution
    • An overview of the custom Kubernetes resources & operators developed by Workday
    • "Food for thought" at the end of this slide deck to go further / drive your own research
  5. Hey! My name is OX. I'm a service living in Cluster-1 in DC1. Sometimes, I want to send messages to my family in DC2.
  6. In the old days, I used the public-facing load balancer to reach my family. But securing communication over the public internet was very hard and required my carers to maintain complex configuration.
  7. You know what would make an old ox happy? Secure, easy-to-configure cluster-to-cluster communication.
  8. Cluster-to-Cluster Communications: Service Team Use Cases
    The need to communicate with services living in another cluster:
    • Services may be deployed in specific clusters based on their workload type
    • Services need to transfer tenanted data between services running in different clusters
    The need to easily configure these communication routes:
    • Should be easy to configure and deploy as a service team
    • Should be low touch for the Infrastructure & Platform Engineering teams
  9. Moving Tenanted Data Between Workday Instances
    The problem: as a service, we need to move tenanted data between instances of Workday living in different clusters.
    Communication to internal APIs: to move tenanted data between clusters, we rely on internal APIs that should not be available externally on the public gateway.
    Hey! How do I get access to https://dc2.workday.com/internalapi?
  10. Legacy solution. "Hey, can I get access to https://dc2.workday.com/internalapi?" "Sure, I can do that for you." (A service in Cluster 1 in DC1 reaches the public load balancer of Cluster 2 in DC2 over a private DC connection.)
  11. Legacy solution. "Hey, can I get access to https://dc2.workday.com/internalapi?" "Sure, I can do that for you." (A service in Cluster 1 in DC1 reaches the public load balancer of Cluster 2 in DC2 over a private DC connection.) This is a legacy load balancer.
  12. Legacy solution. "Hey, can I get access to https://dc2.workday.com/internalapi?" "Sure, I can do that for you." (A service in Cluster 1 in DC1 reaches the public load balancer of Cluster 2 in DC2 over a private DC connection.) This is a legacy load balancer. We want to retire these load balancers and replace them with Istio.
  13. What is Istio? Istio is an open-source, platform-independent service mesh that provides traffic management, policy enforcement, and telemetry collection.
    • Developed by Google + IBM + Lyft
    • Uses the Envoy proxy
    • Lets you manage your ingress resources as a set of Kubernetes resources
    • Runs inside the Kubernetes cluster
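For illustration, here is a minimal sketch of what "ingress as a set of Kubernetes resources" looks like in Istio; the gateway name, host, and TLS secret are placeholders, not Workday's configuration.

```yaml
# Minimal Istio Gateway: ingress declared as an ordinary Kubernetes resource.
# All names and hosts below are illustrative placeholders.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: public-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway        # bind to the default Istio ingress gateway pods
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: example-com-cert   # TLS secret living in istio-system
      hosts:
        - "example.com"
```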
  14. What is Envoy? Envoy is a high-performance proxy designed for cloud-native applications.
    • Originally developed by Lyft, in C++
    • Open source
    • Cloud-agnostic
    • Manages all inbound and outbound traffic
    • In Istio, Envoy proxies are deployed as sidecar containers
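As a side note, Istio typically injects those Envoy sidecars automatically once a namespace is labeled for injection; a minimal sketch (the namespace name is illustrative):

```yaml
# Labeling a namespace so Istio's webhook injects an istio-proxy (Envoy)
# sidecar container into every pod scheduled in it.
apiVersion: v1
kind: Namespace
metadata:
  name: example-service        # illustrative namespace name
  labels:
    istio-injection: enabled
```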
  15. This sounds great! So how can I use it to send messages to my family located in Cluster-2 in DC2?
  16. Istio: public ingress gateway. "Can I get example.com/internalURI?" "No!" "Can I get example.com/internalURI?" "Also no!" (The Istio public gateway does not expose internal URIs.)
  17. Wait a minute! You promised us a provider-agnostic talk and now you're talking about AWS PrivateLink. Therefore, isn't your talk basically...
  18. Wait a minute! You promised us a provider-agnostic talk and now you're talking about AWS PrivateLink. Therefore, isn't your talk basically...
  19. • We needed to solve the issue for the AWS use case
    • Developing a new, entirely agnostic cluster-to-cluster communication solution takes time, and we needed a solution for AWS sooner than that
    Decision:
    • Develop a solution for the AWS use case first
    • Then develop a solution that works across cloud providers (agnostic)
    Solution, V1: AWS PrivateLink Operator
  20. Delivering Cluster-to-Cluster Communications for Service Teams
    Service onboarding should be easy and as automated as possible:
    • Service teams deliver configuration via a mixture of per-cluster configurations and Helm-templated configurations
    • Service teams should be able to deliver the majority of the configuration without requiring the Platform Engineering or Infrastructure teams
  21. Delivering PrivateLinkIngress Configuration
    A Kubernetes CRD for delivering and building AWS PrivateLink connectivity:
    • Service teams deliver configuration as raw or templated Kubernetes resources, specifying just the Amazon Resource Names (ARNs) of the clusters they need to allow ingress from
    • Service teams can deploy and test these easily using standard Kubernetes toolchains (kubectl, Helm, etc.)
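The CRD itself isn't shown in the transcript, so the following is a hypothetical sketch of what such a resource could look like. The API group and field names are invented for illustration; only the idea (service teams list the ARNs allowed ingress) comes from the talk.

```yaml
# HYPOTHETICAL resource: the real PrivateLinkIngress schema is internal to
# Workday. Group, version, and field names below are assumptions.
apiVersion: infra.workday.example/v1alpha1
kind: PrivateLinkIngress
metadata:
  name: internalapi-ingress
spec:
  allowedPrincipals:                    # ARNs of clusters allowed to connect
    - arn:aws:iam::111122223333:root    # example account ARNs, not real ones
    - arn:aws:iam::444455556666:root
```

A team could then deploy and inspect it with the standard toolchain, e.g. `kubectl apply -f privatelink-ingress.yaml` followed by `kubectl get privatelinkingress`.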
  22. OK, so now I know how to send messages to my family, but how can I be sure they receive them?
  23. Delivering Istio Internal Gateway Resources
    Service onboarding should be easy and as automated as possible:
    • Service teams deliver configuration to the target cluster via Helm-templated VirtualServices
    • Service teams have the flexibility to add and remove routable paths, request headers, etc. with a pull request
    • Service teams follow an existing naming convention for the PrivateLink fully qualified domain names (FQDNs) used by their application and by Istio!
    • Easy to test with the standard Kubernetes toolchain!
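As an illustration of "configuration delivered by pull request", a service team's Helm values might look roughly like this; the key names are assumptions, not Workday's actual chart schema, and a chart template would render them into a VirtualService like the one sketched after slide 24.

```yaml
# HYPOTHETICAL values.yaml a service team might deliver; key names are
# illustrative, not Workday's chart schema.
internalGateway:
  host: dc2-internalapi.example.internal   # follows a PrivateLink FQDN naming convention
  routes:
    - path: /internalapi
      destination: internalapi.service-ns.svc.cluster.local
      port: 8443
```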
  24. Example: Istio VirtualService. In Istio, the VirtualService resource describes how network traffic should be routed around the cluster.
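The slide's YAML doesn't survive in the transcript, so here is a hedged reconstruction of a minimal VirtualService that binds an internal FQDN on a private gateway to an in-cluster service; the hosts, gateway, and service names are illustrative.

```yaml
# Minimal VirtualService: route requests for an internal FQDN, arriving on a
# private (non-public) Istio gateway, to an in-cluster service.
# All names are illustrative placeholders.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: internalapi
spec:
  hosts:
    - dc2-internalapi.example.internal
  gateways:
    - istio-system/private-gateway     # internal ingress gateway, not the public one
  http:
    - match:
        - uri:
            prefix: /internalapi
      route:
        - destination:
            host: internalapi.service-ns.svc.cluster.local
            port:
              number: 8443
```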
  25. Did I mention I have a very large family spread across many different datacenters? Sometimes, I don't even know if they live in an AWS datacenter or not...
  26. What is Consul? Consul automates networking for simple and secure application delivery.
    • Developed by HashiCorp (joined the CNCF in 2020)
    • Uses the Envoy proxy (just like Istio!)
    • Offers a solution for a multi-platform service mesh
    • Runs inside or outside the Kubernetes cluster
  27. Solution, V2: Implementation of Consul Service Mesh
    From our V1 solution, we keep:
    • Routing of traffic within the cluster via Istio
    • Helm for packaging and delivering configuration via operators
    We introduce:
    • A Consul cluster running outside the Kubernetes cluster
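A hedged sketch of what the in-cluster side of such a setup could look like with the official Consul Helm chart: Consul clients and a mesh gateway run inside Kubernetes and join servers running outside the cluster. Hostnames and replica counts are illustrative, not Workday's values.

```yaml
# Illustrative values for the hashicorp/consul Helm chart (circa 2021):
# no servers in-cluster, clients join external servers, mesh gateway enabled.
global:
  enabled: false                 # don't deploy Consul servers in the cluster
  datacenter: dc1
externalServers:
  enabled: true
  hosts: ["consul.dc1.example.internal"]   # illustrative external server address
client:
  enabled: true                  # node agents join the external servers
meshGateway:
  enabled: true                  # gateway for datacenter-to-datacenter traffic
  replicas: 2
```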
  28. Why did we implement Consul?
    Cloud independent: Consul is a service mesh for any runtime and cloud; as part of Workday's cloud strategy, we needed a solution that allowed us to link different clouds together.
  29. Why did we implement Consul?
    Cloud independent: Consul is a service mesh for any runtime and cloud; as part of Workday's cloud strategy, we needed a solution that allowed us to link different clouds together.
    Kubernetes independent: we needed to link to services that are not (yet?) deployed on Kubernetes. This allows us to have applications running on Kubernetes linked to applications running outside Kubernetes.
  30. Why did we implement Consul?
    Cloud independent: Consul is a service mesh for any runtime and cloud; as part of Workday's cloud strategy, we needed a solution that allowed us to link different clouds together.
    Kubernetes independent: we needed to link to services that are not (yet?) deployed on Kubernetes. This allows us to have applications running on Kubernetes linked to applications running outside Kubernetes.
    Fully featured with the power of Envoy: Istio and Consul both use Envoy, which means we aren't running different proxies, just different control planes.
  31. Why is it interesting to adopt a platform/provider-agnostic solution?
    • Services may be exposed from any datacenter
    • Services offered by different cloud providers can be linked together
    • It is easier to deploy or move services across datacenters / providers
  32. Consul & Istio
    • It is entirely possible to use Istio to achieve what we're doing with Consul, and to use Consul to achieve what we're doing with Istio
  33. Consul & Istio
    • It is entirely possible to use Istio to achieve what we're doing with Consul, and to use Consul to achieve what we're doing with Istio
    • In our setup, Istio sits a lot closer to the application level than Consul does, and we wanted some of Istio's features that aren't available (or are not as easy to set up) in Consul
  34. Consul & Istio
    • It is entirely possible to use Istio to achieve what we're doing with Consul, and to use Consul to achieve what we're doing with Istio
    • In our setup, Istio sits a lot closer to the application level than Consul does, and we wanted some of Istio's features that aren't available (or are not as easy to set up) in Consul
    • Likewise, reviewing the Istio and Consul documentation, we found that Consul adoption outside Kubernetes was better documented than Istio's
  35. Consul & Istio
    • It is entirely possible to use Istio to achieve what we're doing with Consul, and to use Consul to achieve what we're doing with Istio
    • In our setup, Istio sits a lot closer to the application level than Consul does, and we wanted some of Istio's features that aren't available (or are not as easy to set up) in Consul
    • Likewise, reviewing the Istio and Consul documentation, we found that Consul adoption outside Kubernetes was better documented than Istio's
    • Both solutions use Envoy as their proxy; we're simply using the control plane that best fits our needs at each layer of the stack
  36. Consul & Istio. In summary:
    • Istio: our choice for our Kubernetes platform; easy to implement in K8s and comes with a lot of interesting features for the service teams deploying on Kubernetes
    • Consul: securing inter-DC connections is very well documented; it's a core use case of Consul (cf. Consul's Mesh Gateway)
    • This represents Workday's analysis at the time we were reviewing solutions; both Istio and Consul are evolving, and it's possible that in the future we will review our architecture to adopt new or other solutions
  37. How does Consul work? Components of the Consul mesh:
    • Service catalogue offered via a cluster of Consul servers
    • Ingress gateway for incoming traffic to the Consul mesh
    • Terminating gateway for outgoing traffic from the Consul mesh (e.g. to send traffic to Istio)
    • Mesh gateway for datacenter-to-datacenter communication
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway
  38. How does Consul work? What happens when service1 in DC1 needs to talk to service2 in DC2?
    • The Consul agent queries the Consul server to find out where the service is
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway
  39. How does Consul work? What happens when service1 in DC1 needs to talk to service2 in DC2?
    • The Consul agent queries the Consul server to find out where the service is
    • If found within the DC: the query is routed to the relevant service via the ingress gateway
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway
  40. How does Consul work? What happens when service1 in DC1 needs to talk to service2 in DC2?
    • The Consul agent queries the Consul server to find out where the service is
    • If found within the DC: the query is routed to the relevant service via the ingress gateway
    • If not found within the DC: the query is routed to the mesh gateway
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway
  41. How does Consul work? What happens when service1 in DC1 needs to talk to service2 in DC2?
    • The Consul agent queries the Consul server to find out where the service is
    • If found within the DC: the query is routed to the relevant service via the ingress gateway
    • If not found within the DC: the query is routed to the mesh gateway
    • The mesh gateways know where services are located (routing based on server name via SNI sniffing)
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway
  42. How does Consul work? What happens when service1 in DC1 needs to talk to service2 in DC2?
    • The Consul agent queries the Consul server to find out where the service is
    • If found within the DC: the query is routed to the relevant service via the ingress gateway
    • If not found within the DC: the query is routed to the mesh gateway
    • The mesh gateways know where services are located (routing based on server name via SNI sniffing)
    • Traffic is routed to another mesh gateway. Mesh gateways use mTLS to establish trust, and they don't decrypt the traffic
    Ref: https://www.consul.io/docs/architecture + https://www.consul.io/docs/connect/gateways/mesh-gateway
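On Kubernetes, one common way to send cross-datacenter traffic through the local mesh gateway is Consul's ProxyDefaults custom resource; a minimal sketch using the upstream consul-k8s CRD (not necessarily Workday's exact setup):

```yaml
# Send all upstream traffic to other datacenters through the local mesh
# gateway first ("local" mode); mesh gateways then forward over mTLS without
# decrypting the traffic.
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global          # ProxyDefaults must be named "global"
spec:
  meshGateway:
    mode: local
```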
  43. Summary
    • Implementation of a global addressing scheme
    • Implementation of a private Istio gateway for routing traffic between clusters
    • Implementation of a Consul mesh for securing DC-to-DC communication
    • Teams deploying on Kubernetes deliver their configuration via Helm
    • Use of operators to dynamically create, tear down, and update ingress/egress as per the configuration
  44. This presentation features not only our work, and not only our teams, but many, many teams across different organizations! Thank you CDT, Pi, Scylla & friends. (Slide not included in the presentation.)
  45. If you want to get to know me better, visit workday.com/careers. Thank you for attending this talk! You can follow my carers on Twitter at @davekerr95 and @plallin.
  46. Resources to go further
    More info on why Workday decided to adopt Istio: https://speakerdeck.com/plallin/tales-of-deploying-istio-ingress
    Consul Deep Dive (Cody De Arkland, Luke Kysow, Erik Veld, HashiCorp, KubeCon EU 2020): https://www.youtube.com/watch?v=RhYujICfNoA
    Kubernetes on Istio deep dive (Daneyon Hansen, Cisco, KubeCon NA 2017): https://www.youtube.com/watch?v=Dd6xveYONgU
    ...and many more KubeCon talks :)