Resilient Multi-Cloud Strategies: Harnessing Kubernetes, Cluster API and Cell-Based Architecture

In today's multi-cloud world, resilience and high availability at scale are crucial. This session covers how we used Kubernetes with Cluster API and other cloud native components to deploy a cell-based architecture across multiple cloud providers, scaling to 270+ clusters and 18,000+ nodes. The independent, isolated cells contain failures and improve uptime, which in turn simplifies compliance, cost management, and disaster recovery planning.

We'll explore how Cluster API automates cluster creation, upgrades, and management across our multi-cloud setup, enhancing autonomy and resilience. We'll also highlight real-world use cases and share what we learned from the automation we built to manage k8s clusters efficiently while limiting operational overhead.

End users will learn how they can use Cluster API to automate multi-cloud cluster lifecycle management and leverage a cellular architecture to build a highly available setup.

This talk was given at KubeCon + CloudNativeCon Europe 2025 in London: https://kccnceu2025.sched.com/event/1txDE/resilient-multi-cloud-strategies-harnessing-kubernetes-cluster-api-and-cell-based-architecture-tasdik-rahman-javi-mosquera-new-relic

Tasdik Rahman

April 08, 2025

Transcript

  1. Resilient Multi-Cloud Strategies: Harnessing Kubernetes, Cluster API and Cell-Based Architecture
     Javi Mosquera & Tasdik Rahman, New Relic
  2. Outline
     • Context
       ◦ Scale at NR
       ◦ High Level Architecture
       ◦ Challenges faced
     • Solving the problem
       ◦ Moving to cellular architecture
       ◦ Standardising on Cluster API for Multi Cloud
         ▪ Cluster Bootstrapping automation
         ▪ Node Creation and Management
         ▪ Leveraging Karpenter
       ◦ Running Karpenter and Cluster API pools together
       ◦ Simplifying scheduling challenges faced
     • Lessons learned from our multi cloud setup
  3. What we do @New Relic
     Provide an intelligent observability platform that empowers developers to enhance digital experiences.
     • 85k active customers
     • 400M+ queries/day
     • 3 Exabytes per year
     • 12B events/minute
  4. K8s scale @New Relic
     Across different environments of testing, staging, production:
     • 280+ k8s clusters
     • 500k+ pods, typically between 5,000-7,000 pods/cluster
     • 21k+ nodes, typically between 300-500 nodes/cluster
     • Multiple cloud providers: AWS, Azure, GCP
  5. High-level architecture
     [Architecture diagram of the New Relic environment showing the ingest path and the query path: CDN, HTTP endpoints, edge; ingest, process and store; NRDB; alerts, product UIs and APIs.]
  6. Problem context
     • Monolithic infrastructure
       ◦ One huge Kafka cluster
       ◦ One multi-tenant DC/OS cluster
     • Hard to scale and operate
       ◦ Adding nodes
       ◦ Upgrade operations
       ◦ Huge blast radius
  7. Cells 101
     • Smallest unit that can live on its own
     • Inside are the components necessary for it to carry out its functions
     • They exchange energy and matter
  8. Cell-based architecture
     • Cell definition
       ◦ Self-contained installation that satisfies operations for a shard
     • Characteristics
       ◦ Independent units of scale
       ◦ Limited blast radius
       ◦ Repeatable pattern for scaling out
       ◦ Shard data across the cell fleet → cell router (toy sketch below)
     • Workload → cell type
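The cell router mentioned above is the piece that decides which cell owns a given shard of data. As a toy illustration only (the cell names, the shard key, and the hashing scheme are assumptions, not New Relic's actual router), a deterministic shard-to-cell mapping could look like this:

```python
# Toy cell-router sketch: deterministically map a shard key (here, an account
# id) onto one of the cells in the fleet. Cell names are made up for the example.
import hashlib

CELLS = ["ingest-cell-01", "ingest-cell-02", "ingest-cell-03"]  # hypothetical fleet


def route_to_cell(account_id: str) -> str:
    """Return the cell that should receive traffic/data for this account."""
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]


if __name__ == "__main__":
    print(route_to_cell("account-42"))  # the same key always lands on the same cell
```

A production router would also need to handle cells being added or drained without reshuffling every shard (for example via consistent hashing or an explicit shard-to-cell mapping), but the deterministic key-to-cell lookup is the core idea.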
  9. Anatomy of a New Relic cell
     • An AWS account / Azure subscription / GCP project
       ◦ 1 K8s cluster
       ◦ 1 Kafka cluster
       ◦ 1 VPC/VNet
     • Peered with other cell types
     • Ephemeral
     • Dozens of cell types
     • Cellular architecture continuously evolves
  10. Challenges at Scale
     • Asset management
       ◦ Cell inventory
       ◦ K8s clusters lifecycle
     • Multi cloud K8s implementation
     • Scheduling offering
     • Cost of infrastructure
  11. Why Cluster API
     • Lifecycle management of clusters
     • Declarative specification (see the sketch below)
     • Cloud agnostic
     • Manage K8s with K8s
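As a rough illustration of what "declarative specification" and "manage K8s with K8s" mean in practice, the sketch below creates a Cluster API Cluster object through the Kubernetes API using the official Python client. The cluster name, namespace, CIDR, and the referenced provider kinds and API versions are illustrative assumptions; the exact infrastructure and control-plane kinds depend on the CAPI provider and release in use.

```python
# Minimal sketch: declare a workload cluster as a Cluster API "Cluster" object.
# Requires the `kubernetes` Python client and a kubeconfig pointing at a
# management cluster that already has CAPI and a provider (here CAPA) installed.
from kubernetes import client, config

config.load_kube_config()  # use the management cluster's kubeconfig
api = client.CustomObjectsApi()

cluster = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "cell-example-01", "namespace": "default"},
    "spec": {
        "clusterNetwork": {"pods": {"cidrBlocks": ["192.168.0.0/16"]}},
        # Provider-specific objects; kinds and API versions vary per provider release.
        "controlPlaneRef": {
            "apiVersion": "controlplane.cluster.x-k8s.io/v1beta2",
            "kind": "AWSManagedControlPlane",
            "name": "cell-example-01-control-plane",
        },
        "infrastructureRef": {
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta2",
            "kind": "AWSManagedCluster",
            "name": "cell-example-01",
        },
    },
}

api.create_namespaced_custom_object(
    group="cluster.x-k8s.io",
    version="v1beta1",
    namespace="default",
    plural="clusters",
    body=cluster,
)
```

The same top-level declaration works whether the infrastructure reference points at AWS, Azure, or GCP provider objects, which is what keeps the lifecycle tooling cloud agnostic.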
  12. Cluster API @New Relic
     • Command and Control cluster
       ◦ Cell inventory (example below)
       ◦ K8s lifecycle management
     • Multi cloud K8s implementation
       ◦ Bootstrapping process
       ◦ Infrastructure agnostic providers
       ◦ A homogeneous operational approach
     • Self-contained control plane CAPI clusters
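Because each cell's cluster is represented as a Cluster object in the command-and-control cluster, a basic inventory view can be read straight from the management cluster's API. A minimal sketch, assuming the Python Kubernetes client and a kubeconfig for that management cluster:

```python
# List all Cluster API Cluster objects across namespaces as a simple inventory.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig for the command-and-control cluster
api = client.CustomObjectsApi()

clusters = api.list_cluster_custom_object(
    group="cluster.x-k8s.io", version="v1beta1", plural="clusters"
)
for item in clusters["items"]:
    meta = item["metadata"]
    phase = item.get("status", {}).get("phase", "Unknown")
    print(f'{meta["namespace"]}/{meta["name"]}: {phase}')
```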
  13. Scheduling Classes
     • What is it?
       ◦ A declarative way to express the scheduling requirements of applications without diving into the finer details of how they are implemented.
     • How does it work?
       ◦ At its heart is an admission controller, in the form of a mutating webhook acting on all resources of kind Rollout, Deployment and StatefulSet (sketch below).
     • Design Goals
       ◦ Cloud agnostic.
       ◦ Built on top of the scheduling primitives provided by k8s.
       ◦ Sane defaults.
       ◦ Allows an application to have multiple scheduling classes attached.
       ◦ Attached scheduling classes can add/negate already present rules.
       ◦ A specific scheduling class can have only one priority.
       ◦ Deterministic.
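To show the shape of such a mutating webhook, here is a heavily simplified sketch (not New Relic's implementation): it reads a hypothetical scheduling-class label on the incoming Deployment/StatefulSet/Rollout and injects a node selector into the pod template as a JSONPatch. Flask, the label name, and the class-to-rule mapping are all assumptions for illustration.

```python
# Simplified mutating-webhook sketch for "scheduling classes".
# A real deployment also needs TLS, a MutatingWebhookConfiguration, merging
# with existing rules, priorities between classes, etc.
import base64
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical mapping of scheduling class -> scheduling primitives to inject.
SCHEDULING_CLASSES = {
    "general-purpose": {"pool": "general"},
    "high-memory": {"pool": "high-memory"},
}


@app.route("/mutate", methods=["POST"])
def mutate():
    review = request.get_json()
    obj = review["request"]["object"]  # the Rollout/Deployment/StatefulSet
    labels = obj.get("metadata", {}).get("labels", {})
    node_selector = SCHEDULING_CLASSES.get(labels.get("scheduling-class"))

    patch = []
    if node_selector:
        # Translate the class into plain k8s scheduling primitives on the pod template.
        patch.append({
            "op": "add",
            "path": "/spec/template/spec/nodeSelector",
            "value": node_selector,
        })

    return jsonify({
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    })


if __name__ == "__main__":
    app.run(port=8443)  # behind TLS termination in a real setup
```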
  14. ControlPlane Maintenance Standardisation
     • K8s ControlPlane creation managed
       ◦ Via Argo CD Applications with underlying objects for
         ▪ AWSManagedControlPlane
         ▪ KubeadmControlPlane
     • Upgrades triggered via controllers acting on
       ◦ Homegrown ClusterLifeCycle CRD which tracks attributes like (illustrated below)
         ▪ K8s Version
         ▪ AMIVersion
         ▪ CellSelector
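The ClusterLifeCycle CRD is homegrown, so its real schema isn't public; purely to illustrate the idea, a resource tracking those attributes might look like the following (the API group, field names, and values are hypothetical):

```python
# Hypothetical ClusterLifeCycle resource: desired control-plane state for a set
# of cells, which an in-house controller reconciles against the CAPI objects.
cluster_lifecycle = {
    "apiVersion": "lifecycle.example.com/v1alpha1",  # made-up API group
    "kind": "ClusterLifeCycle",
    "metadata": {"name": "prod-upgrade-wave-1"},
    "spec": {
        "k8sVersion": "v1.29.x",           # target control-plane version (placeholder)
        "amiVersion": "example-ami-2025",  # target node image (placeholder)
        "cellSelector": {                  # which cells this rollout applies to
            "matchLabels": {"environment": "production", "cellType": "ingest"},
        },
    },
}
```

A controller watching such objects can then bump the corresponding fields on the matching AWSManagedControlPlane or KubeadmControlPlane objects and let Cluster API perform the actual rolling upgrade.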
  15. K8s Node Maintenance Standardisation
     • Node refresh could be due to
       ◦ Node upgrades
       ◦ Patches
     • Upgrades triggered via a controller acting on top of
       ◦ Homegrown WorkerConfiguration CRD which tracks attributes like
         ▪ amiVersion
         ▪ version
       ◦ Standardisation on top of CAPI and the CAPI cloud provider APIs
         ▪ Enabling drift on
           • MachineDeployments
           • AWSMachinePools
           • AzureMachinePools
           • MachinePools
       ◦ Using Karpenter's Node Drift feature to do the node rollout (sketch below)
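For the Karpenter path, node rollout can ride on drift: once the desired AMI changes on the node class, Karpenter marks existing nodes as drifted and replaces them. A minimal sketch of pushing such a change, assuming Karpenter on AWS with the v1beta1 EC2NodeClass API (the resource name, AMI id, and API version are illustrative and depend on the Karpenter release in use):

```python
# Sketch: bump the AMI on a Karpenter EC2NodeClass so drift detection rolls nodes.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

api.patch_cluster_custom_object(
    group="karpenter.k8s.aws",
    version="v1beta1",          # e.g. "v1" on newer Karpenter releases
    plural="ec2nodeclasses",
    name="default",             # hypothetical EC2NodeClass name
    body={"spec": {"amiSelectorTerms": [{"id": "ami-0123456789abcdef0"}]}},
)
```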
  16. Lessons Learned
     • Different Cluster API maturity levels across Cloud Service Providers
       ◦ Managing forks is a challenge.
       ◦ Version disparity between CAPI and its providers across cloud providers.
     • Managed services vs self-managed
       ◦ Self-managed is more work as a trade-off for flexibility.
       ◦ Automation standardisation is easier to maintain when expanding to a new cloud provider with cloud-agnostic APIs.
       ◦ Version parity of clusters is challenging to maintain between clouds when using a managed offering.
       ◦ Less flexibility in customisation when using a managed offering.
     • Karpenter adoption
       ◦ Groupless autoscaling benefits, automatic handling of ICE (insufficient capacity) events, efficient bin packing.
       ◦ Faster node autoscaling.
       ◦ Balancing reliability with cost effectiveness is tricky to get right.
       ◦ Giving teams the option to opt out if they can't tolerate Karpenter's node disruption rate.
       ◦ Expanding Karpenter further to the other cloud providers we run on.
  17. Become a Mentor for Underrepresented Groups in Open Source!
     Passionate about fostering an inclusive open source community? Sign up today! https://bit.ly/cncf-inclusive-mentor