Resilient Multi-Cloud Strategies: Harnessing Kubernetes, Cluster API and Cell-Based Architecture

In today's multi-cloud world, resilience and high availability at scale are crucial. This session covers how we used Kubernetes with Cluster API and other cloud native components to deploy a cell-based architecture across multiple cloud providers, scaling to 270+ clusters and 18,000+ nodes. The independent, isolated cells contain failures and improve uptime, which in turn simplifies compliance, cost management, and disaster recovery planning.

We'll explore how Cluster API automates cluster creation, upgrades, and management across our multi-cloud setup, enhancing autonomy and resilience. We'll also highlight real-world use cases and share what we learned from the automation we built to manage k8s clusters efficiently while limiting operational overhead.

End users will learn how they can use Cluster API to automate multi-cloud cluster lifecycle management and leverage a cellular architecture to build a highly available setup.

This talk was given at KubeCon + CloudNativeCon Europe 2025 in London: https://kccnceu2025.sched.com/event/1txDE/resilient-multi-cloud-strategies-harnessing-kubernetes-cluster-api-and-cell-based-architecture-tasdik-rahman-javi-mosquera-new-relic

Tasdik Rahman

April 08, 2025

Transcript

  1. Resilient Multi-Cloud Strategies: Harnessing Kubernetes, Cluster API and Cell-Based Architecture
     Javi Mosquera & Tasdik Rahman, New Relic
  2. Outline
     • Context
       ◦ Scale at NR
       ◦ High Level Architecture
       ◦ Challenges faced
     • Solving the problem
       ◦ Moving to cellular architecture
       ◦ Standardising on Cluster API for Multi Cloud
         ▪ Cluster Bootstrapping automation
         ▪ Node Creation and Management
         ▪ Leveraging Karpenter
       ◦ Running Karpenter and Cluster API pools together
       ◦ Simplifying scheduling challenges faced
     • Lessons learned from our multi cloud setup
  3. What we do @New Relic
     Provide an intelligent observability platform that empowers developers to enhance digital experiences.
     • 85k active customers
     • 400M+ queries/day
     • 3 Exabytes per year
     • 12B events/minute
  4. K8s scale @New Relic
     Across different environments of testing, staging, production:
     • 280+ k8s clusters
     • 500k+ pods, typically between 5,000-7,000 pods/cluster
     • 21k+ nodes, typically between 300-500 nodes/cluster
     • Multiple cloud providers: AWS, Azure, GCP
  5. High-level architecture
     [Architecture diagram of the New Relic environment showing the ingest path and the query path: CDN, HTTP endpoints, edge; ingest, process and store; NRDB; alerts, product UIs and APIs.]
  6. Problem context
     • Monolithic infrastructure
       ◦ One huge Kafka cluster
       ◦ One multi-tenant DC/OS cluster
     • Hard to scale and operate
       ◦ Adding nodes
       ◦ Upgrade operations
       ◦ Huge blast radius
  7. Cells 101
     • Smallest unit that can live on its own
     • Inside are the components necessary for it to carry out its functions
     • They exchange energy and matter
  8. Cell-based architecture
     • Cell definition
       ◦ Self-contained installation that satisfies operations for a shard
     • Characteristics
       ◦ Independent units of scale
       ◦ Limited blast radius
       ◦ Repeatable pattern for scaling out
       ◦ Shard data across the cell fleet → cell router (toy sketch below)
     • Workload → cell type
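The cell router mentioned above is the piece that decides which cell owns a given shard of data. As a toy illustration only (the cell names, the shard key, and the hashing scheme are assumptions, not New Relic's actual router), a deterministic shard-to-cell mapping could look like this:

```python
# Toy cell-router sketch: deterministically map a shard key (here, an account
# id) onto one of the cells in the fleet. Cell names are made up for the example.
import hashlib

CELLS = ["ingest-cell-01", "ingest-cell-02", "ingest-cell-03"]  # hypothetical fleet


def route_to_cell(account_id: str) -> str:
    """Return the cell that should receive traffic/data for this account."""
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]


if __name__ == "__main__":
    print(route_to_cell("account-42"))  # the same key always lands on the same cell
```

A production router would also need to handle cells being added or drained without reshuffling every shard (for example via consistent hashing or an explicit shard-to-cell mapping), but the deterministic key-to-cell lookup is the core idea.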
  9. Anatomy of a New Relic cell
     • An AWS account / Azure subscription / GCP project
       ◦ 1 K8s cluster
       ◦ 1 Kafka cluster
       ◦ 1 VPC/VNet
     • Peered with other cell types
     • Ephemeral
     • Dozens of cell types
     • Cellular architecture continuously evolves
  10. Challenges at Scale
     • Asset management
       ◦ Cell inventory
       ◦ K8s clusters lifecycle
     • Multi cloud K8s implementation
     • Scheduling offering
     • Cost of infrastructure
  11. Why Cluster API
     • Lifecycle management of clusters
     • Declarative specification (see the sketch below)
     • Cloud agnostic
     • Manage K8s with K8s
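As a rough illustration of what "declarative specification" and "manage K8s with K8s" mean in practice, the sketch below creates a Cluster API Cluster object through the Kubernetes API using the official Python client. The cluster name, namespace, CIDR, and the referenced provider kinds and API versions are illustrative assumptions; the exact infrastructure and control-plane kinds depend on the CAPI provider and release in use.

```python
# Minimal sketch: declare a workload cluster as a Cluster API "Cluster" object.
# Requires the `kubernetes` Python client and a kubeconfig pointing at a
# management cluster that already has CAPI and a provider (here CAPA) installed.
from kubernetes import client, config

config.load_kube_config()  # use the management cluster's kubeconfig
api = client.CustomObjectsApi()

cluster = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "cell-example-01", "namespace": "default"},
    "spec": {
        "clusterNetwork": {"pods": {"cidrBlocks": ["192.168.0.0/16"]}},
        # Provider-specific objects; kinds and API versions vary per provider release.
        "controlPlaneRef": {
            "apiVersion": "controlplane.cluster.x-k8s.io/v1beta2",
            "kind": "AWSManagedControlPlane",
            "name": "cell-example-01-control-plane",
        },
        "infrastructureRef": {
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta2",
            "kind": "AWSManagedCluster",
            "name": "cell-example-01",
        },
    },
}

api.create_namespaced_custom_object(
    group="cluster.x-k8s.io",
    version="v1beta1",
    namespace="default",
    plural="clusters",
    body=cluster,
)
```

The same top-level declaration works whether the infrastructure reference points at AWS, Azure, or GCP provider objects, which is what keeps the lifecycle tooling cloud agnostic.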
  12. Cluster API @New Relic
     • Command and Control cluster
       ◦ Cell inventory (example below)
       ◦ K8s lifecycle management
     • Multi cloud K8s implementation
       ◦ Bootstrapping process
       ◦ Infrastructure agnostic providers
       ◦ A homogeneous operational approach
     • Self-contained control plane CAPI clusters
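Because each cell's cluster is represented as a Cluster object in the command-and-control cluster, a basic inventory view can be read straight from the management cluster's API. A minimal sketch, assuming the Python Kubernetes client and a kubeconfig for that management cluster:

```python
# List all Cluster API Cluster objects across namespaces as a simple inventory.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig for the command-and-control cluster
api = client.CustomObjectsApi()

clusters = api.list_cluster_custom_object(
    group="cluster.x-k8s.io", version="v1beta1", plural="clusters"
)
for item in clusters["items"]:
    meta = item["metadata"]
    phase = item.get("status", {}).get("phase", "Unknown")
    print(f'{meta["namespace"]}/{meta["name"]}: {phase}')
```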
  13. Scheduling Classes
     • What is it?
       ◦ A declarative way to express the scheduling requirements of applications without diving into the finer details of how they are implemented.
     • How does it work?
       ◦ At its heart is an admission controller, in the form of a mutating webhook acting on all resources of kind Rollout, Deployment and StatefulSet (sketch below).
     • Design Goals
       ◦ Cloud agnostic.
       ◦ Built on top of the scheduling primitives provided by k8s.
       ◦ Sane defaults.
       ◦ Allows an application to have multiple scheduling classes attached.
       ◦ Attached scheduling classes can add/negate already present rules.
       ◦ A specific scheduling class can have only one priority.
       ◦ Deterministic.
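To show the shape of such a mutating webhook, here is a heavily simplified sketch (not New Relic's implementation): it reads a hypothetical scheduling-class label on the incoming Deployment/StatefulSet/Rollout and injects a node selector into the pod template as a JSONPatch. Flask, the label name, and the class-to-rule mapping are all assumptions for illustration.

```python
# Simplified mutating-webhook sketch for "scheduling classes".
# A real deployment also needs TLS, a MutatingWebhookConfiguration, merging
# with existing rules, priorities between classes, etc.
import base64
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical mapping of scheduling class -> scheduling primitives to inject.
SCHEDULING_CLASSES = {
    "general-purpose": {"pool": "general"},
    "high-memory": {"pool": "high-memory"},
}


@app.route("/mutate", methods=["POST"])
def mutate():
    review = request.get_json()
    obj = review["request"]["object"]  # the Rollout/Deployment/StatefulSet
    labels = obj.get("metadata", {}).get("labels", {})
    node_selector = SCHEDULING_CLASSES.get(labels.get("scheduling-class"))

    patch = []
    if node_selector:
        # Translate the class into plain k8s scheduling primitives on the pod template.
        patch.append({
            "op": "add",
            "path": "/spec/template/spec/nodeSelector",
            "value": node_selector,
        })

    return jsonify({
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    })


if __name__ == "__main__":
    app.run(port=8443)  # behind TLS termination in a real setup
```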
  14. ControlPlane Maintenance Standardisation
     • K8s ControlPlane creation managed
       ◦ Via Argo CD Applications with underlying objects for
         ▪ AWSManagedControlPlane
         ▪ KubeadmControlPlane
     • Upgrades triggered via controllers acting on
       ◦ Homegrown ClusterLifeCycle CRD which tracks attributes like (illustrated below)
         ▪ K8s Version
         ▪ AMIVersion
         ▪ CellSelector
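The ClusterLifeCycle CRD is homegrown, so its real schema isn't public; purely to illustrate the idea, a resource tracking those attributes might look like the following (the API group, field names, and values are hypothetical):

```python
# Hypothetical ClusterLifeCycle resource: desired control-plane state for a set
# of cells, which an in-house controller reconciles against the CAPI objects.
cluster_lifecycle = {
    "apiVersion": "lifecycle.example.com/v1alpha1",  # made-up API group
    "kind": "ClusterLifeCycle",
    "metadata": {"name": "prod-upgrade-wave-1"},
    "spec": {
        "k8sVersion": "v1.29.x",           # target control-plane version (placeholder)
        "amiVersion": "example-ami-2025",  # target node image (placeholder)
        "cellSelector": {                  # which cells this rollout applies to
            "matchLabels": {"environment": "production", "cellType": "ingest"},
        },
    },
}
```

A controller watching such objects can then bump the corresponding fields on the matching AWSManagedControlPlane or KubeadmControlPlane objects and let Cluster API perform the actual rolling upgrade.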
  15. K8s Node Maintenance Standardisation
     • Node refresh could be due to
       ◦ Node upgrades
       ◦ Patches
     • Upgrades triggered via a controller acting on top of
       ◦ Homegrown WorkerConfiguration CRD which tracks attributes like
         ▪ amiVersion
         ▪ version
       ◦ Standardisation on top of CAPI and the CAPI cloud provider APIs
         ▪ Enabling drift on
           • MachineDeployments
           • AWSMachinePools
           • AzureMachinePools
           • MachinePools
       ◦ Using Karpenter's Node Drift feature to do the node rollout (sketch below)
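For the Karpenter path, node rollout can ride on drift: once the desired AMI changes on the node class, Karpenter marks existing nodes as drifted and replaces them. A minimal sketch of pushing such a change, assuming Karpenter on AWS with the v1beta1 EC2NodeClass API (the resource name, AMI id, and API version are illustrative and depend on the Karpenter release in use):

```python
# Sketch: bump the AMI on a Karpenter EC2NodeClass so drift detection rolls nodes.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

api.patch_cluster_custom_object(
    group="karpenter.k8s.aws",
    version="v1beta1",          # e.g. "v1" on newer Karpenter releases
    plural="ec2nodeclasses",
    name="default",             # hypothetical EC2NodeClass name
    body={"spec": {"amiSelectorTerms": [{"id": "ami-0123456789abcdef0"}]}},
)
```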
  16. Lessons Learned
     • Different Cluster API maturity levels across Cloud Service Providers
       ◦ Managing forks is a challenge.
       ◦ Version disparity between CAPI and its providers across cloud providers.
     • Managed services vs self-managed
       ◦ Self-managed is more work as a trade-off for flexibility.
       ◦ Automation standardisation is easier to maintain when expanding to a new cloud provider with cloud-agnostic APIs.
       ◦ Version parity of clusters is challenging to maintain between clouds when using a managed offering.
       ◦ Less flexibility in customisation when using a managed offering.
     • Karpenter adoption
       ◦ Groupless autoscaling benefits, automatic handling of ICE (insufficient capacity) events, efficient bin packing.
       ◦ Faster node autoscaling.
       ◦ Balancing reliability with cost effectiveness is tricky to get right.
       ◦ Giving teams the option to opt out if they can't tolerate Karpenter's node disruption rate.
       ◦ Expanding Karpenter further to the other cloud providers we run on.
  17. Become a Mentor for Underrepresented Groups in Open Source!
     Passionate about fostering an inclusive open source community? Sign up today! https://bit.ly/cncf-inclusive-mentor