Slide 1

Slide 2

Resilient Multi-Cloud Strategies: Harnessing Kubernetes, Cluster API and Cell-Based Architecture
Javi Mosquera & Tasdik Rahman, New Relic

Slide 3

Javi Mosquera, Principal Software Engineer, New Relic
Tasdik Rahman, Senior Software Engineer, New Relic

Slide 4

Outline
● Context
  ○ Scale at NR
  ○ High-level architecture
  ○ Challenges faced
● Solving the problem
  ○ Moving to cellular architecture
  ○ Standardising on Cluster API for multi-cloud
    ■ Cluster bootstrapping automation
    ■ Node creation and management
    ■ Leveraging Karpenter
  ○ Running Karpenter and Cluster API pools together
  ○ Simplifying scheduling challenges faced
● Lessons learned from our multi-cloud setup

Slide 5

What we do @New Relic
We provide an intelligent observability platform that empowers developers to enhance digital experiences.
● 85k active customers
● 400M+ queries/day
● 3 exabytes per year
● 12B events/minute

Slide 6

K8s scale @New Relic
Across different environments: testing, staging, production.
● 280+ K8s clusters
● 500k+ pods (typically between 5,000-7,000 pods/cluster)
● 21k+ nodes (typically between 300-500 nodes/cluster)
● Multiple cloud providers: AWS, Azure, GCP

Slide 7

High-level architecture
[Diagram: the New Relic environment with an ingest path and a query path; CDN and Edge front HTTP endpoints, data is ingested, processed and stored in NRDB, feeding alerts and the product UIs and APIs]

Slide 8

Problem context
● Monolithic infrastructure
  ○ One huge Kafka cluster
  ○ One multi-tenant DC/OS cluster
● Hard to scale and operate
  ○ Adding nodes
  ○ Upgrade operations
  ○ Huge blast radius

Slide 9

Solving the problem

Slide 10

Cells 101
● Smallest unit that can live on its own
● Inside are the components necessary for it to carry out its functions
● They exchange energy and matter

Slide 11

Cell-based architecture
● Cell definition
  ○ A self-contained installation that satisfies operations for a shard
● Characteristics
  ○ Independent units of scale
  ○ Limited blast radius
  ○ Repeatable pattern for scaling out
  ○ Shard data across the cell fleet → cell router
● Workload → cell type

Slide 12

Anatomy of a New Relic cell
● An AWS account / Azure subscription / GCP project with:
  ○ 1 K8s cluster
  ○ 1 Kafka cluster
  ○ 1 VPC/VNet
● Peered with other cell types
● Ephemeral
● Dozens of cell types
● The cellular architecture continuously evolves

Slide 13

Cell traffic routing
[Diagram: ingest path and query path routed to cells within the New Relic environment]

Slide 14

Cell traffic routing

Slide 15

Challenges at scale
● Asset management
  ○ Cell inventory
  ○ K8s cluster lifecycle
● Multi-cloud K8s implementation
● Scheduling offering
● Cost of infrastructure

Slide 16

Why Cluster API
● Lifecycle management of clusters
● Declarative specification
● Cloud agnostic
● Manage K8s with K8s
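
As a hedged illustration of that declarative, cloud-agnostic specification, a minimal Cluster API cluster definition might look like the following (all names, namespaces and CIDRs are invented for the example, not New Relic's actual configuration):

```yaml
# Minimal Cluster API cluster sketch; every name here is illustrative.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: cell-a                # one K8s cluster per cell
  namespace: cells
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:            # what manages the control plane
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: cell-a-control-plane
  infrastructureRef:          # which cloud provider implements the infra
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: cell-a
```

Swapping `AWSCluster` for `AzureCluster` or `GCPCluster` changes the infrastructure provider while the surrounding workflow stays the same, which is what makes the approach cloud agnostic.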

Slide 17

Cluster API @New Relic
● Command-and-control cluster
  ○ Cell inventory
  ○ K8s lifecycle management
● Multi-cloud K8s implementation
  ○ Bootstrapping process
  ○ Infrastructure-agnostic providers
  ○ A homogeneous operational approach
● Self-contained control plane CAPI clusters

Slide 18

K8s Cluster Bootstrapping process inside New Relic

Slide 19

Overview

Slide 20

Azure Bootstrap

Slide 21

MachinePools creation
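
A sketch of what creating a worker pool could look like with the Cluster API MachinePool experiment (version numbers and names are assumptions for illustration, not New Relic's actual manifests):

```yaml
# Illustrative MachinePool; the infrastructureRef is where the
# provider-specific pool (AWSMachinePool, AzureMachinePool, ...) plugs in.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: cell-a-workers
  namespace: cells
spec:
  clusterName: cell-a
  replicas: 3
  template:
    spec:
      clusterName: cell-a
      version: v1.28.9             # illustrative K8s version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: cell-a-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachinePool
        name: cell-a-workers
```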

Slide 22

Node Pool Diversity

Slide 23

Scheduling workloads

Slide 24

Scheduling Classes
● What is it?
  ○ A declarative way to express the scheduling requirements of applications without diving into the finer details of how they are implemented.
● How does it work?
  ○ At its heart is an admission controller, in the form of a mutating webhook running on all resources of kind Rollout, Deployment and StatefulSet.
● Design goals
  ○ Cloud agnostic.
  ○ Built on top of the scheduling primitives provided by K8s.
  ○ Sane defaults.
  ○ Allows an application to have multiple scheduling classes attached.
  ○ Attached scheduling classes can add or negate already-present rules.
  ○ A specific scheduling class can have only one priority.
  ○ Deterministic.
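
The mutation step behind such a webhook can be sketched as follows. This is a minimal illustration, not New Relic's real implementation: the class names, labels and taint keys are all hypothetical, and it only shows how attached classes expand deterministically into K8s scheduling primitives as a JSONPatch.

```python
import base64
import json

# Hypothetical scheduling classes; each maps to plain K8s primitives.
SCHEDULING_CLASSES = {
    "general-purpose": {
        "nodeSelector": {"pool-type": "general"},
        "tolerations": [],
    },
    "spot-tolerant": {
        "nodeSelector": {"capacity-type": "spot"},
        "tolerations": [
            {"key": "spot", "operator": "Exists", "effect": "NoSchedule"}
        ],
    },
}


def build_patch(pod_template: dict, class_names: list[str]) -> list[dict]:
    """Expand the attached classes into a JSONPatch over the pod template.

    Classes are applied in order, so a later class can override (negate)
    a rule set by an earlier one, keeping the result deterministic.
    """
    node_selector = dict(pod_template.get("spec", {}).get("nodeSelector", {}))
    tolerations: list[dict] = []
    for name in class_names:
        cls = SCHEDULING_CLASSES[name]
        node_selector.update(cls["nodeSelector"])
        tolerations.extend(cls["tolerations"])
    return [
        {"op": "add", "path": "/spec/nodeSelector", "value": node_selector},
        {"op": "add", "path": "/spec/tolerations", "value": tolerations},
    ]


def admission_response(uid: str, patch: list[dict]) -> dict:
    """Wrap the patch in the AdmissionReview response shape a mutating
    webhook returns to the API server (patch is base64-encoded JSON)."""
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

In a real controller this logic would run inside the webhook handler for Deployments, StatefulSets and Rollouts; here it is isolated so the expansion rules are easy to see.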

Slide 25

Scheduling Classes

Slide 26

Running Karpenter and MachinePools
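
On the Karpenter side, a pool that mixes Spot and On-Demand capacity might be sketched like this (names, limits and values are illustrative assumptions):

```yaml
# Illustrative Karpenter NodePool (karpenter.sh v1 API).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:               # cloud-specific node class, AWS here
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  limits:
    cpu: "1000"
```

Static CAPI MachinePools can then host workloads that cannot tolerate Karpenter's disruption rate, while Karpenter handles the elastic remainder.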

Slide 27

Running Karpenter and MachinePools

Slide 28

ControlPlane Maintenance Standardisation
● K8s control plane creation managed
  ○ via Argo CD Applications with underlying objects for
    ■ AWSManagedControlPlane
    ■ KubeadmControlPlane
● Upgrades triggered via controllers acting on
  ○ a homegrown ClusterLifeCycle CRD, which tracks attributes like
    ■ K8s version
    ■ AMI version
    ■ Cell selector
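
An instance of the homegrown ClusterLifeCycle CRD could plausibly look like this; only the three tracked attributes come from the slide, while the API group, field spellings and values are assumptions:

```yaml
apiVersion: lifecycle.example.internal/v1alpha1  # hypothetical API group
kind: ClusterLifeCycle
metadata:
  name: ingest-cells-upgrade
spec:
  k8sVersion: v1.28.9     # K8s version (illustrative value)
  amiVersion: "2024.05"   # AMI version (illustrative value)
  cellSelector:           # which cells this lifecycle targets
    matchLabels:
      cell-type: ingest
```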

Slide 29

K8s Node Maintenance Standardisation
● Node refresh could be due to
  ○ Node upgrades
  ○ Patches
● Upgrades triggered via a controller acting on top of
  ○ a homegrown WorkerConfiguration CRD, which tracks attributes like
    ■ amiVersion
    ■ version
  ○ Standardisation on top of CAPI and CAPI cloud provider APIs
    ■ Enabling drift on
      ● MachineDeployments
      ● AWSMachinePools
      ● AzureMachinePools
      ● MachinePools
  ○ Using Karpenter's node drift feature to do node rollouts
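
Likewise, a hypothetical WorkerConfiguration instance; only the `amiVersion` and `version` attributes are grounded in the slide, everything else is an illustrative assumption:

```yaml
apiVersion: lifecycle.example.internal/v1alpha1  # hypothetical API group
kind: WorkerConfiguration
metadata:
  name: cell-a-workers
spec:
  version: v1.28.9        # K8s version of the worker nodes (illustrative)
  amiVersion: "2024.05"   # machine image version (illustrative)
```

When either field changes, drift on the underlying MachineDeployments/MachinePools (or Karpenter's node drift) rolls the nodes onto the new configuration.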

Slide 30

Lessons Learned
● Different Cluster API maturity levels for different cloud service providers
  ○ Managing forks is a challenge.
  ○ Version disparity of CAPI and its providers across cloud providers.
● Managed services vs self-managed
  ○ Self-managed is more work, as a trade-off for flexibility.
  ○ Automation standardisation is easier to maintain when expanding to a new cloud provider with cloud-agnostic APIs.
  ○ Version parity of clusters is challenging to maintain between different clouds when using a managed offering.
  ○ Less flexibility for customisation when using a managed offering.
● Karpenter adoption
  ○ Groupless autoscaling benefits, automatic handling of ICE events, efficient bin packing.
  ○ Faster node autoscaling.
  ○ Balancing reliability with cost effectiveness is tricky to get right.
  ○ Giving teams the ability to opt out if they can't tolerate Karpenter's node disruption rate.
  ○ Expanding Karpenter to more of the cloud providers we run on.

Slide 31

Thank you!

Slide 32

Become a Mentor for Underrepresented Groups in Open Source!
Passionate about fostering an inclusive open source community? Sign up today!
https://bit.ly/cncf-inclusive-mentor