Managing Thousands of Edge k8s Clusters with GitOps

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

© 2019 Volterra Inc. All Rights Reserved.

Volterra Backbone & Cloud (architecture diagram)
● Volterra SaaS Service: Volterra Global Controller (GC), also labeled VES Global Controller (SaaS)
● Operations & SRE Portal, CRM, Billing; Customer Portal
● Volterra Regional Edges (RE)
● Customer Sites (CE)
● IPSec/SSL connections; optional site-to-site connectivity
● Industrial-grade Volterra HW

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Challenges and Goals in Edge Management
● Scale to thousands of sites
● Fleet management (installation, upgrades)
  ○ Simple management of thousands of sites
  ○ Sites can be offline or unavailable at the time of a requested change
● Zero-touch installation (anybody can bring up a site)
● Everything must be managed remotely
● Fault tolerant: the system keeps operating after the failure of any component (factory reset, site rebuild in case of failure)

Slide 6

Slide 6 text

SRE Design Principles
● The entire system is described declaratively
● Immutable LCM: everything is a container (no packages and no mutable LCM tools such as Ansible or Puppet)
● GitOps: approvals, audit, and change workflows must go through git
● No kubectl and no scripts run from a central place
● Use what makes sense (not only the tools that are currently hyped)

Slide 7

Slide 7 text

VP-Manager
● Go-based daemon running as a Docker container under systemd
● Manages several layers
  ○ OS configuration (hugepages allocation, /etc/hosts)
  ○ Kubernetes installation and LCM
  ○ Workload management
  ○ Ongoing API configuration of various components (IPSec)
● Workload management is based on Kubernetes client-go
● Optimistic vs. pessimistic deployment
● Pre-update actions such as image pre-pull
● Retries and rollbacks in case of apply failures
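The "retries and rollbacks" bullet can be illustrated with a minimal sketch. This is not VP-Manager code (which is written in Go); the function name `apply_with_rollback`, the `apply` callback, and the retry count of 3 are all hypothetical and chosen only for illustration.

```python
from typing import Callable


def apply_with_rollback(
    apply: Callable[[str], None],
    current: str,
    target: str,
    max_retries: int = 3,  # hypothetical default, not from the slides
) -> str:
    """Try to apply `target`; if every retry fails, re-apply `current`.

    Returns the version that ends up active.
    """
    for attempt in range(1, max_retries + 1):
        try:
            apply(target)
            return target
        except RuntimeError as err:
            print(f"apply of {target} failed (attempt {attempt}): {err}")
    # All retries failed: roll back to the last known-good version.
    apply(current)
    return current
```

A real implementation would presumably also need to handle a failing rollback and a backoff between attempts; both are left out here to keep the control flow visible.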

Slide 8

Slide 8 text

Zero Touch Provisioning (diagram)
Components: Customer Edge running VPM (Volterra Platform Manager); GC running VP-Controller; RE1 and RE2; container registries (Azure CR at azurecr.io, and Quay)
1. Registration-Request with token
2. Registration-Config-Response with x509
3. Download container images
4. After the images are downloaded, start creating the K8s cluster and the Volterra infra software
5. CE successfully provisioned and IPSec tunnels are up

Slide 9

Slide 9 text

Upgrade Delivery - Pull Method
● OS and software upgrades
  ○ The OS follows A/B upgrades with partitions
  ○ Volterra software is a Kubernetes workload
● Upgrades must be simple (like cell phone updates)
● A device can be offline at the time of an upgrade
● The end user can decide when to upgrade
● Avoid a centralized CD tool
● Quick, and scales easily
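The pull method can be sketched as a decision the site makes each time it polls. This is a minimal illustration under stated assumptions: the `Site` type, the `auto_upgrade` flag, and the action names are hypothetical, and the version strings follow the `20191117-000043` pattern that appears later in the deck.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    running_version: str
    auto_upgrade: bool = True  # hypothetical flag: upgrade without asking the user


def next_action(site: Site, desired_version: str, user_approved: bool = False) -> str:
    """Decide what a site should do after polling for the desired version."""
    if desired_version == site.running_version:
        return "noop"
    if site.auto_upgrade or user_approved:
        return "upgrade"
    # The end user decides when to upgrade, so the site keeps its version for now.
    return "wait-for-approval"


site = Site(name="ce01-site", running_version="20191107-000042", auto_upgrade=False)
print(next_action(site, "20191117-000043"))
print(next_action(site, "20191117-000043", user_approved=True))
```

Because the site asks for its desired version instead of being pushed to, an offline device simply makes the same decision the next time it comes back and polls.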

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Volterra Scale Topology with 3k Sites (diagram)
● Volterra SaaS (Global Controller), with Operations & SRE Portal, CRM, Billing, and Customer Portal
● Volterra Regional Edges (including RE2), connected over IPSec/SSL
● On-prem Volterra Scaling APP: created all sites in locations
● Site groups 1, 2, ... 30, with 100 sites each
● Total sites: 3000
● Total IPSec/SSL tunnels: 6000

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Key Findings at Scale - Management
● Optimize CE configuration and certificate creation and delivery
  ○ Initially we processed registrations serially, and each took around 2 minutes
  ○ After optimization, each CE is processed in about 20 seconds, with an arbitrary number of workers
● Optimize delivery of Docker images
  ○ Reduce the size of Docker images and distribute them through REs
● Optimize Global Controller database operations
  ○ Optimize database operations (CE upgrade status from 3k sites)
  ○ Split the status DB instances
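To put the registration numbers in perspective, the sketch below estimates total registration time for the 3k-site test and shows one generic way to fan registrations out over a worker pool. The per-site timings (about 2 minutes, then about 20 seconds) come from the slide; the worker count of 10, the function names, and the use of a thread pool are assumptions for illustration only.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def total_minutes(sites: int, seconds_per_site: float, workers: int = 1) -> float:
    """Rough wall-clock estimate: sites are handled in batches of `workers`."""
    batches = math.ceil(sites / workers)
    return batches * seconds_per_site / 60


def register_all(
    site_names: List[str],
    register_one: Callable[[str], str],
    workers: int,
) -> List[str]:
    """Run `register_one` for every site on a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input names.
        return list(pool.map(register_one, site_names))


serial = total_minutes(3000, 120)               # serial, about 2 minutes each
parallel = total_minutes(3000, 20, workers=10)  # 10 workers is a hypothetical value
print(f"serial: {serial / 60:.0f} h, optimized: {parallel:.0f} min")
```

Under these assumptions, serial registration of 3000 sites takes on the order of 100 hours, while the optimized path with 10 workers takes on the order of 100 minutes, and the second figure shrinks further as workers are added.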

Slide 14

Slide 14 text

Key Findings at Scale - Monitoring
● New Prometheus federation filters to drop unused metrics and labels
  ○ Initially we had around 50k time series per CE, with an average of 15 labels
  ○ We optimized it to 2k per CE with average
  ○ Simple whitelists for metric names and blacklists for label names
● Move from global Prometheus federation to a Cortex cluster
  ○ A centralized Prometheus scraped the Prometheus instances of all REs and CEs
  ○ At 1k CEs, this became unsustainable
  ○ Currently there is one Prometheus per RE (federating the Prometheus instances of connected CEs), with remote write (RW) to Cortex
● Elasticsearch clusters and logs
  ○ Decentralized logging architecture
  ○ Fluent Bit as the collector on each node forwards logs to Fluentd (the aggregator) in the RE
  ○ Elasticsearch is deployed in every RE, using remote cluster search to query logs from a single Kibana instance
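The whitelist and blacklist idea can be sketched in a few lines. In a real deployment this filtering would more likely live in Prometheus federation or relabeling configuration than in application code; the function below only illustrates the logic, and the metric and label names in the example are chosen for illustration, not taken from the slides.

```python
from typing import Dict, Iterable, List, Set

Series = Dict[str, str]  # label name -> label value, metric name under "__name__"


def filter_series(
    series: Iterable[Series],
    allowed_metrics: Set[str],
    dropped_labels: Set[str],
) -> List[Series]:
    """Keep only whitelisted metric names and strip blacklisted labels."""
    kept = []
    for s in series:
        if s.get("__name__") not in allowed_metrics:
            continue  # metric name is not on the whitelist
        kept.append({k: v for k, v in s.items() if k not in dropped_labels})
    return kept


sample = [
    {"__name__": "up", "instance": "ce01", "pod_template_hash": "abc"},
    {"__name__": "go_gc_duration_seconds", "instance": "ce01"},
]
print(filter_series(sample, {"up"}, {"pod_template_hash"}))
```

One caveat of dropping labels: two series that differed only in a dropped label become identical afterwards, so the blacklist should only contain labels that are not needed to tell series apart.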

Slide 15

Slide 15 text

Slide 16

Slide 16 text

SRE Workflow for Customer Edges (diagram)
Components: Git, SRE Daemons, Executor, Config API, Artifact storage, VP Controller, Customer Edge, VPM, K8s Deploy
● Git: release new version 20191117-000043
● Render k8s manifests with the version in an annotation
● Load configuration into the Config API (version: 20191117-000043, "New version to Upgrade")
● Artifact storage paths (daemon //.yml): ce01-site/20191107-000042/prometheus.yml, ce01-site/20191117-000043/prometheus.yml
● VPM: poll for update ("I want upgrade")
● Upload status of upgrade
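The two artifact paths on this slide suggest a site/version/daemon layout, and the rendering step stamps the version into each manifest. The sketch below shows both ideas under stated assumptions: the layout is inferred from the two example paths only, and the annotation key `example.com/version` and both function names are hypothetical.

```python
import copy
from typing import Any, Dict

ANNOTATION_KEY = "example.com/version"  # hypothetical key, not from the slides


def artifact_path(site: str, version: str, daemon: str) -> str:
    """Build a path in the layout inferred from the slide's two examples."""
    return f"{site}/{version}/{daemon}.yml"


def with_version_annotation(
    manifest: Dict[str, Any], version: str, key: str = ANNOTATION_KEY
) -> Dict[str, Any]:
    """Return a copy of a manifest with the release version in its annotations."""
    rendered = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    rendered.setdefault("metadata", {}).setdefault("annotations", {})[key] = version
    return rendered


manifest = {"kind": "Deployment", "metadata": {"name": "prometheus"}}
print(artifact_path("ce01-site", "20191117-000043", "prometheus"))
print(with_version_annotation(manifest, "20191117-000043")["metadata"])
```

Version strings in this date-plus-counter format also sort correctly as plain strings, which makes it simple to tell which of two releases is newer.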