Managing Thousands of Edge k8s Clusters with GitOps

Jakub Pavlik
November 17, 2019

Cloud Native Rejekts 2019: the stream, including the demo, is available at https://youtu.be/tw6O2nigVTk?t=5880

The demo itself starts at https://youtu.be/tw6O2nigVTk?t=6744

This talk gives an overview of how we built a large-scale, fully open-source edge cloud platform. It maps the technology to real use cases and encourages community collaboration around realistic deployments. It shows real operational data at scale from one of the largest retailers in the world. The audience will see not only the k8s deployment but also app orchestration across thousands of k8s clusters.

Transcript

  1. © 2019 Volterra Inc. All Rights Reserved.
     Liberate your Infrastructure, Applications, and Data
     Managing Thousands of Edge k8s Clusters with GitOps
     Jakub Pavlik, @JakubPav
  2. Agenda
     • Challenges in Edge management
     • SRE Design Principles
     • Lessons learned from a scale of 3k Edges
     • Demo of fleet management
  3. Volterra Backbone & Cloud
     [Diagram labels: Volterra SaaS Service; VES Global Controller (SaaS); Volterra Global Controller (GC); Volterra Regional Edge (RE), shown four times; Customer Site (CE), shown nine times; Operations & SRE Portal, CRM, Billing; Customer Portal; IPSec/SSL links; Optional Site to Site; Industrial-grade Volterra HW.]
  4. What is meant by k8s Edge or CE site?
     • Intel NUC
     • Industrial-grade Volterra HW
     • AWS, Azure, VMware, GKE Virtual Machine
  5. Challenges and Goals in Edge management
     • Scale for thousands of sites
     • Fleet management (installation, upgrades)
       ◦ Simple management of thousands of sites
       ◦ Sites can be offline or unavailable at the time of a requested change
     • Zero Touch installation (anybody can bring up a site)
     • Everything must be managed remotely
     • Fault tolerant: the system keeps operating after the failure of any component (factory reset, rebuild site in case of failure)
  6. SRE Design Principles
     • The entire system is described declaratively
     • Immutable LCM: everything is a container (no packages or mutable LCM such as Ansible or Puppet)
     • GitOps: approvals, audit, and workflows of changes must go through git
     • No kubectl, no scripts run from a central place
     • Use what makes sense (not only the tools in hype)
  7. VP-Manager
     • Go-based daemon running as a systemd Docker container
     • Manages several layers
       ◦ OS configuration (hugepages allocation, /etc/hosts)
       ◦ Kubernetes installation and LCM
       ◦ Workload management
       ◦ Ongoing API configuration of various components (IPSec)
     • Workload management is based on Kubernetes client-go
     • Optimistic vs pessimistic deployment
     • Pre-update actions like pre-pull
     • Retries and rollbacks in case of apply failures
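The "retries and rollbacks in case of apply failures" behaviour can be sketched as follows. This is a minimal illustration in Python, not VP-Manager's actual Go code: the function `apply_with_rollback`, the `flaky_apply` stand-in, and the retry count of 3 are all assumptions made for the example. The version strings are borrowed from the demo slide purely as sample values.

```python
from typing import Callable, List


def apply_with_rollback(
    apply: Callable[[str], None], new: str, old: str, retries: int = 3
) -> str:
    """Try to apply `new` up to `retries` times; if every attempt fails,
    re-apply `old`. Returns the version that ends up running."""
    for _ in range(retries):
        try:
            apply(new)
            return new
        except RuntimeError:
            # Apply failed; loop around and retry.
            pass
    # All retries failed: roll back to the previous version.
    apply(old)
    return old


calls: List[str] = []


def flaky_apply(version: str) -> None:
    """Hypothetical apply step: the newer version always fails here."""
    calls.append(version)
    if version == "20191117-000043":
        raise RuntimeError("apply failed")


running = apply_with_rollback(flaky_apply, "20191117-000043", "20191107-000042")
```

In this run the new version is attempted three times and the site ends up back on the old version. A real implementation would likely add backoff between attempts and report the failure upstream.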
  8. Zero Touch Provisioning
     [Diagram with five numbered steps. Components: Azure CR & Quay (azurecr.io); Customer Edge with VPM (Volterra Platform Manager); GC with VP-Controller; RE1; RE2. Flow labels, in the order they appear in the extracted text: Registration-Request with Token; Registration-Config-Response x509; Download Containers Images; CE successfully provisioned and IPSec tunnels are UP; after images download, start creating the K8s cluster and Volterra infra software.]
  9. Upgrade delivery - Pull Method
     • OS and software upgrades
       ◦ OS follows A/B upgrades with partitions
       ◦ Volterra software is a Kubernetes workload
     • Upgrades must be simple (like cell phone updates)
     • Device can be offline at the time of upgrade
     • End user can decide when to upgrade
     • Avoid a centralized CD tool
     • Quick and scales easily
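The pull method described above can be illustrated with a small sketch: the device compares its running version with the desired version whenever it is able to poll, so a change requested while the device is offline is picked up on the next successful poll. The `Site` class, the `poll_for_update` function, and the `approved` flag (standing in for "end user can decide when to upgrade") are hypothetical names chosen for this example, not part of the Volterra software.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Site:
    name: str
    running: str
    online: bool = True


def poll_for_update(site: Site, desired: Dict[str, str], approved: bool = True) -> bool:
    """One poll cycle executed by the device itself.
    Returns True if an upgrade was applied during this cycle."""
    if not site.online:
        return False  # nothing happens; the next poll will catch up
    target = desired.get(site.name)
    if target is None or target == site.running or not approved:
        return False
    site.running = target
    return True


# Desired state as published centrally (sample versions from the demo slide).
desired = {"ce01-site": "20191107-000042"}
site = Site(name="ce01-site", running="20191107-000042", online=False)

desired["ce01-site"] = "20191117-000043"  # new release while the site is offline
missed = poll_for_update(site, desired)

site.online = True
deferred = poll_for_update(site, desired, approved=False)  # user postpones
upgraded = poll_for_update(site, desired)                  # user accepts
```

Because the central side only publishes the desired version and never pushes, it needs no per-site connection state, which is one plausible reading of "quick and scales easily".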
  10. Lessons learned from a scale of 3k Edges
  11. Volterra Scale Topology with 3k sites
     [Diagram labels: Volterra SaaS (Global Controller); VES Global Controller (SaaS); Volterra Regional Edge; Volterra Regional Edge (RE2); Operations & SRE Portal, CRM, Billing; Customer Portal; Volterra Scaling APP (created all sites in locations); On-Prem groups numbered 1, 2, 30 with 100 Sites each; IPSec/SSL links. Total Sites: 3000. Total IPSec/SSL Tunnels: 6000.]
  12. 3k Customer Edges connected to a single Regional Edge
  13. Key Findings at Scale - Management
     • Optimize CE configuration/certificate creation/delivery
       ◦ Initially we processed registrations serially, and each took around 2 minutes
       ◦ After optimization, each CE is processed in 20 seconds, with an arbitrary number of workers
     • Optimize delivery of Docker images
       ◦ Reduce the size of Docker images and distribute them through REs
     • Optimize Global Controller database operations
       ◦ Optimize database operations (CE upgrade status from 3k sites)
       ◦ Split status DB instances
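To put the registration numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that all 3000 sites register in one batch, that registrations are independent, and that 10 workers are used; the slide only says "arbitrary number of workers". The `register` function is a hypothetical stand-in for certificate and configuration creation.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def total_seconds(sites: int, per_site_seconds: int, workers: int = 1) -> int:
    """Wall-clock estimate when work is spread evenly across workers."""
    return math.ceil(sites / workers) * per_site_seconds


serial_before = total_seconds(3000, 120)              # 2 minutes each, one at a time
parallel_after = total_seconds(3000, 20, workers=10)  # 20 seconds each, 10 workers


def register(site_name: str) -> str:
    """Hypothetical stand-in for creating a site's config and certificate."""
    return f"{site_name}: registered"


site_names = [f"ce{i:04d}-site" for i in range(1, 3001)]
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() preserves input order while the work runs on the pool.
    results = list(pool.map(register, site_names))
```

Under these assumptions the serial path takes 360000 seconds (100 hours) and the optimized path 6000 seconds (100 minutes), which shows why both the per-site cost and the worker count matter.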
  14. Key Findings at Scale - Monitoring
     • New Prometheus federation filters to drop unused metrics and labels
       ◦ Initially we had around 50k time series per CE with an average of 15 labels
       ◦ We optimized it to 2k per CE with average
       ◦ Simple whitelists for metric names and blacklists for label names
     • Move from global Prometheus federation to a Cortex cluster
       ◦ A centralized Prometheus scraped the Prometheus instances of all REs and CEs
       ◦ At 1k CEs, this became unsustainable
       ◦ Currently one Prometheus per RE (federating the Prometheus instances of its connected CEs) with remote write (RW) to Cortex
     • Elasticsearch clusters and logs
       ◦ Decentralized logging architecture
       ◦ Fluent Bit as collector on each node forwards logs to Fluentd (aggregator) in the RE
       ◦ Elasticsearch deployed in every RE, using remote cluster search to query logs from a single Kibana instance
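The whitelist/blacklist idea from the monitoring findings can be sketched in a few lines. This is an illustration of the filtering logic only: the sample series, the allowed metric set, and the dropped label set are made up for the example, and the talk does not state how the filters were actually configured. In Prometheus itself, this kind of filtering is commonly expressed with `metric_relabel_configs` rules using the `keep` and `labeldrop` actions.

```python
from typing import Dict, Iterable, List, Set, Tuple

Series = Tuple[str, Dict[str, str]]  # (metric name, labels)


def filter_series(
    series: Iterable[Series], allowed_metrics: Set[str], dropped_labels: Set[str]
) -> List[Series]:
    """Keep only whitelisted metric names and strip blacklisted label names."""
    kept: List[Series] = []
    for name, labels in series:
        if name not in allowed_metrics:
            continue  # metric not on the whitelist: drop the whole series
        slim = {k: v for k, v in labels.items() if k not in dropped_labels}
        kept.append((name, slim))
    return kept


# Hypothetical sample data.
sample: List[Series] = [
    ("node_cpu_seconds_total", {"cpu": "0", "mode": "idle", "pod_template_hash": "abc"}),
    ("go_gc_duration_seconds", {"quantile": "0.5"}),
]
filtered = filter_series(
    sample,
    allowed_metrics={"node_cpu_seconds_total"},
    dropped_labels={"pod_template_hash"},
)
```

One caveat with label blacklists: removing a label can make two previously distinct series identical, so the dropped labels should be ones that do not distinguish series you still need.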
  15. DEMO - Fleet management of CE upgrade
  16. SRE Workflow for Customer Edges
     [Diagram labels: Git (release new version 20191117-000043); SRE Daemons: Executor, Config API, Artifact storage, VP Controller; Customer Edge: VPM, K8s Deploy. Flow labels: Render k8s manifests with version in annotation; Load configuration into Config API daemon; New version to Upgrade; Poll for Update; I want upgrade; Upload status of upgrade.]
     Artifact layout: <site-name>/<version>/<manifest-per-app>.yml
     • ce01-site/20191107-000042/prometheus.yml
     • ce01-site/20191117-000043/prometheus.yml
     version: 20191117-000043
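The artifact layout `<site-name>/<version>/<manifest-per-app>.yml` shown on the slide lends itself to simple path handling. The sketch below builds and parses such paths. The helper names are hypothetical, and selecting the newest version by string comparison is an assumption that only holds if version strings keep the fixed-width timestamp form seen on the slide (for example `20191117-000043`).

```python
from typing import List, Tuple


def manifest_path(site: str, version: str, app: str) -> str:
    """Build a path following <site-name>/<version>/<manifest-per-app>.yml."""
    return f"{site}/{version}/{app}.yml"


def parse_manifest_path(path: str) -> Tuple[str, str, str]:
    """Split a manifest path back into (site, version, app)."""
    site, version, filename = path.split("/")
    if not filename.endswith(".yml"):
        raise ValueError(f"not a manifest path: {path}")
    return site, version, filename[: -len(".yml")]


def newest_version(paths: List[str], site: str) -> str:
    """Assumes fixed-width timestamp versions, which sort lexicographically."""
    versions = [v for s, v, _ in map(parse_manifest_path, paths) if s == site]
    return max(versions)


paths = [
    manifest_path("ce01-site", "20191107-000042", "prometheus"),
    manifest_path("ce01-site", "20191117-000043", "prometheus"),
]
latest = newest_version(paths, "ce01-site")
```

Keeping every rendered version side by side, as the two example paths suggest, would also make it straightforward to point a site back at an earlier version.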
  17. Contacts
     • @JakubPav
     • @Volterra_
     • https://medium.com/volterra-io
     • https://gitlab.com/volterra.io