Managing Thousands of Edge k8s Clusters with GitOps

Jakub Pavlik
November 17, 2019

Presented at Cloud Native Rejekts 2019. The stream, including the demo, is available at https://youtu.be/tw6O2nigVTk?t=5880

The demo itself starts at https://youtu.be/tw6O2nigVTk?t=6744

We provide a comprehensive overview of how we built a large-scale, fully open-source edge cloud platform. The talk maps the technology to real use cases and encourages community collaboration around realistic deployments. It shows real operational data at scale from one of the largest retailers in the world. The audience will see not only the k8s deployment but also app orchestration across thousands of k8s clusters.

Transcript

  1. © 2019 Volterra Inc. All Rights Reserved.
    Liberate your Infrastructure, Applications, and Data
    Managing Thousands of Edge k8s Clusters with GitOps
    Jakub Pavlik
    @JakubPav

  2. © 2019 Volterra Inc. All Rights Reserved.
    Agenda
    ● Challenges in Edge management
    ● SRE Design Principles
    ● Lessons learned at a scale of 3k Edges
    ● Demo of fleet management

  3. © 2019 Volterra Inc. All Rights Reserved.
    Volterra Backbone & Cloud
    [Architecture diagram. Labels shown: Volterra SaaS Service; Volterra Global Controller (GC), also labeled VES Global Controller (SaaS); Operations & SRE Portal, CRM, Billing, Customer Portal; four Volterra Regional Edges (RE); nine Customer Sites (CE), one on industrial-grade Volterra HW; IPSec/SSL links; optional site-to-site connectivity.]

  4. © 2019 Volterra Inc. All Rights Reserved.
    What is meant by k8s Edge or CE site?
    ● Intel NUC
    ● Industrial-grade Volterra HW
    ● AWS, Azure, VMware, GKE virtual machine

  5. © 2019 Volterra Inc. All Rights Reserved.
    Challenges and Goals in Edge management
    ● Scale to thousands of sites
    ● Fleet management (installation, upgrades)
    ○ Simple management of thousands of sites
    ○ Sites can be offline or unavailable at the time of a requested change
    ● Zero-touch installation (anybody can bring up a site)
    ● Everything must be managed remotely
    ● Fault tolerant: the system keeps operating after the failure of any component (factory reset, site rebuild in case of failure)

  6. © 2019 Volterra Inc. All Rights Reserved.
    SRE Design Principles
    ● The entire system is described declaratively
    ● Immutable LCM (lifecycle management): everything is a container (no packages and no mutable LCM such as Ansible or Puppet)
    ● GitOps: approvals, audits, and change workflows must go through Git
    ● No kubectl, no scripts run from a central place
    ● Use what makes sense (not only the tools currently in hype)
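
As an illustration of the declarative principle above, a minimal sketch of comparing a desired state (as it would be stored in Git) with the actual state of a site and deriving the actions needed to converge. This is not Volterra's implementation; `plan_changes`, the component names, and the use of plain dictionaries are assumptions made for the example.

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare desired and actual component versions and list the actions to converge."""
    create = sorted(desired.keys() - actual.keys())
    delete = sorted(actual.keys() - desired.keys())
    update = sorted(
        name for name in desired.keys() & actual.keys()
        if desired[name] != actual[name]
    )
    return {"create": create, "update": update, "delete": delete}


# Hypothetical component names; the version strings reuse the format shown in the demo.
desired_state = {"prometheus": "20191117-000043", "fluentbit": "20191117-000043"}
actual_state = {"prometheus": "20191107-000042", "legacy-agent": "20191107-000042"}
print(plan_changes(desired_state, actual_state))
```

Because the plan is computed from state rather than from a sequence of commands, running it again after convergence yields no actions.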

  7. © 2019 Volterra Inc. All Rights Reserved.
    VP-Manager
    ● Go-based daemon running as a Docker container managed by systemd
    ● Manages several layers
    ○ OS configuration (hugepages allocation, /etc/hosts)
    ○ Kubernetes installation and LCM
    ○ Workload management
    ○ Ongoing API configuration of various components (IPSec)
    ● Workload management is based on Kubernetes client-go
    ● Optimistic vs. pessimistic deployment
    ● Pre-update actions such as image pre-pull
    ● Retries and rollbacks in case of apply failures
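
The retry and rollback behaviour listed above can be sketched roughly as follows. This is not VP-Manager code (the slides describe it as a Go daemon built on client-go); `apply_with_rollback`, `make_fake_apply`, and the version names are hypothetical and exist only to show the control flow.

```python
from typing import Callable, Dict


def apply_with_rollback(
    apply: Callable[[str], None],
    current: str,
    target: str,
    max_retries: int = 3,
) -> str:
    """Try to apply `target`; after `max_retries` failures, re-apply `current`.

    Returns the version that is running when the function exits.
    """
    for attempt in range(1, max_retries + 1):
        try:
            apply(target)
            return target  # apply succeeded, the target version is running
        except RuntimeError as err:
            print(f"attempt {attempt}/{max_retries} failed: {err}")
    # All retries failed: roll back to the last known-good version.
    apply(current)
    return current


def make_fake_apply(failures: Dict[str, int]) -> Callable[[str], None]:
    """Build a fake apply function that fails a given number of times per version."""
    remaining = dict(failures)

    def apply(version: str) -> None:
        if remaining.get(version, 0) > 0:
            remaining[version] -= 1
            raise RuntimeError(f"apply of {version} failed")

    return apply
```

With `make_fake_apply({"v2": 2})` the third attempt succeeds and `v2` is returned; with `make_fake_apply({"v2": 10})` the retries are exhausted and the function falls back to `v1`.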

  8. © 2019 Volterra Inc. All Rights Reserved.
    Zero Touch Provisioning
    [Diagram. Elements shown: Customer Edge running VPM (Volterra Platform Manager); GC running VP-Controller; RE1 and RE2; Azure CR (azurecr.io) and Quay as image registries. Steps numbered 1 to 5 carry these labels:]
    ● Registration-Request with Token
    ● Registration-Config-Response (x509)
    ● Download container images
    ● After the image download, start creating the K8s cluster and the Volterra infra software
    ● CE successfully provisioned and IPSec tunnels are UP
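
A simplified model of the registration handshake shown on this slide, assuming the steps happen in the order request, response, image download, cluster creation, tunnels up. The class names, the token value, the certificate placeholder, and the image name are all hypothetical; no real Volterra API is represented here.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class RegistrationConfigResponse:
    site_name: str
    certificate: str          # stand-in for an x509 certificate
    images: Tuple[str, ...]   # container images the site should download


class Controller:
    """Toy controller that accepts registration requests carrying a known token."""

    def __init__(self, valid_tokens: Set[str]) -> None:
        self._valid_tokens = set(valid_tokens)

    def register(self, site_name: str, token: str) -> RegistrationConfigResponse:
        if token not in self._valid_tokens:
            raise PermissionError(f"unknown registration token for {site_name}")
        return RegistrationConfigResponse(
            site_name=site_name,
            certificate=f"placeholder-cert-for-{site_name}",
            images=("example.azurecr.io/infra:20191117-000043",),
        )


def provision(controller: Controller, site_name: str, token: str) -> List[str]:
    """Walk through the provisioning steps and return a log of what happened."""
    steps: List[str] = []
    config = controller.register(site_name, token)
    steps.append("registered")
    for image in config.images:
        steps.append(f"pulled {image}")
    steps.append("k8s cluster created")
    steps.append("infra software deployed")
    steps.append("tunnels up")
    return steps
```

The point of the sketch is that the site only needs a token to start: everything else (certificate, image list) arrives in the response, which is what makes the installation zero touch.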

  9. © 2019 Volterra Inc. All Rights Reserved.
    Upgrade delivery - Pull Method
    ● OS and software upgrades
    ○ The OS follows A/B upgrades with partitions
    ○ Volterra software is a Kubernetes workload
    ● Upgrades must be simple (like cell phone updates)
    ● The device can be offline at the time of the upgrade
    ● The end user can decide when to upgrade
    ● Avoid a centralized CD tool
    ● Quick and scales easily
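
A minimal sketch of the pull method described above, assuming the device periodically asks for the desired version and upgrades only once the end user has approved. The `Device` class and the status strings are hypothetical; the version strings reuse the format shown later in the demo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    current_version: str

    def poll(self, desired_version: Optional[str], approved: bool = False) -> str:
        """Run one poll cycle and report what happened.

        `desired_version` is None when the controller cannot be reached,
        which models a device that is offline at the time of the upgrade.
        """
        if desired_version is None:
            return "offline"
        if desired_version == self.current_version:
            return "up-to-date"
        if not approved:
            return "pending-approval"
        self.current_version = desired_version
        return "upgraded"


device = Device(current_version="20191107-000042")
print(device.poll(None))                              # controller unreachable
print(device.poll("20191117-000043"))                 # update seen, not yet approved
print(device.poll("20191117-000043", approved=True))  # update applied
```

Since each device initiates the exchange itself, no central tool has to track which devices were offline when a release was published; a device simply picks up the latest desired version on its next poll.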

  10. © 2019 Volterra Inc. All Rights Reserved.
    Lessons learned at a scale of 3k Edges

  11. © 2019 Volterra Inc. All Rights Reserved.
    Volterra Scale Topology with 3k sites
    [Topology diagram. Labels shown: Volterra SaaS (Global Controller), also labeled VES Global Controller (SaaS); Operations & SRE Portal, CRM, Billing, Customer Portal; Volterra Regional Edge and Volterra Regional Edge (RE2); IPSec/SSL links; Volterra Scaling App ("Created all sites in locations"); three groups of 100 sites labeled 1, 2, and 30, next to the label On-Prem.]
    Total Sites: 3000
    Total IPSec/SSL Tunnels: 6000

  12. © 2019 Volterra Inc. All Rights Reserved.
    3k Customer Edges connected to a single Regional Edge

  13. © 2019 Volterra Inc. All Rights Reserved.
    Key Findings at Scale - Management
    ● Optimize CE configuration and certificate creation and delivery
    ○ Initially we processed registrations serially, and each took around 2 minutes
    ○ After optimization, each CE is processed in 20 seconds, with an arbitrary number of workers
    ● Optimize delivery of Docker images
    ○ Reduce the size of Docker images and distribute them through REs
    ● Optimize Global Controller database operations
    ○ Optimize database operations (CE upgrade status from 3k sites)
    ○ Split the status DB instances
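
To put the registration numbers above in perspective, a small back-of-the-envelope calculation for the 3000 sites of the scale test. It assumes registrations are independent and that each worker handles one registration at a time; the worker counts of 10 and 100 are illustrative values, not figures from the talk.

```python
import math


def registration_time_seconds(sites: int, seconds_per_site: int, workers: int = 1) -> int:
    """Total wall-clock time when `workers` registrations run in parallel."""
    batches = math.ceil(sites / workers)
    return batches * seconds_per_site


SITES = 3000

serial = registration_time_seconds(SITES, seconds_per_site=120)
ten_workers = registration_time_seconds(SITES, seconds_per_site=20, workers=10)
hundred_workers = registration_time_seconds(SITES, seconds_per_site=20, workers=100)

print(f"serial, 2 min each:      {serial / 3600:.0f} hours")
print(f"10 workers, 20 s each:   {ten_workers / 60:.0f} minutes")
print(f"100 workers, 20 s each:  {hundred_workers / 60:.0f} minutes")
```

Under these assumptions, serial processing of 3000 sites takes 100 hours, while 10 workers at 20 seconds per site need 100 minutes and 100 workers need 10 minutes.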

  14. © 2019 Volterra Inc. All Rights Reserved.
    Key Findings at Scale - Monitoring
    ● New Prometheus federation filters to drop unused metrics and labels
    ○ Initially we had around 50k time series per CE, with an average of 15 labels
    ○ We optimized it to 2k per CE with average
    ○ Simple whitelists for metric names and blacklists for label names
    ● Move from global Prometheus federation to a Cortex cluster
    ○ A centralized Prometheus scraped the Prometheus instances of all REs and CEs
    ○ At 1k CEs, this became unsustainable
    ○ Currently there is one Prometheus per RE (federating the Prometheus instances of its connected CEs) with remote write (RW) to Cortex
    ● Elasticsearch clusters and logs
    ○ Decentralized logging architecture
    ○ Fluent Bit as the collector on each node forwards logs to Fluentd (the aggregator) in the RE
    ○ Elasticsearch is deployed in every RE, using remote cluster search to query logs from a single Kibana instance
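
The whitelist and blacklist filtering described above can be illustrated with a small sketch that operates on label sets. In a real deployment this would be done in the Prometheus configuration rather than in Python; `filter_series` and the sample metric and label names are assumptions for the example. The only Prometheus convention relied on is that the metric name is stored in the `__name__` label.

```python
from typing import Dict, Iterable, List, Set

Series = Dict[str, str]  # label name -> label value; "__name__" holds the metric name


def filter_series(
    series: Iterable[Series],
    metric_whitelist: Set[str],
    label_blacklist: Set[str],
) -> List[Series]:
    """Keep only whitelisted metrics and strip blacklisted labels from them."""
    kept: List[Series] = []
    for labels in series:
        if labels.get("__name__") not in metric_whitelist:
            continue  # metric is not on the whitelist, drop the whole series
        kept.append({k: v for k, v in labels.items() if k not in label_blacklist})
    return kept


sample = [
    {"__name__": "up", "site": "ce01-site", "pod_template_hash": "abc123"},
    {"__name__": "go_gc_duration_seconds", "site": "ce01-site"},
]
print(filter_series(sample, metric_whitelist={"up"}, label_blacklist={"pod_template_hash"}))
```

At 3k CEs, the reduction from about 50k to about 2k time series per CE corresponds to going from roughly 150 million to roughly 6 million series in total. One caveat when dropping labels: the remaining labels must still identify each series uniquely.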

  15. © 2019 Volterra Inc. All Rights Reserved.
    DEMO - Fleet management of CE upgrade

  16. © 2019 Volterra Inc. All Rights Reserved.
    SRE Workflow for Customer Edges
    [Workflow diagram. Elements shown: Git; SRE Daemons; Executor; Config API; Artifact storage; VP Controller; VPM on the Customer Edge. Labels on the diagram:]
    ● Git: release new version 20191117-000043
    ● Render k8s manifests with version in annotation
    ● Load configuration into Config API daemon
    ● Artifact storage paths: //.yml
      ce01-site/20191107-000042/prometheus.yml
      ce01-site/20191117-000043/prometheus.yml
    ● New version to upgrade (version: 20191117-000043)
    ● "I want upgrade"
    ● Poll for update
    ● K8s deploy
    ● Upload status of upgrade
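
A rough sketch of the "render k8s manifests with version in annotation" step from this workflow. The annotation key `example.com/release-version` is hypothetical, the path layout is inferred only from the two example paths on the slide, and manifests are modelled as plain dictionaries rather than rendered YAML.

```python
import copy
from typing import Any, Dict, Optional

# Hypothetical annotation key; the key actually used is not shown in the slides.
VERSION_ANNOTATION = "example.com/release-version"


def manifest_path(site: str, version: str, component: str) -> str:
    """Build a storage path in the style of ce01-site/20191117-000043/prometheus.yml."""
    return f"{site}/{version}/{component}.yml"


def with_version(manifest: Dict[str, Any], version: str) -> Dict[str, Any]:
    """Return a copy of the manifest with the release version set as an annotation."""
    rendered = copy.deepcopy(manifest)
    annotations = rendered.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[VERSION_ANNOTATION] = version
    return rendered


def deployed_version(manifest: Dict[str, Any]) -> Optional[str]:
    """Read the release version back from a manifest, or None if it is not annotated."""
    return manifest.get("metadata", {}).get("annotations", {}).get(VERSION_ANNOTATION)


base = {"apiVersion": "v1", "kind": "ConfigMap", "metadata": {"name": "prometheus"}}
rendered = with_version(base, "20191117-000043")
print(manifest_path("ce01-site", "20191117-000043", "prometheus"))
print(deployed_version(rendered))
```

Carrying the version inside the manifest lets the site report which release is actually deployed, so the uploaded upgrade status can be compared against the version released in Git.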

  17. © 2019 Volterra Inc. All Rights Reserved.
    Contacts
    @JakubPav
    @Volterra_
    https://medium.com/volterra-io
    https://gitlab.com/volterra.io
