Creating a kubernetes distribution (DevOpsLisbon)

drequena
November 23, 2023

These are the slides from the DevOpsLisbon presentation.

Here I talk about how we created a Kubernetes distribution (as in a Linux distribution) to solve our scaling problem when installing machinery apps (the apps that make a Kubernetes cluster production ready).

Transcript

  1. By the end of this presentation you will know one more way to manage Kubernetes application setups in a standardized, repeatable and scalable way.
  2. Agenda
     • $ whoami
     • Acknowledgements
     • Vocabulary
     • The problem
     • Our kubernetes distro
     • Links and references
  3. Whoami? Daniel Requena
     • Dad and husband
     • Bachelor of Comp. Science / Master's in Comp. Engineering
     • 20+ years in IT (Sysadmin/DevOps/SRE/…)
     • Occasional speaker and podcast participant
     • Currently: SRE @ iFood * 🤔
     • Focus: Kubernetes / Service Mesh
     • Living in Lisbon: 2.5 years
  4. iFood?
     • Food tech company based in Brazil
     • Employees: ~5300
     • Tech team: ~2000
     • Orders: +70,000,000 per month
     • Requests: 250k rps
     • Services: +3000
     • Deployments: +300 per day
     • Kubernetes:
       ◦ +98% of services
       ◦ +53 clusters
  5. Acknowledgements: Rodrigo Watanabe, Gabriel Tiossi, João Marques, Henrique Dalssaso, Carlos Motta, Thiago de Francisco (Nullock), Thales Lopes, Henrique Duran
  6. Vocabulary
     Kubernetes distribution:
     • EKS, GKE, AKS, kubeadm, Cluster-API, Kops, etc…
     Distribution (in this context):
     • A way to standardize the setup and customization of packages. A tested set of software packages that, when installed, provides all (or almost all) of the functionality you need.
  7. The Problem
     Kubernetes project - 2018
     • 4 clusters (1 dev, 3 BUs)
       ◦ Few apps, few tools
     Kubernetes project - 2019
     • 11 clusters (1 dev, 10 BUs)
       ◦ ~20% of apps, dozens of tools
     Diagram labels: DEV, BU-1, BU-2, BU-3
  8. The Problem
     • Logs
     • Scaling (horizontal and vertical)
     • Monitoring + Grafana + Alerting
     • Ingress
     • CNI
     • Security
     • Policies
     • Chaos
     • Consul agent
     • External DNS
     • Kiam
     • Secrets Manager
     • Backup
     • Auth
     • Cost management
     • CertManager
     • External Controller
     • Service Mesh
     • Open Telemetry
     • Custom controllers/Operators
  9. The Problem (how did we use to set up machinery apps?)
     tools … 01-nginx.sh 02-prometheus.sh 03-grafana.sh 04-certmanager.sh 05-accounts.sh
     So many problems:
     • Values updates
     • Reconciliation
     • Same package, multiple installs
     • Lack of customizations
     • Not testable (k8s upgrades and package upgrades)
     • Slow (manual)
     • Centralized (k8s team use only)
     • Scalability
     • …
  10. The Problem
      • BU-1: Sandbox account (Sandbox cluster), Production account (Production cluster)
      • BU-2: Sandbox account (Sandbox cluster), Production account (Production cluster)
      • BU-3: Sandbox account (Sandbox cluster), Production account (Production cluster)
      • BU-4: Sandbox account (Sandbox cluster), Production account (Production cluster)
      • …
  11. The Solution
      Our requirements:
      • 100% based on Helm
      • Standardized
      • Simple (1 command, preferably)
      • Flexible
      • Extensible
      • Testable
      • Git flow oriented
      • Scalable
  12. A source of inspiration
      Linux distro strengths:
      • Package management * (Helm)
      • Stable (tests)
      • Standardized (interface)
      • Life cycle / Releases
      • Extensible (3rd party packages)
      • Community oriented
      Revolves around a “central point”:
      • Linux Kernel / (Kubernetes)
  13. Schematics (diagram): Distribution, Default Values, and three sets of package values:
      • Version: 1.0, Mem: 1G, Cpu: 100m, Labels: A=1
      • Version: 1.3, Mem: 2G, Hpa: min 3
      • Version: 2.0, PVC: 10G, NodeSelector: Infra
  14. Schematics (diagram): same content as slide 13.
  15. Schematics (diagram): same content as slide 13, plus the values PVC: 15G, Mem: 3G (shown twice).
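
The layering in slides 13 to 15 can be sketched in a few lines of Python. This is only an illustration, assuming the extra `PVC: 15G` / `Mem: 3G` values on slide 15 are cluster-level overrides applied on top of a package's defaults; the dictionary shapes and the `merge_values` helper are hypothetical, not the deck's actual files. It follows the usual Helm-style rule that nested maps merge while scalars and lists are replaced.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied on top.

    Nested maps are merged key by key; scalars and lists are replaced.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical package defaults, shaped after the third package on the slides.
package_defaults = {
    "version": "2.0",
    "resources": {"pvc": "10G"},
    "nodeSelector": "Infra",
}

# Hypothetical cluster-level overrides (the PVC: 15G / Mem: 3G values of slide 15).
cluster_overrides = {"resources": {"pvc": "15G", "mem": "3G"}}

effective = merge_values(package_defaults, cluster_overrides)
print(effective)
```

The defaults stay untouched, so the same package definition can be reused by every cluster while each cluster keeps only its own differences.
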
  16. Schematics (diagram): Distribution and Default Values, with packages grouped as Core Team Packages, Sec Team Packages, Ingress Team Packages, …
      • Version: 2.0, PVC: 10G, NodeSelector: Infra
      • Version: 1.0, Mem: 1G, Cpu: 100m, Labels: A=1
  17. Schematics (diagram): Distribution Stable and Default Values, with Core Team Packages, Sec Team Packages, Ingress Team Packages, …
      • Version: 2.0, PVC: 10G, NodeSelector: Infra
      • Version: 1.0, Mem: 1G, Cpu: 100m, Labels: A=1
  18. Schematics (diagram): Distribution Edge and Default Values, with Core Team Packages, Sec Team Packages, Ingress Team Packages, …
      • Version: 3.0, PVC: 10G, NodeSelector: Infra
      • Version: 1.5, Mem: 1G, Cpu: 100m, Labels: A=1
  19. Building the distro: Helmfile to the rescue…
      • What is Helmfile?
        ◦ TL;DR - A Helm “wrapper” on steroids, or a Chart of Charts.
      • Some characteristics
        ◦ Composable files
        ◦ Extended templating - Sprig (including in the VALUES file)
        ◦ DOESN’T create a state file on its own.
        ◦ Separation between logic and values
        ◦ Multiple reference types (s3, git, local, OCI…)
        ◦ Transforms almost everything into a Helm release
        ◦ Defines dependencies between releases ⚠
      • jsonpatch after manifests are rendered
      • Multiple environments
      • Secrets backends integration
      • Hooks
      • Extra metadata / selectors
  20. Building the distro
      $ cat helmfile.yaml
      repositories:
      - name: bitnami
        url: https://charts.bitnami.com/bitnami
      - name: custom
        url: git+https://github.com/reactiveops/polaris@deploy/helm?ref=master
      releases:
      - name: external-dns
        namespace: machinery
        chart: bitnami/external-dns
        version: 3.2.6
        values:
          - values/default.yaml
      - name: reactiveops
        chart: custom/reactiveops
        values:
          - image:
              tag: 1.4
          - scheme: {{ env "SCHEME" | default "https" }}
      $ helmfile apply
  21. Building the distro
      $ cat meta-helmfile.yaml
      helmfiles:
      - path: git::https://github.com/drequena/grafana-package.git@/helmfile.yaml?ref=main
        values:
          - grafana:
              enabled: true
              resources:
                request:
                  cpu: "100m"
      - path: git::https://github.com/drequena/prometheus-package.git@/helmfile.yaml?ref=1.0.1
        values:
          - prometheus:
              enabled: true
              labels:
                - owner: "secteam"
      ...
  22. Building the distro (Cluster)
      $ cat myclusters/sales.yaml
      helmfiles:
      - path: git::https://github.com/drequena/distribution.git@/helmfile.yaml?ref=1.0
        values:
          - prometheus:
              enabled: false
          - grafana:
              url: "graphs.company.net"
      - path: git::https://github.com/drequena/cert-manager-package.git@/helmfile.yaml?ref=1.3
        values:
          - cert-manager:
              resources:
                request:
                  cpu: 250m
  23. Building the distro (Distro)
      $ cat distribution/helmfile.yaml
      helmfiles:
      - path: git::https://github.com/drequena/grafana-package.git@/helmfile.yaml?ref=main
        values:
          - <optional-distribution-default-values>
          - {{ .Values | get "grafana" dict | toYaml | indent 6 | trim}}
      - path: git::https://github.com/drequena/prometheus-package.git@/helmfile.yaml?ref=1.0.1
        values:
          - {{ .Values | get "prometheus" dict | toYaml | indent 6 | trim}}
  24. Building the distro (Package)
      $ cat packages/prometheus/helmfile.yaml
      repositories:
      - name: prometheus
        url: https://charts.prometheus.org/prometheus
      releases:
      - name: prometheus
        condition: prometheus.enabled
        needs:
          - prometheus-operator
        namespace: monitoring
        chart: prometheus/prometheus
        version: 2.49.1
        values:
          - values/default.yaml
          - {{ .Values | get "prometheus" dict | toYaml | indent 6 | trim}}
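
The package helmfile on slide 24 combines two ideas: a `condition` that switches a release on or off from values, and `needs` that orders releases. A minimal Python sketch of that logic follows. It is not how Helmfile is implemented; the `lookup` and `install_order` helpers and the standalone `prometheus-operator` release entry are hypothetical, and raising an error when a needed release is disabled is a choice made for this sketch only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def lookup(values: dict, dotted_path: str):
    """Follow a dotted path such as 'prometheus.enabled' through nested dicts."""
    node = values
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def install_order(releases: list, values: dict) -> list:
    """Drop releases whose condition is falsy, then put needed releases first."""
    enabled = {
        release["name"]: release
        for release in releases
        if "condition" not in release or bool(lookup(values, release["condition"]))
    }
    graph = {}
    for name, release in enabled.items():
        needs = release.get("needs", [])
        missing = [needed for needed in needs if needed not in enabled]
        if missing:
            # Sketch-level choice: fail loudly rather than guess what to do.
            raise ValueError(f"{name} needs disabled or unknown releases: {missing}")
        graph[name] = set(needs)
    return list(TopologicalSorter(graph).static_order())


releases = [
    {"name": "prometheus", "condition": "prometheus.enabled", "needs": ["prometheus-operator"]},
    {"name": "prometheus-operator"},  # hypothetical entry; the slide only names it under needs
]

print(install_order(releases, {"prometheus": {"enabled": True}}))
print(install_order(releases, {"prometheus": {"enabled": False}}))
```

With `prometheus.enabled` set to true the operator is ordered before Prometheus; with it set to false only the operator remains.
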
  25. Cluster workflow: Cluster (sandbox)
      commit:
        values:
          - prometheus:
              resources:
                request:
                  mem: 2G
      $ helmfile diff
        [INFO ] + status:
        [INFO ] + prometheus::
        [INFO ] + resources:
        [INFO ] + request:
        [INFO ] - memory: 1G
        [INFO ] + memory: 2G
        [WARN ] at least one change was identified
      $ helmfile apply
        [INFO ] + status:
        [INFO ] + prometheus::
        [INFO ] + resources:
        [INFO ] + request:
        [INFO ] - memory: 1G
        [INFO ] + memory: 2G
        [INFO ] New release applied
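
The review step in this workflow (look at the diff, then apply) can be illustrated at the values level with a short Python sketch. This is a deliberate simplification: `helmfile diff` compares rendered releases, not raw values, and the `flatten` and `diff_values` helpers below are hypothetical names used only for illustration.

```python
_MISSING = object()  # sentinel so that an explicit None still counts as a value


def flatten(values: dict, prefix: str = "") -> dict:
    """Turn nested dicts into a flat {dotted.path: leaf_value} mapping."""
    flat = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_values(current: dict, desired: dict) -> list:
    """List removed ('-') and added ('+') leaf values between two value trees."""
    before, after = flatten(current), flatten(desired)
    lines = []
    for path in sorted(before.keys() | after.keys()):
        old = before.get(path, _MISSING)
        new = after.get(path, _MISSING)
        if old == new:
            continue
        if old is not _MISSING:
            lines.append(f"- {path}: {old}")
        if new is not _MISSING:
            lines.append(f"+ {path}: {new}")
    return lines


current = {"prometheus": {"resources": {"request": {"memory": "1G"}}}}
desired = {"prometheus": {"resources": {"request": {"memory": "2G"}}}}

for line in diff_values(current, desired):
    print(line)
```

An empty result means there is nothing to apply, which is the property that makes a diff-then-apply loop safe to run repeatedly.
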
  26. Distro workflow: Distribution (Edge), Clusters
      commit:
        helmfile:
        - path: git::https://…
          values:
            - metricserver:
                custom: "newvalue"
      $ helmfile diff
        [INFO ] + status:
        [INFO ] + metricserver::
        [INFO ] + labels:
        [INFO ] + custom: "newvalue"
        [WARN] at least one change was identified
      $ helmfile apply
        [INFO ] + status:
        [INFO ] + metricserver::
        [INFO ] + labels:
        [INFO ] + custom: "newvalue"
        [INFO ] New release applied
  27. Package workflow: Package, Distribution, Cluster
      commit:
        releases:
        - name: prometheus
          chart: prometheus/prometheus
          version: 2.38.0
      Tag: 1.1.3
      commit:
        helmfile:
        - path: git::https://…?ref=1.1.3
      Tag: 2.0.1
      commit:
        helmfile:
        - path: git::https://…?ref=2.0.1
      $ helmfile diff
      $ helmfile apply
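
Each step in this workflow ends with a commit that moves a `ref` to a newly created tag. A small Python sketch of that bump is shown below. The `pin_ref` helper and the example URL are hypothetical (the real paths are elided on the slide), and the sketch only assumes the `?ref=<tag>` form used in the helmfiles on the earlier slides.

```python
import re

# Matches the value of a "ref" query parameter, e.g. "1.0.1" in "...?ref=1.0.1".
_REF_VALUE = re.compile(r"(?<=[?&]ref=)[^&]+")


def pin_ref(path: str, new_ref: str) -> str:
    """Return path with its ref query parameter set to new_ref."""
    # A function replacement avoids any backslash interpretation in new_ref.
    updated, count = _REF_VALUE.subn(lambda _match: new_ref, path, count=1)
    if count == 0:
        raise ValueError(f"no ref parameter found in {path!r}")
    return updated


# Hypothetical package path, for illustration only.
package_path = "git::https://example.com/org/prometheus-package.git@/helmfile.yaml?ref=1.0.1"

print(pin_ref(package_path, "1.1.3"))
```

Because the change is a one-line edit to a pinned reference, it reviews cleanly in a pull request and can be reverted by restoring the previous tag.
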
  28. Current workflow (diagram)
      • Clusters: Sales (cluster.yaml), Logistics (cluster.yaml)
      • Distribution: helmfile.yaml
      • Team groups: Core, Sec, L7
      • Packages: nginx-ingress, certmanager, kubecost, prometheus
  29. How about tests? Our distro installs/upgrades/reconciles packages and “that's all”.
      • Helm packages are the real CORE
        ◦ One repo per package (with A LOT of governance)
          ▪ Divided by: Impact level
          ▪ vCluster (test against all supported k8s versions)
          ▪ Pluto (check k8s API deprecations)
          ▪ Terratest (unit test)
          ▪ Golang (integration tests)
          ▪ Gitlab-ci
          ▪ Semantic Release
          ▪ Renovate Bot
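
To make the API deprecation check concrete, here is a minimal Python sketch of the idea: compare each manifest's `apiVersion` and `kind` against a table of removed APIs for a target Kubernetes version. This is not Pluto's implementation or data set; the `removed_apis` helper and the sample manifests are hypothetical, and the table holds only a few well-known removals from the Kubernetes deprecation guide.

```python
# (apiVersion, kind) -> first Kubernetes release (major, minor) that no longer serves it.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
}


def parse_minor(version: str) -> tuple:
    """Turn '1.25', 'v1.25' or '1.25.3' into (1, 25)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def removed_apis(manifests: list, target_version: str) -> list:
    """Report manifests whose API is no longer served in target_version."""
    target = parse_minor(target_version)
    findings = []
    for manifest in manifests:
        api_version = manifest.get("apiVersion")
        kind = manifest.get("kind")
        removed = REMOVED_IN.get((api_version, kind))
        if removed is not None and target >= removed:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(
                f"{kind}/{name}: {api_version} removed in v{removed[0]}.{removed[1]}"
            )
    return findings


# Hypothetical rendered manifests, already parsed into dictionaries.
manifests = [
    {"apiVersion": "networking.k8s.io/v1beta1", "kind": "Ingress", "metadata": {"name": "web"}},
    {"apiVersion": "batch/v1", "kind": "CronJob", "metadata": {"name": "backup"}},
]

print(removed_apis(manifests, "1.21"))
print(removed_apis(manifests, "1.22"))
```

Running a check like this for every supported cluster version, before an upgrade, is what turns "k8s upgrades" from an untestable risk into a CI step.
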
  30. Conclusions
      The Good
      • Helm ✅
      • Flexible ✅
      • Scalable ✅
      • Extensible ✅
      • Standardized ✅
      • The concept is modular and reusable ✅
      The Bad and the Ugly
      • Dependency checks between split helmfiles ⚠ ✅
      • Simple? Tracing values can be hard ⚠
      • Low parallelism depending on the repo/distro organization ⚠
  31. Links and References
      • Blueprint repos: https://github.com/drequena/clusters
      • Helm: https://helm.sh/
      • Helmfile Docs: https://helmfile.readthedocs.io/en/latest/
      • Helmfile git: https://github.com/helmfile/helmfile
      • Terratest: https://terratest.gruntwork.io/
      • Renovatebot: https://github.com/renovatebot/renovate
      • Helm-Unit-tests: https://github.com/anikin-aa/helm-unittest
      • vCluster: https://www.vcluster.com/
      • Pluto: https://github.com/FairwindsOps/pluto
      • Semantic Release: https://github.com/semantic-release/semantic-release