Slide 1

Slide 1 text

The Story of Managing Common Add-ons on 1000+ Kubernetes Clusters
Shinya Uemura / Z Lab

Slide 2

Slide 2 text

Shinya Uemura / @uesyn
- Z Lab Corporation
- Software engineer
- Customer Success Team
- Organizing communities
  - "Kubernetes Meetup Tokyo"
  - "Kubernetes 変更内容共有会", etc.

Slide 3

Slide 3 text

Z Lab Corporation
- A wholly owned subsidiary of Yahoo Japan Corporation, established in 2015
- Research and development of infrastructure technologies
- Development of managed Kubernetes services, PaaS, etc. for Yahoo Japan Corporation
- https://zlab.co.jp

Slide 4

Slide 4 text

Agenda
- Yahoo! JAPAN's Private Kubernetes as a Service
- Management of add-ons on clusters
- Monitoring of the add-on system

Slide 5

Slide 5 text

Yahoo! JAPAN's Private Kubernetes as a Service

Slide 6

Slide 6 text

Yahoo! JAPAN's Kubernetes as a Service
- Provided in-house as a managed Kubernetes as a Service
- Creates and manages clusters on on-premises OpenStack
- Commonly known internally as ZCP
- Developed by Z Lab Corp. and operated by Yahoo Japan Corporation
- One of the largest deployments in Japan by number of clusters and nodes

Growth in the number of clusters:
- 2017/7: 5 clusters
- 2018/7: 20 clusters
- 2019/7: 400+ clusters
- 2020/5: 680+ clusters
- 2021/7: 1000+ clusters
- 2022/9: 1200+ clusters

As of 9/13/2022:
- Number of clusters: 1207
- Number of nodes (VMs): 41735
- Number of containers: 748522

Slide 7

Slide 7 text

Yahoo! JAPAN's Kubernetes as a Service
[Diagram: Kubernetes as a Service creates clusters made up of a control plane, etcd, and workers, which users then use]

Slide 8

Slide 8 text

Characteristics of Yahoo! JAPAN's Private KaaS (1/2)
- Secure by default
  - Users are not allowed to create privileged containers
- Applications with Ingress work out of the box
  - A DNS domain available for Ingress is allocated to each cluster
[Diagram: control plane, ingress, and worker; an HTTP request arriving at xxx.example.co.jp; a privileged container]

Slide 9

Slide 9 text

Characteristics of Yahoo! JAPAN's Private KaaS (2/2)
- Managed
  - Repairs nodes with failures or problems (self-healing)
  - Can upgrade clusters safely with zero downtime
  - Coordinates with monitoring services
  - Automatically backs up and restores data stores (etcd), etc.
- Scalable
  - Users can freely change the cluster size within a defined quota
Frees operators from complex Kubernetes operations

Slide 10

Slide 10 text

Cluster management in Private KaaS
- 40,000+ servers need to be operated
  - Automation is all but essential to manage them
- Operations are automated with Kubernetes controllers
  - Clusters can be managed through declarative configuration
  - "controllers are control loops that watch the state of your cluster, then make or request changes where needed."
  - ref: https://kubernetes.io/docs/concepts/architecture/controller/
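The control-loop idea quoted above can be sketched as follows. This is a minimal illustration, not the KaaS implementation: `ClusterState`, `reconcile`, and `control_loop` are hypothetical names, and the "cluster" is just an in-memory worker count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterState:
    workers: int


def reconcile(desired: ClusterState, observed: ClusterState) -> List[str]:
    """Return the actions needed to move the observed state toward the desired state."""
    diff = desired.workers - observed.workers
    if diff > 0:
        return [f"create {diff} worker(s)"]
    if diff < 0:
        return [f"delete {-diff} worker(s)"]
    return []  # already converged, nothing to do


def control_loop(desired: ClusterState, observed: ClusterState, max_rounds: int = 10) -> int:
    """Repeat observe -> diff -> act until no action is needed; return the rounds used."""
    for round_no in range(max_rounds):
        actions = reconcile(desired, observed)
        if not actions:
            return round_no
        # A real controller would call an API here; the sketch just mutates the state.
        observed.workers = desired.workers
    return max_rounds
```

The operator only edits the desired state; the loop is what closes the gap, which is why the same code can serve any number of clusters.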

Slide 11

Slide 11 text

Reference information on Yahoo! JAPAN Private KaaS
- There are many other features and techniques
- Reference links
  - https://www.slideshare.net/techblogyahoo/yahoo-japan-kubernetesasaservice
  - https://www.docswell.com/s/ydnjp/KJ1G3Z-2019-01-26-152927

Slide 12

Slide 12 text

Providing add-ons
- Some Kubernetes features require external dependencies
  - Resolving DNS names inside the Kubernetes cluster
  - Receiving L7 traffic with an Ingress Controller
  - Using the Horizontal Pod Autoscaler, etc.
- Some applications are not essential to the operation of the cluster but are desirable to run on many clusters
- Some applications are needed by KaaS to manage user clusters
  - KaaS-managed applications also run on user clusters
- These are called add-ons
  - Ex: Ingress Controller, CoreDNS, Prometheus, etc.

Slide 13

Slide 13 text

Why do we offer add-ons?
- Improved convenience
  - Allow users to focus on application development
- Security
  - Enforce our company's security policy
- Monitoring
  - KaaS monitors users' clusters
  - Users monitor applications on clusters

Slide 14

Slide 14 text

Add-ons running on user clusters
- Some add-ons are forcibly enabled and the others can be enabled at the user's discretion
- Add-ons that are forcibly enabled
  - Related to cluster monitoring and security
  - External dependencies of Kubernetes
- Add-ons that users can enable as needed
  - Ones that improve development and operational efficiency for users
  - Ones that are not required by all users but are useful
- Add-ons run on the user's Kubernetes cluster just like the user's applications
  - They run in namespaces separate from the user's applications
[Diagram: a user Kubernetes cluster with add-ons in admin namespaces and user apps in user namespaces]

Slide 15

Slide 15 text

Maintenance of add-ons
- Add-ons need to be updated as needed, not just deployed
  - Adding new features, fixing bugs, addressing vulnerabilities, etc.
- Adding and removing add-ons
  - Adding add-ons that are strongly requested by users
  - Removing add-ons that are no longer needed or impossible to maintain
- Each version of Kubernetes supports different APIs
  - We need to deploy add-on versions that support the cluster's Kubernetes version
We must maintain add-ons running on a huge number of user clusters

Slide 16

Slide 16 text

Management of add-ons running on clusters

Slide 17

Slide 17 text

Our requirements for the add-on management system
To reduce operational costs due to the large number of clusters:
- Update add-ons automatically
- Be resilient to temporary deployment errors
- Support adding and removing add-ons
To provide a high level of convenience for users/administrators:
- Enable and disable add-ons
- Customize add-ons to fit each cluster

Slide 18

Slide 18 text

Kubernetes official addon-manager
- A controller that continuously applies predefined add-on manifests
  - https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/addon-manager
- Simple implementation: a loop around the kubectl apply --prune command
  - Written in Bash
- Applies all manifests in a specific directory
  - Deploys add-ons if they do not exist
  - Keeps deployed add-ons in line with the manifests, including after the manifests change
- Runs on each cluster
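The apply-and-prune behaviour described above can be pictured with an in-memory stand-in. This sketch does not call kubectl and is not the official Bash implementation; `apply_with_prune` is a hypothetical name, and every object in `cluster` is assumed to be managed by the add-on manager (the real command limits pruning to objects it selects).

```python
from typing import Dict


def apply_with_prune(manifests: Dict[str, str], cluster: Dict[str, str]) -> Dict[str, str]:
    """Make the managed objects in `cluster` match `manifests` exactly.

    Both arguments map an object name to its spec. Objects are created or
    overwritten when they differ, and pruned when no manifest defines them.
    Returns what was done to each object that changed.
    """
    changes = {}
    for name, spec in manifests.items():
        if name not in cluster:
            changes[name] = "created"
        elif cluster[name] != spec:
            changes[name] = "updated"
        cluster[name] = spec
    for name in list(cluster):
        if name not in manifests:
            del cluster[name]
            changes[name] = "pruned"
    return changes
```

Running the function a second time with the same manifests changes nothing, which is the property that makes it safe to run in a loop.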

Slide 19

Slide 19 text

Different environment for each cluster (1/2)
- The size, number of nodes, etc. vary by cluster
  - Add-ons that monitor nodes have more items to monitor and consume more resources as the number of nodes grows
- Some add-ons demand more resources depending on usage
  - The more metrics an app exposes, the more resources metric-monitoring add-ons consume
  - Add-ons that record cluster events consume more resources the more cluster operations there are, etc.
- Some add-ons are needed by only some users
[Diagram: two clusters, each with Prometheus and apps, showing resource usage]

Slide 20

Slide 20 text

Different environment for each cluster (2/2)
- Some values differ for each cluster, such as the domain used by Ingress
  - Some add-ons expose a Web UI, etc. outside the cluster through Ingress
  - Each cluster has a different DNS domain, so the value in the Ingress resource must change accordingly
Depending on usage, we need either a mechanism that allows part of the configuration to be changed or a mechanism that automatically generates manifests suited to each cluster
[Diagram: three Kubernetes clusters with the domains shopping.example.co.jp, fashion.example.co.jp, and hobby.example.co.jp]
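Generating a manifest per cluster could look like the sketch below. It uses Python's `string.Template` purely as a stand-in for whatever templating the real system uses; the manifest name `addon-web-ui`, the `addon-ui` host prefix, and `render_ingress` are hypothetical, and the Ingress is abridged.

```python
from string import Template

# Hypothetical add-on manifest template; $cluster_domain is filled in per cluster.
INGRESS_TEMPLATE = Template("""\
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: addon-web-ui
spec:
  rules:
    # abridged: backend paths omitted
    - host: addon-ui.$cluster_domain
""")


def render_ingress(cluster_domain: str) -> str:
    """Render the template for one cluster; substitute() raises KeyError if a value is missing."""
    return INGRESS_TEMPLATE.substitute(cluster_domain=cluster_domain)
```

Calling `render_ingress("shopping.example.co.jp")` and `render_ingress("hobby.example.co.jp")` yields two manifests that differ only in the host line, so a single template can serve every cluster.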

Slide 21

Slide 21 text

zcp-addon-manager
- We developed zcp-addon-manager (abbreviated as ZAM) to meet the requirements of the private KaaS
  - The main mechanism is the same as that of the official addon-manager
- Differences from the official addon-manager
  - Implemented as a Kubernetes controller in Go
  - Ability to enable/disable add-ons
    - Cluster users can control only the add-ons they are allowed to
  - Templating of add-on manifests
    - Automatically obtains values that differ per cluster, such as the domain, and inserts them into the template
    - Provides a mechanism that lets users safely modify part of a manifest
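The enable/disable rule, where users may toggle only the add-ons they are allowed to, can be sketched like this. The sets `FORCED` and `USER_CONTROLLABLE`, their contents, and `enabled_addons` are assumptions made for illustration; the talk does not say which add-ons fall into which group or how ZAM stores the setting.

```python
from typing import Dict, Set

# Hypothetical catalogue: which add-ons are forced on and which users may toggle.
FORCED = {"coredns", "node-exporter"}
USER_CONTROLLABLE = {"dashboard", "event-recorder"}


def enabled_addons(user_settings: Dict[str, bool]) -> Set[str]:
    """Return the add-ons to deploy for a cluster.

    Forced add-ons are always included. A user setting is honoured only for
    add-ons on the allow-list; any other key is rejected.
    """
    not_allowed = set(user_settings) - USER_CONTROLLABLE
    if not_allowed:
        raise ValueError(f"users may not control: {sorted(not_allowed)}")
    chosen = {name for name, on in user_settings.items() if on}
    return FORCED | chosen
```

Rejecting unknown keys rather than silently ignoring them is a design choice of this sketch: it makes an attempt to disable a forced add-on visible to the user.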

Slide 22

Slide 22 text

Release of zcp-addon-manager (1/3)
- ZAM and add-on manifests are released as a set
  - Deployed as a single container image
  - Treated as immutable to simplify updates
[Diagram: ZAM v1.24.0 bundles coredns v1.10.0, node-exporter 1.4.0, and Prometheus v2.39.0; ZAM v1.24.1 bundles coredns v1.10.0, node-exporter 1.4.0, and Prometheus v2.39.1 (updated)]

Slide 23

Slide 23 text

Release of zcp-addon-manager (2/3)
- A release branch exists for each minor version of Kubernetes
  - Kubernetes supports different APIs in each minor version
  - Add-ons and ZAM include operations that depend on the Kubernetes API, so the released contents must change for each version
[Diagram: release branches cut from main: "For Kubernetes v1.24" with v1.24.0 to v1.24.3, and "For Kubernetes v1.25" with v1.25.0 to v1.25.2]

Slide 24

Slide 24 text

Release of zcp-addon-manager (3/3)
- Wherever possible, breaking changes are released only in minor version updates
  - Changes in how an add-on is operated, decommissioning of add-ons, changes requiring user action, etc.
  - To avoid impacting the experience of using the cluster
  - So that patch versions can be updated automatically
[Diagram: the same release branches from main (v1.24.0 to v1.24.3 for Kubernetes v1.24, v1.25.0 to v1.25.2 for Kubernetes v1.25), with a breaking change marked]
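The rule that patch releases are safe to roll out automatically while minor releases may carry breaking changes can be expressed as a small check. The helper names `parse` and `can_auto_update` are hypothetical, and the parser deliberately ignores pre-release suffixes.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Parse 'v1.24.3' into (1, 24, 3). Pre-release suffixes are not handled."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def can_auto_update(current: str, candidate: str) -> bool:
    """True when `candidate` is a newer patch release within the same minor line."""
    cur, cand = parse(current), parse(candidate)
    return cur[:2] == cand[:2] and cand[2] > cur[2]
```

Under this rule v1.24.1 to v1.24.3 qualifies, while v1.24.3 to v1.25.0 does not, because the minor line changes.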

Slide 25

Slide 25 text

Deployment management of ZAM
- The ZAM that matches the cluster's Kubernetes minor version needs to be deployed
  - Users can upgrade Kubernetes at any time
  - ZAM needs to be released in conjunction with that operation
- A suitable ZAM needs to be deployed to each of the approximately 1200 clusters
  - Sometimes only the ZAM for a specific Kubernetes minor version is released
  - In a particular cluster, for testing purposes, it may be desirable to always apply the latest ZAM
  - In some clusters, it may be desirable to temporarily stop updating ZAM
[Diagram: a cluster upgrade from Kubernetes v1.24 (ZAM v1.24.x) to Kubernetes v1.25 (ZAM v1.25.y), which brings a new add-on]

Slide 26

Slide 26 text

Automated Deployment of ZAM
- We developed zam-updater, a controller specialized for ZAM deployment
  - It detects the Kubernetes version of the cluster where it is running and deploys a suitable ZAM
  - Unlike ZAM, the same zam-updater version is deployed across all clusters
[Diagram: zam-updater runs in each cluster; on a cluster upgrade from Kubernetes v1.24 to v1.25 it replaces ZAM v1.24.x with ZAM v1.25.y, which brings a new add-on]

Slide 27

Slide 27 text

Configuration of zam-updater (1/3)
- zam-updater has 2 configs
- Version list: channels and the ZAM version applied for each
  - It must exist
- zam-updater config: configures the zam-updater behavior
  - Channel to be used
  - Whether zam-updater updates are enabled or disabled, etc.
  - If it does not exist, the default configuration is used
- zam-updater releases the version of the channel written in the config file
[Diagram: the version list maps a Kubernetes minor version and a channel to a ZAM version; the zam-updater config selects the channel]

Slide 28

Slide 28 text

Configuration of zam-updater (2/3)
[Diagram: when the version list is updated, zam-updater on a Kubernetes v1.24 cluster updates ZAM from v1.24.1 to v1.24.2, which brings a new add-on]

Slide 29

Slide 29 text

Configuration of zam-updater (3/3)
[Diagram: two Kubernetes v1.24 clusters receive the same version list; one runs ZAM v1.24.1, while the other, with a different zam-updater config, runs ZAM v1.24.3-rc.1 and gets a new add-on]
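The interplay of the two configs can be sketched as a lookup. The data layout, the channel names `stable` and `rc`, the field names in `UpdaterConfig`, and the v1.25.1 entry are assumptions; only the existence of a version list, a channel setting, an enable/disable switch, and a default config comes from the slides.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical version list: Kubernetes minor -> channel -> ZAM version.
VERSION_LIST: Dict[str, Dict[str, str]] = {
    "1.24": {"stable": "v1.24.2", "rc": "v1.24.3-rc.1"},
    "1.25": {"stable": "v1.25.1"},
}


@dataclass
class UpdaterConfig:
    channel: str = "stable"        # assumed default channel name
    updates_enabled: bool = True   # assumed switch for pausing updates


def kubernetes_minor(version: str) -> str:
    """Turn a version such as 'v1.24.7' into the minor key '1.24'."""
    m = re.match(r"v?(\d+\.\d+)", version)
    if m is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return m.group(1)


def target_zam(kubernetes_version: str, current_zam: str,
               config: Optional[UpdaterConfig] = None) -> str:
    """Decide which ZAM version a cluster should run."""
    config = config or UpdaterConfig()   # a missing config falls back to defaults
    if not config.updates_enabled:
        return current_zam               # updates paused for this cluster
    channels = VERSION_LIST[kubernetes_minor(kubernetes_version)]
    return channels[config.channel]
```

With this shape, a test cluster opts into release candidates by setting `channel="rc"`, and rolling out a new ZAM to everyone else is a one-line change to the version list.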

Slide 30

Slide 30 text

Automated deployment of zam-updater
- We developed a mechanism that continuously applies a common manifest to all clusters
  - It applies the same manifest regardless of the configuration or version of each cluster
- zam-updater and the version list are released with it
  - Updating ZAM is completed simply by updating the version list that is distributed
[Diagram: Default Manifest Deployer, using cluster discovery from Kubernetes as a Service, distributes zam-updater and the version list to Kubernetes v1.24 and v1.25 clusters]
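One pass of such a deployer might look like the sketch below. `deploy_to_all` and the injected `apply` callable are hypothetical; the point is only that the same manifest goes to every discovered cluster and that one failing cluster does not block the rest.

```python
from typing import Callable, Iterable, List


def deploy_to_all(clusters: Iterable[str], manifest: str,
                  apply: Callable[[str, str], None]) -> List[str]:
    """Apply the same manifest to every discovered cluster.

    A failure on one cluster is recorded and skipped so the others still get
    the manifest; the caller is expected to run this again on the next loop.
    Returns the clusters that failed in this pass.
    """
    failed = []
    for cluster in clusters:
        try:
            apply(cluster, manifest)
        except Exception:
            failed.append(cluster)
    return failed
```

Because the deployer is meant to run continuously, a temporary error simply means the cluster is picked up again on the next pass.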

Slide 31

Slide 31 text

Monitoring of add-on systems

Slide 32

Slide 32 text

Monitoring of add-ons and add-on systems
- Add-ons, ZAM, and zam-updater run in each cluster
  - They are not centrally controlled
- As KaaS developers/operators, we need to monitor whether they are running properly

Slide 33

Slide 33 text

Monitoring of clusters
- Prometheus runs as an add-on in each cluster
  - It monitors add-ons and user applications on the same cluster
- KaaS Prometheus scrapes critical metrics/alerts from each User Prometheus
[Diagram: KaaS Prometheus in Kubernetes as a Service uses cluster discovery and federation to scrape the User Prometheus in each user Kubernetes cluster]
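Assuming the "Federation" label refers to Prometheus's built-in federation endpoint, the scrape job on the KaaS side could be generated roughly as follows. The job name, the target addresses, the selector, and the use of `static_configs` instead of the cluster discovery shown on the slide are all assumptions of this sketch.

```python
from typing import Dict, List


def federation_job(user_prometheus_targets: List[str], selectors: List[str]) -> Dict:
    """Build a Prometheus scrape job that federates series from other Prometheus servers."""
    return {
        "job_name": "federate-user-clusters",   # hypothetical job name
        "metrics_path": "/federate",            # Prometheus federation endpoint
        "honor_labels": True,                   # keep labels set by the source server
        "params": {"match[]": selectors},       # which series to pull
        # Stand-in for cluster discovery: a fixed list of User Prometheus addresses.
        "static_configs": [{"targets": user_prometheus_targets}],
    }
```

Limiting `match[]` to a few selectors keeps the central server to the critical series only, in line with the slide's point that not everything is scraped.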

Slide 34

Slide 34 text

Visualization of usage in zcp-addon-manager
- Usage of add-ons in all clusters can be checked
  - Add-ons actually applied by ZAM
  - Which and how many add-ons are applied to each cluster
  - Applied zam-updater versions, etc.
- They are not centrally controlled, but they are monitored to check that they are healthy
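The kind of aggregation behind such a view can be sketched as a count of clusters per add-on. The input shape (cluster name mapped to its applied add-ons) and the function name `addon_usage` are assumptions; the talk does not describe how the data is collected or stored.

```python
from collections import Counter
from typing import Dict, List


def addon_usage(applied: Dict[str, List[str]]) -> Counter:
    """Count on how many clusters each add-on is applied."""
    usage = Counter()
    for addons in applied.values():
        usage.update(set(addons))   # count each add-on once per cluster
    return usage
```

The same input also answers the per-cluster question: `len(set(applied[name]))` gives the number of add-ons on one cluster.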

Slide 35

Slide 35 text

Summary
- Yahoo Japan Corporation and Z Lab Corporation are developing and operating Private KaaS
  - Approximately 1200 clusters run on Private KaaS
- Private KaaS maintains not only Kubernetes but also the common add-ons which run on all clusters
- Management of add-ons is automated with Kubernetes controllers
  - ZAM: deploys and updates add-ons in a cluster
  - zam-updater: deploys and updates ZAM in a cluster
  - Default Manifest Deployer: deploys zam-updater and its configs in each cluster
- We monitor add-ons to check that they are healthy, not just distribute them
These efforts allow us to manage add-ons on 1000+ clusters

Slide 36

Slide 36 text

Thank you