
Managed Kubernetes in Private Cloud using Rancher with more than 1000 nodes scale

LINE Developers

July 24, 2019

Transcript

  1. Managed Kubernetes in Private Cloud using Rancher with more than

    1000 nodes scale ~ Part 1: How we are using Rancher ~ LINE Corporation Yuki Nishiwaki
  2. High Level Architecture of LINE Private Cloud IaaS Region 1

    Region 2 Region 3 Identity Image DNS L4LB L7LB Block Storage Baremetal Object Storage Kubernetes VM Redis Mysql PaaS ElasticSearch FaaS Function as a Service Multiple Regions Support Scale keeps growing - 1500 HVs - 8000 Baremetals Provide different levels of abstraction PaaS
  3. Today’s Topic IaaS Region 1 Region 2 Region 3 Identity

    Image DNS L4LB L7LB Block Storage Baremetal Object Storage VM Redis Mysql PaaS ElasticSearch FaaS Function as a Service Multiple Regions Support Scale keeps growing - 1500 HVs - 8000 Baremetals Provide different levels of abstraction Kubernetes
  4. Kubernetes Cluster Performance Deployment / Update Private Cloud Collaboration Managed

    Kubernetes Mission of Our Managed Kubernetes Service For more than 2200 developers (100+ clusters) Kubernetes Operator Kubernetes Solution Architect High Availability Make an effort to keep Kubernetes Cluster stable Keep thinking How we can migrate existing application to Kubernetes
  5. Kubernetes Cluster Performance Deployment / Update Private Cloud Collaboration Managed

    Kubernetes Mission of Our Managed Kubernetes Service For more than 2200 developers (100+ clusters) Kubernetes Operator Kubernetes Solution Architect High Availability Make an effort to keep Kubernetes Cluster stable Keep thinking How we can migrate existing application to Kubernetes Where we focus on Now Where we focus on Now
  6. Architecture of Managed Kubernetes Service Kubernetes Cluster Kubernetes Cluster Kubernetes

    Cluster API Automate Operating Multiple Cluster Cluster Operation - Cluster Create - Cluster Update - Add Worker Manage Cluster - Deploy - Update - Monitor Use Cluster - Deploy application - Scale application
  7. Why we need a simple API server in front of Rancher

    Responsibility 1. Hide the Rancher API/GUI from users 2. Aggregate the Rancher API e.g. one Cluster Create API call will internally make the following Rancher API calls - POST /v3/clusters - POST /v3/nodepools (multiple times) 3. Support multiple Rancher deployments Why 1. Avoid depending strongly on Rancher 2. By limiting features and fixing the shape of user cluster deployments, reduce the risk of users configuring/using clusters in the wrong way 3. As a last resort for scale, we can support multiple Rancher deployments by putting an extra API in front of them API
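As a rough sketch (not our exact implementation), one cluster-create request to our API fans out into Rancher v3 API calls along these lines; the endpoint, token, IDs and payload fields below are placeholders based on the Rancher v3 API:

# Sketch of the fan-out behind one "Cluster Create" call (all values are placeholders)
RANCHER=https://rancher.example.com
TOKEN='token-xxxxx:secret'          # Rancher API token (placeholder)

# 1) Create the cluster object
curl -sk -u "$TOKEN" -H 'Content-Type: application/json' \
  -X POST "$RANCHER/v3/clusters" \
  -d '{"name":"team-a","rancherKubernetesEngineConfig":{}}'

# 2) Create one node pool per role (called multiple times: etcd / controlplane / worker)
curl -sk -u "$TOKEN" -H 'Content-Type: application/json' \
  -X POST "$RANCHER/v3/nodepools" \
  -d '{"clusterId":"c-xxxxx","nodeTemplateId":"nt-xxxxx","hostnamePrefix":"team-a-worker","quantity":4,"worker":true}'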
  8. Architecture of Managed Kubernetes Service Kubernetes Cluster Kubernetes Cluster Kubernetes

    Cluster Automate Operating Multiple Cluster Cluster Operation - Cluster Create - Cluster Update - Add Worker Manage Cluster - Deploy - Update - Monitor Use Cluster - Deploy application - Scale application API
  9. Rancher is our core management functionality What is Rancher? • OSS

    tool developed by Rancher Labs • Implemented based on Kubernetes (uses CRDs and client-go heavily) • Provides multi-cluster management functionality Responsibility in our Managed Kubernetes • Provision • Update • Keep Kubernetes Cluster Available/Healthy ◦ Monitoring ◦ Log Collecting ◦ Periodic Etcd Backup...
  10. Rancher 2.X architecture API Controller Kubernetes Cluster Kubernetes Cluster Cluster

    Agent Node Agent Node Agent Node Agent Node Agent Kubernetes Cluster Cluster Agent Node Agent Node Agent Node Agent Node Agent Rancher Server needs to run on Kubernetes Rancher Server can be divided into an “API part” and a “Controller part” Each Kubernetes Cluster managed by Rancher needs to run a “Cluster Agent” and “Node Agents” Websocket Websocket 1 2 3 4
  11. Rancher 2.X made use of Kubernetes Ecosystem 1. Use Kubernetes

    CRD as a Data Store 2. Implement logic as a controller by using informer, workqueue from client-go 3. Use ConfigMap based leader election from client-go 4. Use endpoints resource for Rancher Server Discovery 5. Use Kubernetes rolebinding, role for API Authorization
  12. Rancher 2.X made use of Kubernetes Ecosystem 1. Use Kubernetes

    CRD as a Data Store 2. Implement logic as a controller by using informer, workqueue from client-go 3. Use ConfigMap based leader election from client-go 4. Use endpoints resource for Rancher Server Discovery 5. Use Kubernetes rolebinding, role for API Authorization
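A quick way to see this from the cluster Rancher itself runs on (a sketch; the leader-election ConfigMap name is an assumption and may differ between Rancher versions):

# CRDs Rancher registers as its data store
kubectl get crd | grep cattle.io

# Custom resources backing clusters and nodes
kubectl get clusters.management.cattle.io
kubectl get nodes.management.cattle.io --all-namespaces

# client-go ConfigMap-based leader election lock (ConfigMap name assumed)
kubectl -n kube-system get configmap cattle-controllers \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'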
  13. Implement all logic as Kubernetes controllers API Controller ClusterA

    Watch Kubernetes Cluster Cluster Agent Node Agent NodeA NodeB CRD ・・・ Use CRDs (Custom Resource Definitions) to store Cluster, Node and User information…. The Rancher API just creates Kubernetes custom resources (a kind of API proxy) When the controller detects a new cluster resource, it does the provisioning
  14. Custom Resource Definition(CRD) in Kubernetes? Kubernetes Native Resource Type Custom

    Resource Type CustomResourceDefinition ConfigMap Pod Nginx App A Nginx Config Cluster Node Cluster Node Cluster A Cluster B Node A Node B Kubernetes allows users to create custom resource types in addition to the natively supported resources.
  15. Example of CRD for Rancher Resource: Cluster

    CRD for Cluster:
    > kubectl get crd clusters.management.cattle.io -o yaml
    apiVersion: apiextensions.k8s.io/v1beta1
    kind: CustomResourceDefinition
    metadata:
      creationTimestamp: 2018-10-26T13:49:37Z
      generation: 1
      name: clusters.management.cattle.io
      resourceVersion: "1278"
      selfLink: /apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/clusters.management.cattle.io
      uid: fa628204-d925-11e8-b840-fa163e305e2c
    spec:
      group: management.cattle.io
      names:
        kind: Cluster
        listKind: ClusterList
        plural: clusters
        singular: cluster
      scope: Cluster
      version: v3

    Cluster Resource:
    > kubectl get cluster
    NAME      AGE
    local     1d
  16. How we use Rancher? 1. Deploy Kubernetes with OpenStack 2.

    Easy to heal by replacing 3. Etcd Periodic backup 4. Basic Monitoring for Cluster
  17. 4 types of methods to deploy k8s from Rancher (1/2)

    1.Allocate Server 2.Install Docker 3.Register Node 4.Build Etcd, K8s 5.Deploy Rancher Agent Rancher Scope Rancher Scope Rancher Scope Rancher Scope Out of Rancher Scope Out of Rancher Scope Import Existing Cluster In a hosted Kubernetes provider Out of Rancher Scope From my own existing nodes From nodes in an infrastructure provider
  18. 4 types of methods to deploy k8s from Rancher (2/2)

    1.Allocate Server 2.Install Docker 3.Register Node 4.Build Etcd, K8s 5.Deploy Rancher Agent Rancher Scope Rancher Scope Rancher Scope Rancher Scope Out of Rancher Scope Out of Rancher Scope Import Existing Cluster In a hosted Kubernetes provider Out of Rancher Scope From my own existing nodes From nodes in an infrastructure provider Use 2 different ways in LINE Import Driver (From my own existing nodes) OpenStack Driver (From nodes in an infrastructure provider)
  19. Use OpenStack Node Driver for most cases Automate Operating

    Multiple Cluster Web Application Dev Team A Machine Learning Team A Web Application Dev Team B Kubernetes Cluster OpenStack VMs Kubernetes Cluster OpenStack VMs Kubernetes Cluster OpenStack VMs Etcd Controlplane Etcd Controlplane Worker 1. Create VM by using OpenStack Driver 2. Run Rancher Agent GPU Server GPU Server GPU Server Worker docker-machine Cluster Agent Node Agent 1 2
  20. Use OpenStack Node Driver for most cases Automate Operating

    Multiple Cluster Web Application Dev Team A Machine Learning Team A Web Application Dev Team B Kubernetes Cluster OpenStack VMs Kubernetes Cluster OpenStack VMs Kubernetes Cluster OpenStack VMs We wanted to use our own GPU servers, which are not maintained by the Private Cloud Etcd Controlplane Etcd Controlplane Worker 1. Create VM by using OpenStack Driver 2. Run Rancher Agent GPU Server GPU Server GPU Server Worker docker-machine Cluster Agent Node Agent 1 2
  21. Use Import Driver for users who have special servers Automate

    Operating Multiple Cluster Web Application Dev Team A Machine Learning Team A Web Application Dev Team B Kubernetes Cluster OpenStack VMs Kubernetes Cluster OpenStack VMs Kubernetes Cluster OpenStack VMs GPU Server GPU Server GPU Server Worker Worker Worker Etcd Controlplane Etcd Controlplane Worker Cluster Agent Node Agent 1. Create VM by using OpenStack Driver 2. Run Rancher Agent Node Agent 2 Allow to import only as a worker sudo docker run -d --privileged --restart=unless-stopped --net=host \ -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.1.5 \ --server <Rancher Server> --token <Token> --ca-checksum <CA Checksum> \ --worker
  22. Use Import Driver for users who have special servers Automate

    Operating Multiple Cluster Web Application Dev Team A Machine Learning Team A Web Application Dev Team B Kubernetes Cluster OpenStack VMs Kubernetes Cluster OpenStack VMs Kubernetes Cluster OpenStack VMs GPU Server GPU Server GPU Server Worker Worker Worker Etcd Controlplane Etcd Controlplane Worker Cluster Agent Node Agent 1. Create VM by using OpenStack Driver 2. Run Rancher Agent Node Agent 2 Allow importing only as a worker Note: In the Rancher GUI we cannot mix 2 different methodologies (cloud providers) to install k8s, like the OpenStack and Import drivers, or the AWS and OpenStack drivers… But the Rancher Server implementation doesn’t restrict users from doing it. That’s why, if you call the API or perform the expected procedure yourself, you can mix them.
  23. Summary: How we deployed Kubernetes with Rancher

    Diagram of our deployment: dedicated etcd nodes, controller nodes running kube-apiserver, kube-controller-manager and kube-scheduler, and N worker nodes running kubelet and kube-proxy, built either from the OpenStack driver or via Import.
  24. How we use Rancher? 1. Deploy Kubernetes with OpenStack 2.

    Easy to heal by replacing 3. Etcd Periodic backup 4. Basic Monitoring for Cluster
  25. As the number of nodes increased, more nodes got broken... Kubernetes Cluster

    OpenStack VMs VM1 VM2 VM3 VM4 Hypervisor Failure Dockerd Bug... VM1 VM2 VM3 VM4 VM A VM B VM C
  26. As the number of nodes increased, more nodes got broken... Kubernetes Cluster

    OpenStack VMs VM1 VM2 VM3 VM4 Hypervisor Failure Dockerd Bug... VM1 VM2 VM3 VM4 VM A VM B VM C Let’s replace broken nodes with healthy nodes to keep providing enough compute resources
  27. Delete Node on GUI when a Node gets broken Kubernetes Cluster

    OpenStack VMs VM1 VM1 VM2 VM3 VM4 VM2 VM3 VM4 Automate Operating Multiple Cluster Delete VM3 docker-machine docker-machine rm VM3 Node Controller NodePool Controller Kind: Node VM3 Kind: NodePool NodePool1 Delete Node VM3 Detect Node VM3 Deleted Doing cleanup and remove finalizer
  28. NodePool Controller in Rancher will re-create Node Kubernetes Cluster OpenStack

    VMs VM1 VM1 VM2 VM3 VM4 VM2 VM3 VM4 Automate Operating Multiple Cluster docker-machine Node Controller NodePool Controller Compare The number of Node = 3 VS Quantity of NodePool = 4 Kind: Node VM3 Kind: NodePool NodePool1 Kind: Node VM3 Re-Create VM3 Node ... Spec: nodeTemplateName: XX quantity: 4 ...
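For illustration, the same reconciliation can be inspected or triggered with kubectl against the management cluster; the namespace (the cluster ID) and resource names below are placeholders, so treat this as a sketch rather than a documented workflow:

# NodePool resources live in the cluster's namespace in Rancher's management cluster
kubectl -n c-xxxxx get nodepools.management.cattle.io

# Raising spec.quantity makes the NodePool controller create nodes until the number
# of Node resources matches, just as deleting a broken node does
kubectl -n c-xxxxx patch nodepools.management.cattle.io np-xxxxx \
  --type=merge -p '{"spec":{"quantity":5}}'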
  29. Node Controller provisions new Node Kubernetes Cluster OpenStack VMs VM1

    VM1 VM2 VM3 VM4 VM2 VM4 Automate Operating Multiple Cluster docker-machine docker-machine create VM3 Node Controller NodePool Controller Kind: Node VM3 Kind: NodePool NodePool1 Detect new Node VM3 VM3 VM3 New
  30. Finally, install k8s and make the new node join the cluster Kubernetes

    Cluster OpenStack VMs VM1 VM1 VM2 VM3 VM4 VM2 VM4 Automate Operating Multiple Cluster Node Controller NodePool Controller Kind: Node VM3 Kind: NodePool NodePool1 After Node Provisioning Finished (Check Condition Fields), Run RKE VM3 VM3 New Cluster Provisioner Install/Update Kubernetes
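Conceptually, the Cluster Provisioner step is equivalent to running RKE against the cluster's node list (Rancher embeds RKE as a library rather than shelling out; this is only a sketch):

# Standalone RKE equivalent of the "Install/Update Kubernetes" step
rke up --config cluster.yml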
  31. How we use Rancher? 1. Deploy Kubernetes with OpenStack 2.

    Easy to heal by replacing 3. Etcd Periodic backup 4. Basic Monitoring for Cluster
  32. HA deployment (multiple controlplane) is not always... etcd etcd etcd

    14:30 14:40 etcd etcd etcd × × × There is always a risk of etcd data being lost - Crash of multiple etcd nodes - Accidentally deleted data (Human Error) Kubernetes Cluster 1 Kubernetes Cluster 1
  33. Enable Periodic Backup for all clusters

    Create a bucket for each cluster:
    s3_backup_config:
      access_key: "<access key>"
      bucket_name: "cluster1-bucket"
      endpoint: "<S3 API Endpoint>"
      region: "us-east-1"
  34. How Periodic Backup works (1/2) etcd etcd etcd 14:30 Kubernetes

    Cluster 1 Object Storage (Ceph) cluster1-bucket cluster2-bucket etcd Automate Operating Multiple Cluster Kind: EtcdBackup EtcdBackup1430 Kind: Cluster Kubernetes Cluster 1 Kind: Cluster Kubernetes Cluster 1 Kind: Cluster Kubernetes Cluster 1 Kind: EtcdBackup EtcdBackup1430 Kind: EtcdBackup EtcdBackup1430 EtcdBackup Controller Check Backup Periodically Main Logic goroutine Check all clusters every 5 min - Whether backup is enabled or not - When the last backup was taken Create EtcdBackup Resource Represents an actual snapshot
  35. How Periodic Backup works (2/2) etcd etcd etcd 14:30 Kubernetes

    Cluster 1 Object Storage (Ceph) cluster1-bucket cluster2-bucket etcd rke-etcd-backup Automate Operating Multiple Cluster Kind: EtcdBackup EtcdBackup1430 Kind: Cluster Kubernetes Cluster 1 Kind: Cluster Kubernetes Cluster 1 Kind: Cluster Kubernetes Cluster 1 Kind: EtcdBackup EtcdBackup1430 Kind: EtcdBackup EtcdBackup1430 EtcdBackup Controller Check Backup Periodically Main Logic goroutine Detect New EtcdBackup Run one-shot container To take snapshot Upload to Object Storage
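The EtcdBackup objects the controller creates can be listed from the management cluster; the namespace (the cluster ID) below is a placeholder:

# Each EtcdBackup custom resource represents one snapshot in the bucket
kubectl -n c-xxxxx get etcdbackups.management.cattle.io

# Inspect where a particular snapshot was uploaded
kubectl -n c-xxxxx get etcdbackups.management.cattle.io EtcdBackup1430 -o yaml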
  36. If we enable Periodic Backup, we can restore ! etcd

    etcd etcd 14:30 14:40 etcd etcd etcd × × × Kubernetes Cluster 1 Kubernetes Cluster 1 etcd etcd etcd Kubernetes Cluster 1 15:00 Object Storage (Ceph) cluster1-bucket cluster2-bucket Restore from snapshot Download
  37. Restore operation will be finished by 1 API Call Automate

    Operating Multiple Cluster Kind: Cluster Kubernetes Cluster 1 Kind: EtcdBackup EtcdBackup1430 Cluster Provisioner ... Spec: rancherKubernetesEngineConfig: restore: true snapshotName: EtcdBackup1430 ... Object Storage (Ceph) cluster1-bucket cluster2-bucket Restore from EtcdBackup1430 for Cluster 1 etcd etcd etcd Kubernetes Cluster 1 15:00 rke-etcd-backup Detect Cluster Change Run one-shot container to download snapshot Rancher API Update Cluster
  38. Restore operation will be finished by 1 API Call Automate

    Operating Multiple Cluster Kind: Cluster Kubernetes Cluster 1 Kind: EtcdBackup EtcdBackup1430 Cluster Provisioner ... Spec: rancherKubernetesEngineConfig: restore: true snapshotName: EtcdBackup1430 ... Object Storage (Ceph) cluster1-bucket cluster2-bucket etcd etcd etcd Kubernetes Cluster 1 15:00 etcdctl snapshot restore Run one-shot container to restore with snapshot
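A hedged sketch of that single API call; in practice the full cluster object would be fetched first and written back with only the restore fields changed, and the field names are based on the Rancher v3 API rather than our exact code:

# Trigger a restore by updating the cluster spec (placeholders throughout)
curl -sk -u "$TOKEN" -H 'Content-Type: application/json' \
  -X PUT "$RANCHER/v3/clusters/c-xxxxx" \
  -d '{"rancherKubernetesEngineConfig":{"restore":{"restore":true,"snapshotName":"EtcdBackup1430"}}}'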
  39. How we use Rancher? 1. Deploy Kubernetes with OpenStack 2.

    Easy to heal by replacing 3. Etcd Periodic backup 4. Basic Monitoring for Cluster
  40. Rancher provides 2 different kinds of monitoring Kubernetes Cluster Kubernetes Cluster Kubernetes

    Cluster 1. Basic Monitoring based on Kubernetes Standard Features • Kubernetes Component Status • Kubernetes Node Condition 2. Advanced Monitoring by deploying Grafana, Prometheus • kube-apiserver, kube-scheduler, kube-XXXX metrics API • coredns, kube-dns metrics API • node-exporter • kube-state-metrics Automate Operating Multiple Cluster
  41. 1. Basic Monitoring based on Kubernetes Features Kubernetes Cluster Kubernetes

    Cluster Kubernetes Cluster Periodically Call Component Status API (/api/v1/componentstatuses) Automate Operating Multiple Cluster HealthSyncer Monitoring Kind: Cluster Cluster1 Kind: Node Node1 NodeSyncer … omit ... Status: componentStatuses: …. omit ... Update based on API Response … omit ... Status: internalNodeStatus: …. omit ... Periodically Call Node API (/api/v1/nodes)
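These are ordinary Kubernetes APIs, so the same data the HealthSyncer and NodeSyncer poll can also be checked by hand against a user cluster:

# Component health as reported by the kube-apiserver
kubectl get --raw /api/v1/componentstatuses

# Node conditions (MemoryPressure, DiskPressure, Ready, ...)
kubectl get --raw /api/v1/nodes
kubectl get nodes -o wide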
  42. 2. Advanced Monitoring with Grafana, Prometheus Kubernetes Cluster Kubernetes Cluster

    Kubernetes Cluster Check whether the cluster has Extra Monitoring enabled or not Automate Operating Multiple Cluster HealthSyncer Monitoring Kind: Cluster Cluster1 NodeSyncer Detect Cluster Change Deploy Grafana, Prometheus by Helm Chart https://github.com/rancher/system-charts
  43. Our Choice: Use Only Basic Monitoring Kubernetes Cluster Kubernetes Cluster

    Kubernetes Cluster 1. Basic Monitoring based on Kubernetes Standard Features • Kubernetes Component Status • Kubernetes Node Condition 2. Advanced Monitoring by deploying Grafana, Prometheus • kube-apiserver, kube-scheduler, kube-XXXX metrics API • coredns, kube-dns metrics API • node-exporter • kube-state-metrics Automate Operating Multiple Cluster • Use only Basic Monitoring ◦ Set alerts for Node Status, Cluster Status on the CRDs • We don’t enable Rancher’s Advanced Monitoring ◦ We have/use our own Prometheus, Grafana configuration
  44. Set Alert Resource Status updated by Basic Monitoring Automate Operating

    Multiple Cluster HealthSyncer Monitoring Kind: Cluster Cluster1 Kind: Node Node1 NodeSyncer … omit ... Status: componentStatuses: …. omit … condition: …. omit … … omit ... Status: internalNodeStatus: …. omit … condition: …. omit … Rancher State Metrics rancher_cluster_not_true_condition {cluster="c-2mf2d",condition="<Condition Name>"} {cluster="c-2mf2d",condition="NoMemoryPressure"} rancher_cluster_component_not_true_status {cluster="c-25f4n",exported_component="<Component Name>"} {cluster="c-25f4n",exported_component="controller-manager"} rancher_node_not_true_internal_condition {cluster="c-k87fq",condition="PIDPressure",node="m-2tch7"} rancher_node_not_true_codition {cluster="c-2f6gk",condition="Provisioned",node="m-kkrz6"} metrics
  45. Set Alert Resource Status updated by Basic Monitoring Automate Operating

    Multiple Cluster HealthSyncer Monitoring Kind: Cluster Cluster1 Kind: Node Node1 NodeSyncer … omit ... Status: componentStatuses: …. omit … condition: …. omit … … omit ... Status: internalNodeStatus: …. omit … condition: …. omit … Rancher State Metrics Long Node Provisioning Long Cluster Provisioning Unhealthy Component Status Unhealthy Node Condition Unhealthy Cluster Condition rancher_cluster_not_true_condition {cluster="c-2mf2d",condition="<Condition Name>"} {cluster="c-2mf2d",condition="NoMemoryPressure"} rancher_cluster_component_not_true_status {cluster="c-25f4n",exported_component="<Component Name>"} {cluster="c-25f4n",exported_component="controller-manager"} rancher_node_not_true_internal_condition {cluster="c-k87fq",condition="PIDPressure",node="m-2tch7"} rancher_node_not_true_codition {cluster="c-2f6gk",condition="Provisioned",node="m-kkrz6"} metrics Alerts Configure Alerts
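As an example of how these metrics can be wired to alerts, a Prometheus rule file might look like the following; the metric names come from the slide above, while the thresholds, durations and labels are placeholders:

# Sketch of Prometheus alerting rules over the Rancher state metrics
cat > rancher-state-alerts.yml <<'EOF'
groups:
- name: rancher-state
  rules:
  - alert: UnhealthyClusterComponent
    expr: rancher_cluster_component_not_true_status > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.exported_component }} unhealthy in cluster {{ $labels.cluster }}"
  - alert: UnhealthyNodeCondition
    expr: rancher_node_not_true_internal_condition > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} in {{ $labels.cluster }} has condition {{ $labels.condition }}"
EOF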
  46. We can easily notice something wrong quickly Even if the

    number of clusters is more than 90 Long Node Provisioning Long Cluster Provisioning Unhealthy Component Status Unhealthy Node Condition Unhealthy Cluster Condition We can notice each node’s and cluster’s status changes
  47. Does using Rancher solve everything? Automate Operating Multiple Cluster “Yes as long

    as it keeps working well” Is it easy? (・・;)。 。 。
  48. Problems We faced and “solved”

    1. NodeSelector values should always be strings (failed to deploy ingress-nginx, kube-dns, coredns when specifying "XXX.com: true" in nodeSelector; see the sketch below)
    2. Rancher Cluster Agent and Node Agent might hang when something goes wrong in the middle of the WebSocket session handshake
    3. Rancher overrides/deletes the node annotation that flannel internally uses to set up the VTEP on the host
    4. Allow configuring additional tolerations for the cluster-agent and node-agent Rancher will deploy
    5. Clusters with the RKE driver always have an error in the "transitioning" field while provisioning (master, v2.0.8)
    6. deployAgent in node-controller always succeeds even if it failed to run the container (rancher/rancher-agent)
    7. panic: "assignment to entry in nil map" when trying to create a node by calling POST /v3/nodes
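The first problem boils down to nodeSelector being a map of strings in Kubernetes, so boolean-looking label values have to be quoted; a minimal illustration (the label key and image are placeholders):

# Fails: YAML parses the unquoted value as a boolean, but nodeSelector needs strings
#   nodeSelector:
#     example.com/gpu: true
# Works: quote the value so it stays a string
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nodeselector-example
spec:
  nodeSelector:
    example.com/gpu: "true"
  containers:
  - name: app
    image: nginx
EOF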
  49. Problems We faced and “solved”

    1. NodeSelector values should always be strings (failed to deploy ingress-nginx, kube-dns, coredns when specifying "XXX.com: true" in nodeSelector)
    2. Rancher Cluster Agent and Node Agent might hang when something goes wrong in the middle of the WebSocket session handshake
    3. Rancher overrides/deletes the node annotation that flannel internally uses to set up the VTEP on the host
    4. Allow configuring additional tolerations for the cluster-agent and node-agent Rancher will deploy
    5. Clusters with the RKE driver always have an error in the "transitioning" field while provisioning (master, v2.0.8)
    6. deployAgent in node-controller always succeeds even if it failed to run the container (rancher/rancher-agent)
    7. panic: "assignment to entry in nil map" when trying to create a node by calling POST /v3/nodes
    How can we detect a problem before it becomes a serious outage? How do we troubleshoot? Where would the bottlenecks be at large scale? Where should we pay attention? How do we extend Rancher?
  50. Reached Time Limit Today

    1. NodeSelector values should always be strings (failed to deploy ingress-nginx, kube-dns, coredns when specifying "XXX.com: true" in nodeSelector)
    2. Rancher Cluster Agent and Node Agent might hang when something goes wrong in the middle of the WebSocket session handshake
    3. Rancher overrides/deletes the node annotation that flannel internally uses to set up the VTEP on the host
    4. Allow configuring additional tolerations for the cluster-agent and node-agent Rancher will deploy
    5. Clusters with the RKE driver always have an error in the "transitioning" field while provisioning (master, v2.0.8)
    6. deployAgent in node-controller always succeeds even if it failed to run the container (rancher/rancher-agent)
    7. panic: "assignment to entry in nil map" when trying to create a node by calling POST /v3/nodes
    How can we detect the problems? How do we troubleshoot? How do we grasp what is going on? (covered today) We submitted 1 CFP for KubeCon North America 2019. “If it’s accepted”, let us talk about our story in more detail. (hopefully at KubeCon)
  51. Where we are heading to Kubernetes Cluster Performance Deployment /

    Update Private Cloud Collaboration Managed Kubernetes Kubernetes Operator Kubernetes Solution Architect High Availability Make an effort to keep Kubernetes Cluster stable Keep thinking How we can migrate existing application to Kubernetes Where we focus on Next Where we focus on Next
  52. Quota for each Project & Logging Management 1. Cluster, Node

    Quota 2. Logging Management Automate Operating Multiple Cluster Project A Project B Can create only 1 cluster with 100 nodes Can create only 2 clusters with 200 nodes Database Quota Check Kubernetes Cluster Elasticsearch Maintained by another team Container Log Kubernetes Log Etcd Log Log Rotate Send Logs to Elasticsearch /etc/docker/daemon.json:
    "log-driver": "json-file",
    "log-opts": {
      "max-size": "20m",
      "max-file": "2"
    }
    API
  53. Addon Manager for Addons running on User Cluster 3. Addon

    Manager Kubernetes Cluster Addon Manager DNS Block Storage L4LB L7LB Redis Cinder CSI Provider Plugin LINE Ingress Controller LINE Type LB Implementation LINE Service Operator Verda Private Cloud Family Kubernetes Addons - Deploy Addons - Update Addons - Monitoring Addons