Slide 1

Slide 1 text

Making your Kubernetes-based log collection reliable & durable with Vector
MAKSIM NABOKIKH, Platform Lead

Slide 2

Slide 2 text

DISCLAIMER: No Kubernetes clusters were hurt during the preparation of this talk.

Slide 3

Slide 3 text

DISCLAIMER: No Kubernetes clusters were hurt during the preparation of this talk.
Just kidding: in reality, there were ple-e-e-enty of outages.

Slide 4

Slide 4 text

ABOUT PALARK
We offer all-in-one DevOps-as-a-Service and pick the best Open Source projects to fulfill our clients' goals.
16 years in Linux, DevOps & Kubernetes
70 managed Kubernetes clusters
15 awesome engineers
90 tech posts at blog.palark.com

Slide 5

Slide 5 text

PLAN
1. LOGS IN KUBERNETES: let's recall what to collect in Kubernetes
2. WHAT IS VECTOR: and how it can be applied
3. PRACTICAL USE: exciting operations (Ops) experience cases

Slide 6

Slide 6 text

LOGS IN KUBERNETES

Slide 7

Slide 7 text

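LOGS IN KUBERNETES: POD LOGS
- The log file path includes the pod namespace, pod name, pod UID, and container name: /var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container>/<N>.log
- The format and location of the files depend on the container runtime (CRI) settings
- Maximum file size and rotation depend on the kubelet settings (containerLogMaxSize, containerLogMaxFiles)
kubernetes.io/docs/concepts/cluster-administration/logging/
(Diagram: stdout and stderr of pod-1 and pod-2 end up in /var/log/pods, managed by the kubelet)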
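Because the path encodes pod identity, a log shipper can recover metadata from the file name alone. A minimal Python sketch, assuming the kubelet layout shown above; parse_pod_log_path and PodLogRef are hypothetical helpers, not part of Vector or Kubernetes. Splitting on "_" is safe because namespaces and pod names cannot contain underscores.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class PodLogRef:
    namespace: str
    pod_name: str
    pod_uid: str
    container: str
    file_name: str


def parse_pod_log_path(path: str) -> PodLogRef:
    """Split a kubelet-managed log path into its components.

    Assumes /var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container>/<N>.log
    (hypothetical helper for illustration).
    """
    parts = PurePosixPath(path).parts
    # parts looks like ('/', 'var', 'log', 'pods', '<ns>_<pod>_<uid>', '<container>', '<N>.log')
    if parts[:4] != ("/", "var", "log", "pods") or len(parts) != 7:
        raise ValueError(f"not a pod log path: {path}")
    # At most two splits: the UID itself contains only hex digits and hyphens.
    namespace, pod_name, pod_uid = parts[4].split("_", 2)
    return PodLogRef(namespace, pod_name, pod_uid, parts[5], parts[6])
```

For example, /var/log/pods/default_nginx-7c5b_50fb55c5-d97e-4851-85c6-187465154db6/nginx/0.log yields namespace "default", pod "nginx-7c5b", and container "nginx".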

Slide 8

Slide 8 text

LOGS IN KUBERNETES: NODE SERVICES
- Usually files in the /var/log directory
- Maximum size and rotation are configured by journald
- The format can be anything…
kubernetes.io/docs/concepts/cluster-administration/logging/
Examples: containerd, kubelet, audit logs, syslog

Slide 9

Slide 9 text

LOGS IN KUBERNETES: EVENTS
- Can only be collected from the Kubernetes API
- Can be collected as logs, metrics, or traces
kubernetes.io/docs/concepts/cluster-administration/logging/
apiVersion: v1
kind: Event
count: 1
metadata:
  name: standard-worker-1.178264e1185b006f
  namespace: default
reason: RegisteredNode
firstTimestamp: '2023-09-06T19:08:47Z'
lastTimestamp: '2023-09-06T19:08:47Z'
involvedObject:
  apiVersion: v1
  kind: Node
  name: standard-worker-1
  uid: 50fb55c5-d97e-4851-85c6-187465154db6
message: 'Registered Node standard-worker-1 in Controller'
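To collect an Event "as a log", it has to be flattened into a single record. A minimal Python sketch of that shaping step only, assuming the Event has already been received from the Kubernetes API (for example, through a watch); event_to_log_record and its output field names are hypothetical, not a Vector schema.

```python
from typing import Any


def event_to_log_record(event: dict[str, Any]) -> dict[str, Any]:
    """Flatten a core/v1 Event into one log record (hypothetical schema)."""
    involved = event.get("involvedObject", {})
    meta = event.get("metadata", {})
    return {
        # Prefer lastTimestamp so a repeated event (count > 1) is dated by its latest occurrence.
        "timestamp": event.get("lastTimestamp") or event.get("firstTimestamp"),
        "namespace": meta.get("namespace"),
        "reason": event.get("reason"),
        "object": f'{involved.get("kind")}/{involved.get("name")}',
        "count": event.get("count", 1),
        "message": event.get("message", ""),
    }
```

The same record could instead be turned into a metric (count per reason) or a trace span; the log form is simply the easiest to ship alongside pod logs.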

Slide 13

Slide 13 text

LOGS IN KUBERNETES
kubernetes.io/docs/concepts/cluster-administration/logging/
What can we collect?    Source
Pod logs                Files
Node services logs      Files
Events                  Kubernetes API

Slide 15

Slide 15 text

WHAT IS VECTOR

Slide 16

Slide 16 text

WHAT IS VECTOR
A lightweight, ultra-fast tool for building observability pipelines
vector.dev

Slide 18

Slide 18 text

WHAT IS VECTOR
An open source, efficient tool for building log collecting pipelines
vector.dev

Slide 19

Slide 19 text

WHAT IS VECTOR
An open source, efficient tool for building log collecting pipelines
- Vendor agnostic
- You do not need to rewrite Vector in Rust
- Performance by design and continuous benchmarking
- Flexible building block
vector.dev

Slide 20

Slide 20 text

VECTOR’S ARCHITECTURE

Slide 21

Slide 21 text

VECTOR’S ARCHITECTURE
Collect → Transform → Send
- Collect: File, K8s, Socket, … (40 in total)
- Transform: Remap, Filter, Aggregate, … (9 in total)
- Send: … (52 in total)

Slide 22

Slide 22 text

VECTOR’S ARCHITECTURE
Collect → Transform → Send
- Collect: File, K8s, Socket, … (40 in total)
- Transform: Remap, Filter, Aggregate, … (9 in total); Remap is powered by the Vector Remap Language (VRL)
- Send: … (52 in total)
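The collect → transform → send flow can be modelled as plain function composition. This is a toy Python sketch, not how Vector works internally: Vector runs components concurrently with buffers and backpressure between them, which this model ignores. run_pipeline and the Transform alias are hypothetical names.

```python
from typing import Callable, Iterable, Optional

Event = dict
# A transform returns the (possibly modified) event, or None to drop it.
Transform = Callable[[Event], Optional[Event]]


def run_pipeline(source: Iterable[Event], transforms: list[Transform],
                 sink: Callable[[Event], None]) -> int:
    """Push every event from the source through the transforms into the sink.

    Returns the number of events delivered to the sink.
    """
    delivered = 0
    for event in source:
        current: Optional[Event] = event
        for transform in transforms:
            current = transform(current)
            if current is None:  # a filter-like transform dropped the event
                break
        if current is not None:
            sink(current)
            delivered += 1
    return delivered
```

Usage: with a filter that drops `severity == "info"` and a remap that upper-cases the message, only the error event reaches the sink.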

Slide 23

Slide 23 text

VECTOR REMAP LANGUAGE

Slide 24

Slide 24 text

VECTOR REMAP LANGUAGE
[transforms.filter_severity]
type = "filter"
inputs = ["logs"]
condition = '.severity != "info"'

Slide 25

Slide 25 text

VECTOR REMAP LANGUAGE
[transforms.filter_severity]
type = "filter"
inputs = ["logs"]
condition = '.severity != "info"'

[transforms.sanitize_kubernetes_labels]
type = "remap"
inputs = ["logs"]
source = '''
if exists(.pod_labels."controller-revision-hash") {
  del(.pod_labels."controller-revision-hash")
}
if exists(.pod_labels."pod-template-hash") {
  del(.pod_labels."pod-template-hash")
}
'''

Slide 26

Slide 26 text

VECTOR REMAP LANGUAGE
[transforms.filter_severity]
type = "filter"
inputs = ["logs"]
condition = '.severity != "info"'

[transforms.sanitize_kubernetes_labels]
type = "remap"
inputs = ["logs"]
source = '''
if exists(.pod_labels."controller-revision-hash") {
  del(.pod_labels."controller-revision-hash")
}
if exists(.pod_labels."pod-template-hash") {
  del(.pod_labels."pod-template-hash")
}
'''

[transforms.backslash_multiline]
type = "reduce"
inputs = ["logs"]
group_by = ["file", "stream"]
merge_strategies."message" = "concat_newline"
ends_when = '''
matched, err = match(.message, r'[^\\]$');
if err != null {
  false;
} else {
  matched;
}
'''
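To make the semantics of the three transforms concrete, here is a rough Python equivalent. It is a sketch for illustration, not Vector's implementation: the real reduce transform also merges other fields and flushes groups after a timeout, while this model only joins messages and never flushes an unfinished group. The function names mirror the transform IDs above.

```python
import re
from typing import Iterable, Iterator

# Same regex as the VRL r'[^\\]$': the message ends with a character other than a backslash.
_ENDS_CLEANLY = re.compile(r"[^\\]$")


def filter_severity(events: Iterable[dict]) -> Iterator[dict]:
    """Keep events whose severity is not "info" (events without a severity are kept)."""
    return (e for e in events if e.get("severity") != "info")


def sanitize_kubernetes_labels(event: dict) -> dict:
    """Drop the two hash labels that change on every rollout."""
    if "pod_labels" not in event:
        return event
    labels = dict(event["pod_labels"])
    for key in ("controller-revision-hash", "pod-template-hash"):
        labels.pop(key, None)
    return {**event, "pod_labels": labels}


def backslash_multiline(events: Iterable[dict]) -> Iterator[dict]:
    """Join messages per (file, stream) with newlines until one does not end with '\\'."""
    pending: dict[tuple, list[str]] = {}
    for event in events:
        key = (event.get("file"), event.get("stream"))
        pending.setdefault(key, []).append(event["message"])
        if _ENDS_CLEANLY.search(event["message"]):  # ends_when is true: flush the group
            yield {"file": key[0], "stream": key[1], "message": "\n".join(pending.pop(key))}
```

Grouping by both file and stream matters: without it, continuation lines from stdout and stderr, or from two pods, could be glued together.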

Slide 27

Slide 27 text

LOG COLLECTING TOPOLOGIES

Slide 28

Slide 28 text

LOG COLLECTING TOPOLOGIES
Distributed: log-shippers → storage

Slide 29

Slide 29 text

LOG COLLECTING TOPOLOGIES
Distributed: log-shippers → storage
Centralized: log-shippers → aggregator → storage

Slide 30

Slide 30 text

LOG COLLECTING TOPOLOGIES
Distributed: log-shippers → storage
Centralized: log-shippers → aggregator → storage
Stream: log-shippers → queue → storage

Slide 33

Slide 33 text

VECTOR IN KUBERNETES

Slide 34

Slide 34 text

VECTOR IN KUBERNETES
github.com/deckhouse/deckhouse/blob/main/modules/460-log-shipper/templates/daemonset.yaml
(Diagram: the log-shipper pod and the node file system; paths shown: /var/log, /vector-data, /etc/vector)
- Vector – collects logs
- Reloader – validates the config and reloads Vector
- Kube RBAC proxy – protects metrics

Slide 35

Slide 35 text

VECTOR IN KUBERNETES
github.com/deckhouse/deckhouse/blob/main/modules/460-log-shipper/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
(Diagram: the log-shipper pod and the node file system; paths shown: /var/log, /vector-data, /etc/vector)
- Vector – collects logs
- Reloader – validates the config and reloads Vector
- Kube RBAC proxy – protects metrics

Slide 36

Slide 36 text

VECTOR IN KUBERNETES
github.com/deckhouse/deckhouse/blob/main/modules/460-log-shipper/templates/daemonset.yaml
Excerpts from the manifest:
apiVersion: apps/v1
kind: DaemonSet
volumes:
- name: var-log
  hostPath:
    path: /var/log/
- name: vector-data-dir
  hostPath:
    path: /mnt/vector-data
- name: localtime
  hostPath:
    path: /etc/localtime
(Diagram: the log-shipper pod and the node file system; paths shown: /var/log, /vector-data, /etc/vector)
- Vector – collects logs
- Reloader – validates the config and reloads Vector
- Kube RBAC proxy – protects metrics

Slide 37

Slide 37 text

VECTOR IN KUBERNETES
github.com/deckhouse/deckhouse/blob/main/modules/460-log-shipper/templates/daemonset.yaml
Excerpts from the manifest:
apiVersion: apps/v1
kind: DaemonSet
volumes:
- name: var-log
  hostPath:
    path: /var/log/
- name: vector-data-dir
  hostPath:
    path: /mnt/vector-data
- name: localtime
  hostPath:
    path: /etc/localtime
volumeMounts:
- name: var-log
  mountPath: /var/log/
  readOnly: true
(Diagram: the log-shipper pod and the node file system; paths shown: /var/log, /vector-data, /etc/vector)
- Vector – collects logs
- Reloader – validates the config and reloads Vector
- Kube RBAC proxy – protects metrics

Slide 38

Slide 38 text

VECTOR IN KUBERNETES
github.com/deckhouse/deckhouse/blob/main/modules/460-log-shipper/templates/daemonset.yaml
Excerpts from the manifest:
apiVersion: apps/v1
kind: DaemonSet
volumes:
- name: var-log
  hostPath:
    path: /var/log/
- name: vector-data-dir
  hostPath:
    path: /mnt/vector-data
- name: localtime
  hostPath:
    path: /etc/localtime
volumeMounts:
- name: var-log
  mountPath: /var/log/
  readOnly: true
terminationGracePeriodSeconds: 120
(Diagram: the log-shipper pod and the node file system; paths shown: /var/log, /vector-data, /etc/vector)
- Vector – collects logs
- Reloader – validates the config and reloads Vector
- Kube RBAC proxy – protects metrics

Slide 39

Slide 39 text

VECTOR IN KUBERNETES
github.com/deckhouse/deckhouse/blob/main/modules/460-log-shipper/templates/daemonset.yaml
Excerpts from the manifest:
apiVersion: apps/v1
kind: DaemonSet
volumes:
- name: var-log
  hostPath:
    path: /var/log/
- name: vector-data-dir
  hostPath:
    path: /mnt/vector-data
- name: localtime
  hostPath:
    path: /etc/localtime
volumeMounts:
- name: var-log
  mountPath: /var/log/
  readOnly: true
terminationGracePeriodSeconds: 120
shareProcessNamespace: true
(Diagram: the log-shipper pod and the node file system; paths shown: /var/log, /vector-data, /etc/vector)
- Vector – collects logs
- Reloader – validates the config and reloads Vector
- Kube RBAC proxy – protects metrics
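shareProcessNamespace: true is what lets a sidecar find and signal the Vector process. The Deckhouse reloader code is not shown in the slides, so the following is a hypothetical Python sketch of the idea: find the Vector PID through /proc, run `vector validate` on the new config, and send SIGHUP (which makes Vector reload its configuration) only if validation passes. find_pids, reload_if_valid, and vector_validate are illustrative names.

```python
import os
import signal
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def find_pids(process_name: str, proc_root: str = "/proc") -> list[int]:
    """Return PIDs whose /proc/<pid>/comm equals process_name.

    Only works across containers when the pod shares one process namespace.
    """
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:  # the process may have exited in the meantime
            continue
        if comm == process_name:
            pids.append(int(entry.name))
    return sorted(pids)


def reload_if_valid(config_path: str, pids: Sequence[int],
                    validate: Callable[[str], bool],
                    send_signal: Callable[[int, int], None] = os.kill) -> bool:
    """Signal Vector to reload only if the new config passes validation."""
    if not validate(config_path):
        return False  # Vector keeps running with the previous config
    for pid in pids:
        send_signal(pid, signal.SIGHUP)
    return True


def vector_validate(config_path: str) -> bool:
    """Run `vector validate` on the config; True if it exits with code 0."""
    result = subprocess.run(["vector", "validate", config_path],
                            capture_output=True, check=False)
    return result.returncode == 0
```

Validating before signalling means a broken ConfigMap update leaves the old, working pipeline in place instead of crashing the log shipper.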

Slide 41

Slide 41 text

PRACTICAL USE

Slide 42

Slide 42 text

CASE #1: NO SPACE LEFT ON THE DEVICE

Slide 43

Slide 43 text

CASE #1: NO SPACE LEFT ON THE DEVICE
$ lsof -nP | grep '(deleted)'

Slide 44

Slide 44 text

CASE #1: NO SPACE LEFT ON THE DEVICE
$ lsof -nP | grep '(deleted)'
vector 6331 root 25r REG 253,3 10602 72738831 /var/log/.../1.log (deleted)
vector 6331 root 44r REG 253,3 10239 33665268 /var/log/.../1.log (deleted)
vector 6331 6628 vector-wo root 25r REG 253,3 10602 72738831 /var/log/.../1.log (deleted)
vector 6331 6628 vector-wo root 44r REG 253,3 10239 33665268 /var/log/.../1.log (deleted)
vector 6331 6629 vector-wo root 25r REG 253,3 10602 72738831 /var/log/.../1.log (deleted)
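The same open file shows up once per thread (the rows with a TID and the truncated "vector-wo" task name), so summing the SIZE column naively overstates the pinned space. A small Python sketch, assuming the `lsof -nP` column layout shown above and file names without spaces; pinned_bytes is a hypothetical helper that de-duplicates by (DEVICE, NODE) before summing.

```python
from typing import Iterable


def pinned_bytes(lsof_lines: Iterable[str]) -> int:
    """Sum the sizes of deleted-but-still-open files from `lsof -nP` output.

    Rows are keyed by (DEVICE, NODE) so per-thread duplicates count once.
    """
    seen: dict[tuple[str, str], int] = {}
    for line in lsof_lines:
        if not line.rstrip().endswith("(deleted)"):
            continue
        fields = line.split()
        # Counted from the right: ... DEVICE SIZE/OFF NODE NAME "(deleted)"
        device, size, inode = fields[-5], fields[-4], fields[-3]
        seen[(device, inode)] = int(size)
    return sum(seen.values())
```

For the five rows above, only two inodes are involved (72738831 and 33665268), so the pinned space is 10602 + 10239 = 20841 bytes rather than the sum of all five rows.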

Slide 46

Slide 46 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: Vector)

Slide 47

Slide 47 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: Vector reads logs from /var/log/pods)

Slide 48

Slide 48 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: Vector tails /var/log/pods/{uid}/1.log, currently 10Mb)

Slide 49

Slide 49 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: /var/log/pods/{uid}/1.log grows to 20Mb)

Slide 50

Slide 50 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: /var/log/pods/{uid}/1.log grows to 50Mb)

Slide 51

Slide 51 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: the kubelet watches /var/log/pods/{uid}/1.log, now 50Mb)

Slide 53

Slide 53 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: the kubelet rotates /var/log/pods/{uid}/1.log; the new file is 10Mb)

Slide 54

Slide 54 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: the new /var/log/pods/{uid}/1.log grows to 50Mb)

Slide 55

Slide 55 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: Vector sends the logs from /var/log/pods/{uid}/1.log (50Mb) to Loki)

Slide 56

Slide 56 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: Loki responds to Vector with HTTP 429 Too Many Requests)

Slide 58

Slide 58 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: Loki keeps responding with 429; the kubelet rotates the log again, so a new 10Mb 1.log appears while Vector still holds the old 50Mb 1.log open although it is already deleted)

Slide 60

Slide 60 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: after another rotation, Vector holds two deleted 50Mb copies of 1.log open next to the current 10Mb 1.log, while Loki still responds with 429)

Slide 62

Slide 62 text

CASE #1: NO SPACE LEFT ON THE DEVICE
(Diagram: Vector now holds three deleted 50Mb copies of 1.log open; their disk space cannot be reclaimed until Vector closes them)

Slide 63

Slide 63 text

CASE #1: NO SPACE LEFT ON THE DEVICE
HOW TO SOLVE?
1. Tune the buffer settings (vector.dev/docs/about/under-the-hood/architecture/buffering-model/):
   - When full: Blocking (default) or Drop Newest
   - Type: In Memory (default) or Disk buffer
   - Max events: 1000 (default) or 10000
2. Rule of thumb: let logs leave the node as quickly as possible
3. If you are brave enough: sysctl -w fs.file-max=1000 (unsafe)
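Why the buffer mode matters for disk space can be shown with a toy bounded buffer. This Python sketch is not Vector's buffer code; the names max_events and when_full mirror Vector's buffer options ("block" and "drop_newest"), and BoundedBuffer is hypothetical. With "block", a full buffer stops the source, so Vector stops reading and keeps rotated files open; with "drop_newest", the source keeps moving and events are lost instead.

```python
from collections import deque


class BoundedBuffer:
    """Toy model of a sink buffer limited by max_events."""

    def __init__(self, max_events: int, when_full: str = "block") -> None:
        if when_full not in ("block", "drop_newest"):
            raise ValueError(f"unsupported when_full: {when_full}")
        self.max_events = max_events
        self.when_full = when_full
        self.events: deque = deque()
        self.dropped = 0

    def push(self, event) -> bool:
        """Return True if the source may move on to its next event."""
        if len(self.events) < self.max_events:
            self.events.append(event)
            return True
        if self.when_full == "drop_newest":
            self.dropped += 1  # the incoming event is discarded
            return True
        return False  # block: the source must stop reading and retry later

    def drain(self, n: int) -> list:
        """Simulate the sink successfully sending up to n events."""
        return [self.events.popleft() for _ in range(min(n, len(self.events)))]
```

While the downstream returns 429, drain() is effectively never called: a blocking buffer then pins the reader (and its file handles), whereas a dropping buffer trades data loss for free disk space. A larger max_events or a disk buffer only postpones the moment the buffer fills up.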

Slide 64

Slide 64 text

CASE #2: PROMETHEUS EXPLODED

Slide 65

Slide 65 text

CASE #2: PROMETHEUS EXPLODED
(Diagram: Vector collects logs from pods uid=a and uid=b and exposes one series per log file; Prometheus scrapes them)
vector_checkpoints_total{file="/var/log/pods/a/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/b/{container}/1.log"}

Slide 67

Slide 67 text

CASE #2: PROMETHEUS EXPLODED
(Diagram: the node now runs pods uid=c and uid=d, yet Vector still exposes series for a, b, c, and d, and Prometheus scrapes all of them)
vector_checkpoints_total{file="/var/log/pods/a/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/b/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/c/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/d/{container}/1.log"}

Slide 69

Slide 69 text

CASE #2: PROMETHEUS EXPLODED
(Diagram: the node now runs pods uid=e and uid=f, yet Vector exposes series for all six files, a to f)
vector_checkpoints_total{file="/var/log/pods/a/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/b/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/c/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/d/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/e/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/f/{container}/1.log"}
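The growth is easy to quantify: if the exporter keeps every `file` label value it has ever seen, the series count tracks the total number of pods that ever ran on the node, not the number currently running. A minimal Python sketch of that bookkeeping; series_without_expiry is a hypothetical helper, and each scrape is modelled as the set of log files alive at that moment.

```python
from typing import Iterable


def series_without_expiry(scrapes: Iterable[set[str]]) -> list[int]:
    """Series count exported at each scrape when label values are never forgotten."""
    seen: set[str] = set()
    counts = []
    for live_files in scrapes:
        seen |= live_files  # new pods add series, old ones are never removed
        counts.append(len(seen))
    return counts
```

For the three steps in the slides (pods a and b, then c and d, then e and f), only two files are alive at any time, but the exported series count goes 2, 4, 6 and keeps rising with every pod restart or rollout.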

Slide 71

Slide 71 text

CASE #2: PROMETHEUS EXPLODED
(Diagram: Vector exposes series for all six files, a to f; Prometheus scrapes them)
vector_checkpoints_total{file="/var/log/pods/a/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/b/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/c/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/d/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/e/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/f/{container}/1.log"}
Prometheus scrape config:
metric_relabel_configs:
- regex: 'file'
  action: labeldrop
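Dropping the label on the Prometheus side is a tempting shortcut, but a small model shows its limits. In this Python sketch (labeldrop and distinct are hypothetical helpers imitating Prometheus relabelling, where the regex must match the whole label name), the six series collapse into one label set. Prometheus then receives several samples with identical labels in one scrape, and Vector still holds all six series in its own memory, because relabelling happens only after the scrape.

```python
import re


def labeldrop(series: list[dict[str, str]], name_regex: str) -> list[dict[str, str]]:
    """Remove every label whose name fully matches name_regex (Prometheus-style)."""
    pattern = re.compile(name_regex)
    return [{k: v for k, v in labels.items() if not pattern.fullmatch(k)}
            for labels in series]


def distinct(series: list[dict[str, str]]) -> int:
    """Number of distinct label sets, i.e. how many series Prometheus would see."""
    return len({tuple(sorted(labels.items())) for labels in series})
```

With one series per pod a to f, distinct() returns 6 before relabelling and 1 after, so the values for different files are no longer distinguishable.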

Slide 73

Slide 73 text

CASE #2: PROMETHEUS EXPLODED
HOW TO SOLVE? expire_metrics_secs=60
(Diagram: the node runs pods uid=e and uid=f; Vector still exposes series for all six files, a to f)
vector_checkpoints_total{file="/var/log/pods/a/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/b/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/c/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/d/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/e/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/f/{container}/1.log"}

Slide 74

Slide 74 text

CASE #2: PROMETHEUS EXPLODED
HOW TO SOLVE? expire_metrics_secs=60
(Diagram: the series for a and b have expired; Vector exposes only c, d, e, and f)
vector_checkpoints_total{file="/var/log/pods/c/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/d/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/e/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/f/{container}/1.log"}

Slide 75

Slide 75 text

CASE #2: PROMETHEUS EXPLODED
HOW TO SOLVE? expire_metrics_secs=60
(Diagram: only the series for the running pods e and f remain)
vector_checkpoints_total{file="/var/log/pods/e/{container}/1.log"}
vector_checkpoints_total{file="/var/log/pods/f/{container}/1.log"}
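The effect of expire_metrics_secs can be modelled as a registry that forgets series which have not been updated recently. A simplified Python sketch, not Vector's implementation: here expiry is checked at scrape time, and ExpiringRegistry is a hypothetical class. Note the side effect in the comment: a forgotten counter starts again from zero when it is next incremented.

```python
class ExpiringRegistry:
    """Counters that disappear after expire_secs without updates."""

    def __init__(self, expire_secs: float) -> None:
        self.expire_secs = expire_secs
        self.last_update: dict[str, float] = {}
        self.values: dict[str, float] = {}

    def inc(self, series: str, now: float, by: float = 1) -> None:
        self.values[series] = self.values.get(series, 0) + by
        self.last_update[series] = now

    def scrape(self, now: float) -> dict[str, float]:
        """Drop idle series, then return what an exporter would expose."""
        for series, ts in list(self.last_update.items()):
            if now - ts > self.expire_secs:
                del self.last_update[series]
                del self.values[series]  # a later inc() restarts this counter from 0
        return dict(self.values)
```

With expire_secs=60, files a and b updated at t=0 are still exported at t=30 but are gone by t=70, while file c, updated at t=70, remains.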

Slide 76

Slide 76 text

CASE #2: PROMETHEUS EXPLODED
HOW TO SOLVE? expire_metrics_secs=60
(Chart: vector_component_errors_total over time. 3 errors occur, then 4 more errors bring the counter to 7; then expiration is triggered and the series is empty; then 3 new errors make the counter reappear at 3.)
This behavior makes the result of the PromQL rate function equal to zero.
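Why the recreated counter is invisible to rate: after expiration, the series comes back already at its new value, and the samples before the gap (at least expire_metrics_secs long) fall outside a short range window. The following simplified Python increase() sums deltas between consecutive samples and treats a drop as a counter reset; unlike PromQL, it does no extrapolation, so it is only an approximation.

```python
def increase(samples: list[float]) -> float:
    """Simplified PromQL increase() over samples inside one range window."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter was reset: count the new value from zero.
        total += cur - prev if cur >= prev else cur
    return total
```

Without expiration the window sees 7, 7, 10 and increase() reports the 3 new errors. With expiration the window only sees the recreated series at 3, 3, so the same 3 errors produce an increase (and therefore a rate) of zero. This is especially harmful for error counters, which are updated rarely and are exactly the series most likely to expire.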

Slide 77

Slide 77 text

CASE #2: PROMETHEUS EXPLODED
HOW TO SOLVE? expire_metrics_secs=60

Slide 78

Slide 78 text

CASE #2: PROMETHEUS EXPLODED
HOW TO SOLVE?
- expire_metrics_secs=60
- A patch for Vector that removes the file label

Slide 79

Slide 79 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE

Slide 80

Slide 80 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
(Diagram: three Vector instances talk to the Kubernetes API)

Slide 83

Slide 83 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
(Graphs: control-plane node memory consumption; etcd memory consumption)

Slide 86

Slide 86 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
(Diagram: each Vector sends LIST /api/v1/pods?fieldSelector=spec.nodeName=$NODE_NAME to the Kubernetes API and receives the 110 pods of its node)

Slide 87

Slide 87 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
(Diagram: each Vector sends LIST /api/v1/pods?fieldSelector=spec.nodeName=$NODE_NAME and receives 110 pods, but to answer it the API server reads ALL pods from etcd under /registry/<resource>/<namespace>/<name>)

Slide 88

Slide 88 text

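CASE #3: KUBERNETES CONTROL PLANE OUTAGE
(Diagram: each Vector sends LIST /api/v1/pods?fieldSelector=spec.nodeName=$NODE_NAME and receives 110 pods, while the API server reads ALL pods from etcd under /registry/<resource>/<namespace>/<name>; RAM usage grows on both the API server and etcd)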
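The memory spike follows from simple arithmetic: each node returns only its own pods, but, as the diagram shows, every uncached LIST makes the API server read all pods from etcd and filter them afterwards. So the work grows with nodes × total pods, i.e. quadratically with cluster size. A back-of-the-envelope Python sketch; objects_read_from_etcd is a hypothetical helper, and the cached branch assumes the request is served from the API server's watch cache.

```python
def objects_read_from_etcd(nodes: int, pods_per_node: int, cached: bool) -> int:
    """Pod objects read from etcd when every node's log shipper LISTs its pods once."""
    total_pods = nodes * pods_per_node
    if cached:
        return 0  # resourceVersion=0: answered from the watch cache, no per-request etcd read
    # Each LIST scans every pod in etcd; the nodeName field selector is applied afterwards.
    return nodes * total_pods
```

For example, 100 nodes with 110 pods each (11,000 pods) means 1,100,000 pod objects read for one round of LISTs, for instance when all log-shipper pods restart together; doubling the cluster quadruples that number.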

Slide 90

Slide 90 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
HOW TO SOLVE?

Slide 91

Slide 91 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
HOW TO SOLVE?
1. Cache read (resourceVersion=0)
LIST /api/v1/pods?fieldSelector=spec.nodeName=$NODE_NAME&resourceVersion=0

Slide 93

Slide 93 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
HOW TO SOLVE?
1. Cache read (resourceVersion=0)
LIST /api/v1/pods?fieldSelector=spec.nodeName=$NODE_NAME&resourceVersion=0
In Vector: use_apiserver_cache=true
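The only difference between the two requests is one query parameter. A minimal Python sketch that builds both variants with the standard library; pods_list_url is a hypothetical helper. With resourceVersion=0, the API server may answer from its watch cache, which is much cheaper but can return slightly stale data, an acceptable trade-off for log metadata.

```python
from urllib.parse import urlencode


def pods_list_url(api_server: str, node_name: str, from_cache: bool) -> str:
    """Build the per-node pod LIST URL, optionally allowing a cached answer."""
    params = {"fieldSelector": f"spec.nodeName={node_name}"}
    if from_cache:
        params["resourceVersion"] = "0"  # "any version is fine": served from the watch cache
    return f"{api_server}/api/v1/pods?{urlencode(params)}"
```

urlencode percent-encodes the "=" inside the field selector, so the cached variant for node worker-1 ends with fieldSelector=spec.nodeName%3Dworker-1&resourceVersion=0.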

Slide 94

Slide 94 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
HOW TO SOLVE?
1. Cache read (resourceVersion=0)
2. Limit concurrent requests (API Priority and Fairness)
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: PriorityLevelConfiguration
metadata:
  name: limit-list-custom
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 5
    limitResponse:
      queuing:
        handSize: 4
        queueLengthLimit: 50
        queues: 16
      type: Queue
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: limit-list-custom
spec:
  priorityLevelConfiguration:
    name: limit-list-custom
  distinguisherMethod:
    type: ByUser
  rules:
  - resourceRules:
    - apiGroups: [""]
      clusterScope: true
      namespaces: ["*"]
      resources: ["pods"]
      verbs: ["list", "get"]
    subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: "***"
        namespace: "***"
Note: flowcontrol.apiserver.k8s.io/v1beta1 is no longer served since Kubernetes 1.26; newer clusters use v1beta3 or v1, where assuredConcurrencyShares is named nominalConcurrencyShares.
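The queuing parameters work through shuffle sharding: each flow (with distinguisherMethod ByUser, one flow per user) is dealt a "hand" of handSize queues out of queues, and each request joins the shortest queue in its hand. The Python sketch below is a simplified model, not the exact dealer used by kube-apiserver; deal_hand and enqueue are hypothetical names. Since all Vector pods typically share one ServiceAccount, they form a single flow and can occupy at most handSize × queueLengthLimit = 4 × 50 = 200 queue slots before further requests are rejected (HTTP 429).

```python
import hashlib
from typing import Optional


def deal_hand(flow_id: str, queues: int, hand_size: int) -> list[int]:
    """Deterministically pick hand_size distinct queues for a flow (simplified)."""
    digest = int.from_bytes(hashlib.sha256(flow_id.encode()).digest(), "big")
    remaining = list(range(queues))
    hand = []
    for _ in range(hand_size):
        digest, idx = divmod(digest, len(remaining))
        hand.append(remaining.pop(idx))
    return hand


def enqueue(flow_id: str, queue_lengths: list[int], hand_size: int,
            queue_length_limit: int) -> Optional[int]:
    """Put a request into the shortest queue of the flow's hand.

    Returns the queue index, or None if even that queue is full (request rejected).
    """
    hand = deal_hand(flow_id, len(queue_lengths), hand_size)
    best = min(hand, key=lambda q: queue_lengths[q])
    if queue_lengths[best] >= queue_length_limit:
        return None
    queue_lengths[best] += 1
    return best
```

The point of the hand is isolation: a noisy flow can only fill its own few queues, so other flows that share the priority level still find short queues in their hands.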

Slide 96

Slide 96 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
HOW TO SOLVE?
1. Cache read (resourceVersion=0)
2. Limit concurrent requests (API Priority and Fairness)

Slide 97

Slide 97 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
HOW TO SOLVE?
1. Cache read (resourceVersion=0)
2. Limit concurrent requests (API Priority and Fairness)
3. Use the kubelet API instead of the Kubernetes API: pod metadata can be fetched from the kubelet /pods endpoint
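The kubelet's /pods endpoint returns a PodList containing only the pods of that node, so each log shipper can enrich its records locally without touching the API server or etcd. A minimal Python sketch of the enrichment lookup, assuming the JSON response has already been fetched and decoded (authentication and TLS against the kubelet port are out of scope here); index_pods_by_uid is a hypothetical helper keyed by the pod UID that also appears in /var/log/pods paths.

```python
from typing import Any


def index_pods_by_uid(pod_list: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Build a UID -> metadata lookup from a kubelet /pods response (a PodList)."""
    index = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        index[meta["uid"]] = {
            "namespace": meta.get("namespace"),
            "name": meta.get("name"),
            "labels": meta.get("labels", {}),
            "node": pod.get("spec", {}).get("nodeName"),
        }
    return index
```

The load then scales with the number of pods per node (at most 110 by default) instead of the number of pods in the whole cluster, and the control plane is not involved at all.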

Slide 98

Slide 98 text

CASE #3: KUBERNETES CONTROL PLANE OUTAGE
HOW TO SOLVE?
1. Cache read (resourceVersion=0)
2. Limit concurrent requests (API Priority and Fairness)
3. Use the kubelet API instead of the Kubernetes API

Slide 99

Slide 99 text

CONCLUSION
1. Great for building platforms
2. Vector is awesome; seriously, deploy it today
3. Share practical cases and learn together

Slide 100

Slide 100 text

THANK YOU! Q&A
MAKSIM NABOKIKH, Platform Lead
Contact us: @nabokihms, [email protected]
Open source tools: github.com/werf, github.com/palark
Our blogs and social media: palark.com, twitter.com/palark_com