Workshop: Hidden Signals in K8s Clusters: A Data-Driven Approach to Reliability

Slide 1

Slide 1 text

The Hidden Signals You Miss in K8s Clusters: A Data-Driven Approach to Complex Reliability Issues, with Andrei Pokhilko, Open Source Dev Lead, Komodor

Slide 2

Slide 2 text

How We Uncovered Hidden Signals by Analyzing Cluster Data

Kubernetes is popular, yet its complexity often leaves users struggling when issues arise. At Komodor, we saw an opportunity to leverage the wealth of data already existing within clusters to help users identify and resolve complex problems. Our journey wasn’t straightforward. Kubernetes is a dynamic system with numerous moving parts, generating an enormous amount of data. Through our “Reliability Insights” project, we aimed to analyze raw data from hundreds of Kubernetes clusters and transform it into actionable intelligence. Extensive research and experimentation led us to develop methods to clean and process this data, uncovering remarkable findings. We identified multiple types of reliability-related insights, each offering a unique perspective on cluster health and performance. In this talk, we’ll share our journey of discovery. We’ll explore how we combined various data points to reveal hidden issues, discuss the challenges we faced in making sense of Kubernetes’ vast data landscape, and demonstrate how these insights can level up one’s cluster reliability management. Join us to learn how extracting meaningful patterns from the complex world of Kubernetes can transform your understanding of what is going on in your infrastructure.

Slide 3

Slide 3 text

Abstract

Kubernetes infrastructure functions much like a living organism: pods spawn and terminate like a heartbeat, nodes scale up and down like breaths, and events trigger like nerve impulses. While biological organisms are flawless, technical systems always have room for improvement. By observing these systems closely over time, one can find ways to improve them. However, because humans cannot process vast amounts of data manually, we have to use tools to analyze raw data and provide actionable insights for k8s users. Within the Komodor platform, data from hundreds of Kubernetes clusters flows continuously, placing us in a unique position to analyze this data for the benefit of our customers. This led to the initiation of the "Reliability Insights" project, which, after extensive research and experimentation, has become an integral part of our main platform. During our research, we identified a dozen types of reliability-related insights. While only two-thirds of these insights made it into the final product, all of them were valuable, and some of the unreleased ones were particularly cool. In this talk, we will share our observations and findings from the insights research, providing a deeper understanding of each type of insight and explaining the importance of analyzing the life of clusters from a higher-level perspective.

Slide 4

Slide 4 text

Kubernetes has become a victim of its own success. Its popularity and complexity mean that while many organizations eagerly adopt it, they often find themselves at a loss when issues arise. We saw an opportunity to leverage the wealth of data already existing within clusters to help users identify and resolve complex problems. Our journey wasn’t straightforward. Kubernetes is a living, breathing entity with numerous moving parts. Pods spawn and terminate like heartbeats, nodes scale up and down like breaths, and events trigger like nerve impulses. This constant flux generates an enormous amount of data, which, if properly harnessed, can provide invaluable insights. At Komodor, we’re in a unique position. Data from hundreds of Kubernetes clusters flows through our platform continuously. This led us to initiate the “Reliability Insights” project, aiming to analyze this raw data and transform it into actionable intelligence for our users. Through extensive research and experimentation, we developed methods to clean and process this data, uncovering some truly remarkable findings. We identified a dozen types of reliability-related insights, each offering a unique perspective on cluster health and performance. In this talk, we’ll share our journey of discovery. We’ll explore how we combined various data points to reveal hidden issues, discuss the challenges we faced in making sense of Kubernetes’ vast data landscape, and demonstrate how these insights can revolutionize the way teams approach cluster reliability. Join us as we delve into the art and science of extracting meaningful patterns from the seemingly chaotic world of Kubernetes, and learn how this approach can transform your ability to maintain and optimize your infrastructure.

Slide 5

Slide 5 text

What is hiding in plain sight? ● Kubernetes is an infrastructure layer on top of apps ● Simple signals everyone can see: statuses, events ● That information + over time + inter-correlated + cross-cluster = Reliability Insights

Slide 6

Slide 6 text

Areas of Interest ● Timeline analysis ● Correlation analysis ● Clustering analysis ● Tricky misconfigurations

Slide 7

Slide 7 text

Komodor Sees Kubernetes Top to Bottom (Diagram labels: K8s Fleet with Cluster Group 1, Cluster Group 2, Cluster Group N; Komodor Data Ingestion; Data Storage; Native k8s Events; Synthetic Events; CPU+RAM Monitoring)

Slide 8

Slide 8 text

Time to Hide in a Research Lab

Slide 9

Slide 9 text

Experiments Setup (Diagram labels: Postgres, AWS Timestream, CLI Output)

Slide 10

Slide 10 text

Examples of Reliability Insights

Slide 11

Slide 11 text

Container Restarts ● Density of restart events over time ● Grouping of pods belonging to a workload ● Thresholds may be subjective Type: Timeline analysis (Chart annotations: OutOfMemory, Limit)
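The restart-density idea can be sketched in a few lines of Python. This is a minimal illustration, not Komodor's implementation: the event tuple layout, the one-hour window, and the threshold of three restarts are all assumptions chosen for the example, and as the slide notes, any such threshold is subjective.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def restart_hotspots(events, window=timedelta(hours=1), threshold=3):
    """Return {workload: peak restart count in any sliding window}.

    events: iterable of (workload, pod, timestamp) tuples (hypothetical layout).
    Restarts of all pods of a workload are pooled before counting.
    """
    by_workload = defaultdict(list)
    for workload, _pod, ts in events:
        by_workload[workload].append(ts)

    hotspots = {}
    for workload, stamps in by_workload.items():
        stamps.sort()
        start = 0
        peak = 0
        for end, ts in enumerate(stamps):
            # Shrink the window from the left until it spans at most `window`.
            while ts - stamps[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= threshold:
            hotspots[workload] = peak
    return hotspots

t0 = datetime(2024, 1, 1, 12, 0)
events = [
    ("checkout", "checkout-a", t0),
    ("checkout", "checkout-b", t0 + timedelta(minutes=10)),
    ("checkout", "checkout-a", t0 + timedelta(minutes=25)),
    ("search", "search-a", t0),
    ("search", "search-a", t0 + timedelta(hours=5)),
]
print(restart_hotspots(events))  # {'checkout': 3}
```

Pooling restarts per workload rather than per pod matters because pods are replaced frequently, so a single pod rarely accumulates enough restarts to stand out.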

Slide 12

Slide 12 text

HPA Maxed and CPU Throttled ● Formula averaging multiple metrics over time ● Reveals compromised performance Type: Timeline analysis
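One way to combine the two signals is to measure how much of an observation period a workload spends both pinned at its HPA maximum and throttled. The sketch below assumes hypothetical sample fields (`replicas`, `max_replicas`, `throttled` as a fraction of CPU periods) and an arbitrary 25% throttling threshold; the actual formula used in the product is not given in the slides.

```python
def compromised_ratio(samples, throttle_threshold=0.25):
    """Fraction of samples where the HPA is at max AND CPU is throttled.

    samples: list of dicts with hypothetical keys
      replicas, max_replicas, throttled (0.0 to 1.0).
    """
    if not samples:
        return 0.0
    hits = sum(
        1
        for s in samples
        if s["replicas"] >= s["max_replicas"] and s["throttled"] >= throttle_threshold
    )
    return hits / len(samples)

samples = [
    {"replicas": 10, "max_replicas": 10, "throttled": 0.40},  # maxed and throttled
    {"replicas": 10, "max_replicas": 10, "throttled": 0.05},  # maxed, not throttled
    {"replicas": 6, "max_replicas": 10, "throttled": 0.50},   # throttled, can still scale
    {"replicas": 10, "max_replicas": 10, "throttled": 0.30},  # maxed and throttled
]
print(compromised_ratio(samples))  # 0.5
```

Requiring both conditions at once is the point: a maxed HPA alone may be healthy, and throttling alone can still be relieved by scaling out.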

Slide 13

Slide 13 text

Cluster EOL / Deprecated API ● Vital for maintaining clusters ● Upgrading requires API migration Type: Tricky Misconfiguration
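A deprecated-API check boils down to comparing the `apiVersion` and `kind` of stored objects against the Kubernetes version in which that API was removed. The sketch below uses a small illustrative subset of removals; a real check would need the full list from the official deprecated API migration guide, and the function name and manifest shape here are assumptions.

```python
# Illustrative subset only: (apiVersion, kind) -> Kubernetes version that removed it.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}

def blocked_by_upgrade(manifests, target_version):
    """List objects whose API no longer exists in target_version, e.g. (1, 25)."""
    findings = []
    for manifest in manifests:
        key = (manifest["apiVersion"], manifest["kind"])
        removed = REMOVED_IN.get(key)
        if removed is not None and target_version >= removed:
            findings.append((manifest["metadata"]["name"], *key))
    return findings

manifests = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "nightly"}},
    {"apiVersion": "autoscaling/v2beta2", "kind": "HorizontalPodAutoscaler",
     "metadata": {"name": "web"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
]
print(blocked_by_upgrade(manifests, (1, 25)))
```

With a target of 1.25 only the CronJob is reported; raising the target to 1.26 also reports the HorizontalPodAutoscaler.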

Slide 14

Slide 14 text

Noisy Neighbor in K8s ● Phase 1: Workload A grows resource usage ● Phase 2: k8s evicts workload B from nodes ● The pattern above recurs statistically over time ● Expensive and tricky to calculate Type: Timeline and correlation analysis
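The two-phase pattern can be approximated by counting how often an eviction of one workload follows a usage spike of another workload on the same node. The sketch below is an assumption-heavy illustration: the event tuples, the 10-minute lag, and the minimum of two occurrences are invented for the example, and the nested scan shows why the slide calls this expensive.

```python
from collections import Counter

def noisy_neighbor_pairs(growth_events, evictions, max_lag=600, min_occurrences=2):
    """Count (suspect, victim) pairs where an eviction follows a usage spike.

    growth_events: list of (node, workload, ts_seconds) usage spikes (phase 1).
    evictions:     list of (node, workload, ts_seconds) evictions (phase 2).
    Both layouts are hypothetical.
    """
    pairs = Counter()
    for node, victim, evicted_at in evictions:
        suspects = {
            workload
            for g_node, workload, grew_at in growth_events
            if g_node == node
            and workload != victim
            and 0 <= evicted_at - grew_at <= max_lag
        }
        for suspect in suspects:
            pairs[(suspect, victim)] += 1
    # A single coincidence proves little; keep only repeated pairs.
    return {pair: n for pair, n in pairs.items() if n >= min_occurrences}

growth_events = [("node-1", "batch", 100), ("node-2", "batch", 5000), ("node-1", "api", 90000)]
evictions = [("node-1", "web", 400), ("node-2", "web", 5300), ("node-1", "web", 50000)]
print(noisy_neighbor_pairs(growth_events, evictions))  # {('batch', 'web'): 2}
```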

Slide 15

Slide 15 text

Insights That Didn’t Survive

Slide 16

Slide 16 text

Idle Workloads ● Near-zero CPU consumption ● Opposite of throttled CPU ● Questionable in many situations Type: Timeline analysis
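Detecting idle workloads is the simplest of these checks, which is part of why it is questionable: near-zero CPU may be perfectly legitimate for standby or scheduled workloads. A minimal sketch, assuming CPU samples in millicores and an arbitrary threshold of 5m:

```python
def idle_workloads(cpu_samples, threshold_millicores=5, min_samples=3):
    """Return workloads whose CPU never exceeded the threshold.

    cpu_samples: {workload: [millicores, ...]} (hypothetical layout).
    Workloads with too few samples are skipped rather than flagged.
    """
    idle = []
    for workload, samples in cpu_samples.items():
        if len(samples) >= min_samples and max(samples) <= threshold_millicores:
            idle.append(workload)
    return sorted(idle)

cpu_samples = {
    "cron-report": [0, 1, 0, 2],
    "api": [120, 90, 300],
    "new-svc": [0],  # too little history to judge
}
print(idle_workloads(cpu_samples))  # ['cron-report']
```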

Slide 17

Slide 17 text

Issues After Deploy (Diagram labels: Deploy, Fixing Issue) ● Timeline analysis / Correlation analysis ● Include at all? Type: Timeline and correlation analysis

Slide 18

Slide 18 text

Outdated Helm Charts ● Like our Open Source Helm Dashboard ● More automated ● Heavy to run ● Problem with private repositories Type: Tricky Misconfiguration

Slide 19

Slide 19 text

Node Termination Impact ● Node termination correlated with workload health ● For spot instance lovers, also node autoscaling ● Pod termination statistics ● Resulting reliability issues ● Expensive to calculate Type: Correlation analysis
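A rough version of this correlation is the share of a workload's pod terminations that happen close in time to the termination of the node they ran on. The sketch below assumes hypothetical input layouts and a two-minute window; it is an illustration of the idea, not the method used in the research.

```python
def termination_impact(node_terminations, pod_terminations, window=120):
    """Per workload, the fraction of pod terminations near a node termination.

    node_terminations: {node: [ts_seconds, ...]}
    pod_terminations:  list of (workload, node, ts_seconds)
    Both layouts are hypothetical.
    """
    stats = {}
    for workload, node, ts in pod_terminations:
        total, caused = stats.get(workload, (0, 0))
        near = any(abs(ts - t) <= window for t in node_terminations.get(node, []))
        stats[workload] = (total + 1, caused + (1 if near else 0))
    return {w: caused / total for w, (total, caused) in stats.items()}

node_terminations = {"spot-1": [1000]}
pod_terminations = [
    ("web", "spot-1", 1030),      # died with the node
    ("web", "spot-1", 5000),      # unrelated restart
    ("db", "ondemand-1", 1010),   # different node
]
print(termination_impact(node_terminations, pod_terminations))  # {'web': 0.5, 'db': 0.0}
```

A high ratio for a workload suggests its instability comes from the node layer (for example spot reclamation or autoscaler scale-down) rather than from the application itself.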

Slide 20

Slide 20 text

Services That Fail Together ● Workloads that fail in quick succession (DBSCAN over timeline) ● Consistent historical patterns ● Expensive to calculate ● Hard to suggest acting on it Type: Clustering analysis (Timeline diagram: failures of workloads A, B, and C plotted over time)
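DBSCAN over a timeline means clustering failure timestamps in one dimension, then looking at which workloads share a cluster and how often the same group recurs. The sketch below is a plain-Python 1-D DBSCAN written for illustration (a library implementation such as scikit-learn's would be the usual choice); the `eps`, `min_samples`, and `min_repeats` values are assumptions.

```python
from collections import Counter

def dbscan_1d(times, eps, min_samples):
    """Classic DBSCAN on 1-D points; returns a label per point, -1 for noise."""
    n = len(times)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if abs(times[j] - times[i]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_samples:  # core point: expand the cluster
                queue.extend(more)
    return labels

def failing_together(failures, eps=60, min_samples=2, min_repeats=2):
    """failures: list of (ts_seconds, workload). Returns recurring co-failure groups."""
    failures = sorted(failures)
    labels = dbscan_1d([ts for ts, _ in failures], eps, min_samples)
    groups = {}
    for (_ts, workload), label in zip(failures, labels):
        if label != -1:
            groups.setdefault(label, set()).add(workload)
    counts = Counter(frozenset(g) for g in groups.values() if len(g) > 1)
    return {tuple(sorted(g)): c for g, c in counts.items() if c >= min_repeats}

failures = [
    (0, "A"), (20, "B"),
    (1000, "A"), (1030, "B"), (1050, "C"),
    (5000, "B"), (5010, "C"),
    (9000, "A"), (9040, "B"),
    (20000, "C"),
]
print(failing_together(failures))  # {('A', 'B'): 2}
```

The neighbor search here is quadratic, which is acceptable for a sketch but hints at why this insight was expensive to calculate at fleet scale.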

Slide 21

Slide 21 text

From Lab to Product

Slide 22

Slide 22 text

Experiments Setup (Diagram labels: Postgres, AWS Timestream, CLI Output, Freedom, Ideas)

Slide 23

Slide 23 text

Internal Review Setup (Diagram labels: Scheduled Runner (single-thread), Postgres, AWS Timestream, Results DB, AppSmith UI, Python, Python (optimizing SQL queries, adding DB indexes))

Slide 24

Slide 24 text

Review & Dogfooding Rounds

Slide 25

Slide 25 text

Production Setup (Diagram labels: Scheduled Runner (multi-thread, retries, priorities), Postgres, AWS Timestream, Python, Python, Amazon SQS Jobs)
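The three properties named for the production runner (multiple threads, retries, priorities) can be illustrated with the Python standard library alone. This sketch uses an in-process `queue.PriorityQueue` as a stand-in for the Amazon SQS job queue on the slide; the function name, job tuple layout, and limits are all assumptions, not the production design.

```python
import itertools
import queue
import threading

def run_jobs(jobs, workers=4, max_attempts=3):
    """Run (priority, name, callable) jobs; lower priority numbers run first."""
    pending = queue.PriorityQueue()
    order = itertools.count()  # tie-breaker so callables are never compared
    for priority, name, fn in jobs:
        pending.put((priority, next(order), name, fn, 1))

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                priority, _, name, fn, attempt = pending.get_nowait()
            except queue.Empty:
                return
            try:
                outcome = ("ok", fn())
            except Exception as exc:
                if attempt < max_attempts:
                    with lock:
                        ticket = next(order)
                    # Re-queue with the same priority and one more attempt used.
                    pending.put((priority, ticket, name, fn, attempt + 1))
                    continue
                outcome = ("failed", str(exc))
            with lock:
                results[name] = outcome

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

calls = {"flaky": 0}

def flaky():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise RuntimeError("transient error")
    return "done"

def broken():
    raise ValueError("bad query")

results = run_jobs([(1, "restarts", lambda: 7), (2, "flaky", flaky), (5, "broken", broken)])
print(results)
```

In this example the flaky job succeeds on its third attempt, while the broken job is reported as failed once its attempts are exhausted.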

Slide 26

Slide 26 text

Conclusions

Slide 27

Slide 27 text

What we have learned today ● Reliability insights are derived from basic Kubernetes info ● Implementation is challenging and resource-intensive ● Some insights are viable, others are not ● Komodor Platform provides good examples