How to Cope When Kubernetes Misbehaves

A talk by Komodor co-founder and CTO Itiel Shwartz at DevOpsCon London 2022.

Kubernetes has given us the power to move extremely fast. But without any safeguards in place, we more often than not find ourselves staring at it blankly, trying to figure out what the hell went wrong (sounds familiar, right?).
In this session, we are going to do a live demonstration of common Kubernetes failure scenarios, both app- and infra-related, and may the Gods of the demo be kind to us.

We will laugh a little and cry a little as we cover Kubernetes monitoring, observability & troubleshooting best practices, and talk about metrics, distributed tracing, logging, network visualization, and more. But cheer up! We’ll wrap up by introducing some helpful tools for finding and fixing issues as fast as possible.

Komodor

May 06, 2022

Transcript

  1. Komodor <> Epsagon | May 2021
     How to Cope When Your Kubernetes Misbehaves
     Itiel Shwartz, Co-Founder & CTO @ Komodor
  2. Who am I?
     • The CTO and co-founder of Komodor, a startup building the first K8s-native troubleshooting platform
     • A big believer in dev empowerment and moving fast
     • Worked at eBay, Forter, and Rookout (first developer); a lot of backend and infra dev experience (“DevOps”)
     • K8s fan 😃
  3. Source: CNCF 2022 announcements about container and Kubernetes adoption
     “The usage of Kubernetes is continuing to grow and reached its highest level ever, with 96% of organizations using or evaluating the technology.”
  4. Kubernetes is complex.
     • People abuse it
     • It looks easy to start… until you understand that it isn’t
     • We are going to talk about moving fast while staying alive
     • By the end of this session, you will be more proficient in managing your K8s system
  5. What Makes K8s Troubleshooting So Complex?
     Issues happen on a daily basis and it’s almost impossible to understand what causes them. 85% of incidents can be traced to system changes:
     • Blind spot: changes are unaudited or hidden
     • Fragmented data: events are scattered between hundreds of different tools
     • Butterfly effect: distributed systems make it harder to understand the effect of a single change
  6. So What Are the 4 Things You Can Do When Your K8s Is Misbehaving?
  7. 4 Things to Do When Your K8s Is Misbehaving
     SOLUTION #1: Distributed Tracing
  8. Best Practice #1: Distributed Tracing
     Traces and Spans
  9. Best Practice #1: Distributed Tracing
     Why Distributed Tracing Is IMPORTANT
     • Troubleshooting & debugging techniques differ between monolithic and microservices architectures
       ◦ Monolithic troubleshooting approach: isolate a single instance of the monolith and reproduce the problem
       ◦ Microservices troubleshooting approach: the above is no longer feasible, because no single service provides a complete picture of the performance or correctness of the app as a whole
  10. Best Practice #1: Distributed Tracing
      How to Do Distributed Tracing Properly
      • Invest the time in instrumentation
      • Add relevant metadata and context
      • Spend the time familiarizing the team with these tools!
      • Re-iterate over the collected data and make sure to find the blind spots
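The instrumentation advice above can be sketched with a minimal, hand-rolled span recorder. This is an illustrative toy, not a real tracing library (a real project would use something like OpenTelemetry); all names (`span`, `SPANS`, the metadata fields) are assumptions for the example. The point is the shape: every span carries a trace ID propagated to its children, plus caller-supplied metadata and context.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-process span store; a real setup exports to a tracing backend.
SPANS = []

@contextmanager
def span(name, trace_id=None, parent_id=None, **metadata):
    """Record a timed span with caller-supplied metadata and trace context."""
    record = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,  # propagated across services
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "metadata": metadata,  # relevant context: customer id, order id, etc.
        "start": time.time(),
    }
    try:
        yield record
    finally:
        record["duration"] = time.time() - record["start"]
        SPANS.append(record)

# Usage: a "checkout" request that fans out to a child span.
with span("checkout", customer="c-42") as parent:
    with span("charge-card", trace_id=parent["trace_id"],
              parent_id=parent["span_id"]):
        pass  # the call to the payment service would go here
```

Because the inner span finishes first, it is appended first; both spans share one trace ID, which is exactly what lets a tracing backend stitch a cross-service request back together.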
  11. 4 Things to Do When Your K8s Is Misbehaving
      SOLUTION #1: Distributed Tracing
      SOLUTION #2: Metrics
  12. Best Practice #2: Metrics - Monitor Critical CPU & Node-Related Metrics
      Metric #1: CPU Requests vs. Actual Usage
      Make sure that your requests are aligned with your actual usage:
      ◦ If requests < actual usage (over-utilization), your app will work slower due to insufficient resources on the node
      ◦ If requests > actual usage (under-utilization), you’ll be “wasting” cores on your nodes, which will sit unused
      Goal: define the pod requests as 100% and aim for an actual usage rate of 60%–80% at the 90th percentile
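The 60%–80% goal above can be made concrete with a tiny helper. This is an illustrative sketch, not part of Kubernetes or any monitoring tool; the function name and thresholds are assumptions. Usage and requests are in millicores, the unit used in K8s manifests.

```python
# Classify a pod's 90th-percentile CPU usage against its CPU request,
# using the suggested 60-80% target band. Purely illustrative.
def utilization_status(request_millicores, p90_usage_millicores,
                       low=0.60, high=0.80):
    ratio = p90_usage_millicores / request_millicores
    if ratio > 1.0:
        return "over-utilized"   # request < usage: app may be starved
    if ratio < low:
        return "under-utilized"  # request > usage: cores are wasted
    if ratio <= high:
        return "ok"              # inside the 60-80% target band
    return "near-limit"         # 80-100%: consider raising the request

# Example: a pod requesting 500m CPU with a p90 usage of 350m sits at 70%.
print(utilization_status(500, 350))  # -> ok
print(utilization_status(500, 600))  # -> over-utilized
```

The same check applies verbatim to CPU limits, as the next slide suggests: just feed it the limit instead of the request.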
  13. Best Practice #2: Metrics - Monitor Critical CPU & Node-Related Metrics
      Metric #2: CPU Limit vs. Actual Usage
      Make sure to enforce the CPU limit boundaries of your workloads during runtime.
      • When a container reaches its CPU limit it gets throttled: it receives fewer CPU cycles from the OS than it otherwise could, which eventually results in slower execution time!
      Goal: monitor CPU limits the same way I suggested monitoring CPU requests.
  14. Best Practice #2: Metrics - Monitor Critical CPU & Node-Related Metrics
      Metric #3: Nodes Failing Status Checks
      Common node health conditions:
      • Ready
      • DiskPressure
      • MemoryPressure
      • PIDPressure
      • NetworkUnavailable
      If the Ready condition is False, or any of the other conditions turns True → something bad is up!
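The rule on this slide ("Ready must be True, everything else must stay False") is easy to automate. A minimal sketch, assuming conditions shaped like the `status.conditions` list that `kubectl get node -o json` returns (`type`/`status` string pairs); the helper name is hypothetical.

```python
# Flag a node as unhealthy when Ready != "True" or when any pressure /
# availability condition has flipped to "True". Illustrative only.
def node_is_healthy(conditions):
    status = {c["type"]: c["status"] for c in conditions}
    if status.get("Ready") != "True":
        return False
    bad = ("DiskPressure", "MemoryPressure",
           "PIDPressure", "NetworkUnavailable")
    return all(status.get(kind, "False") == "False" for kind in bad)

# Example: a node under memory pressure should be flagged even though
# its Ready condition still reports "True".
conditions = [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "True"},
]
print(node_is_healthy(conditions))  # -> False
```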
  15. 4 Things to Do When Your K8s Is Misbehaving
      SOLUTION #1: Distributed Tracing
      SOLUTION #2: Metrics
      SOLUTION #3: Logging
  16. Best Practice #3: Logging
      Add logs for your fellow humans
      • When writing logs, don’t think about yourself but about the dev after you
      • When unexpected things happen, you should log them
      • Each log should contain as much relevant data as possible
      • It’s OK to add and remove logs over time
  17. Best Practice #3: Logging
      Here’s how to log the “unexpected”
      Tag and label your logs properly by including:
      • The proper service name (not the pod name!)
      • Version
      • Cluster environment information
      • Business-specific data
      REMEMBER: things will FAIL! So even if you are not sure this code will ever run → log anyway :)
  18. 4 Things to Do When Your K8s Is Misbehaving
      SOLUTION #1: Distributed Tracing
      SOLUTION #2: Metrics
      SOLUTION #3: Logging
      SOLUTION #4: Change Tracking
  19. Best Practice #4: Change Tracking
      Adding the Final Missing Piece
      • No monitoring or observability tool provides a system-wide, historical (and present) view of changes in a simple UI
      • It requires a lot of manual work to extrapolate specific data about your overall system and its components
      • This is why K8s-native troubleshooting tools are a necessity for gaining the context and overview needed to troubleshoot effectively
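Since (per slide 5) most incidents trace back to system changes, the core of change tracking is just an auditable timeline you can query around an incident. A minimal sketch under assumed names (`record_change`, `changes_before`); a real tool keeps this in a database and feeds it from deploy pipelines and the K8s API.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory change log; illustrative only.
CHANGES = []

def record_change(service, kind, detail, at):
    """Record one deploy/config change with a timestamp."""
    CHANGES.append({"service": service, "kind": kind,
                    "detail": detail, "at": at})

def changes_before(incident_at, window_minutes=30):
    """Return changes inside the lookback window, newest first."""
    cutoff = incident_at - timedelta(minutes=window_minutes)
    recent = [c for c in CHANGES if cutoff <= c["at"] <= incident_at]
    return sorted(recent, key=lambda c: c["at"], reverse=True)

# Example: a deploy 10 minutes before the alert is the prime suspect;
# a config change 5 hours earlier falls outside the window.
t0 = datetime(2022, 5, 6, 12, 0)
record_change("checkout", "deploy", "image v1.4.2",
              at=t0 - timedelta(minutes=10))
record_change("billing", "config", "raised timeout",
              at=t0 - timedelta(hours=5))
suspects = changes_before(incident_at=t0)
print([c["service"] for c in suspects])  # -> ['checkout']
```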
  20. The TL;DR: a Prepare-for-Failure State of Mind
      Failure is inevitable, so make sure you remember and think of the following:
      • Nothing beats preparation!
      • A change of state will result in a change in the overall resilience of the system
      • If your team is afraid of failing, invest more in the right tools and the right culture
  21. Q&A

  22. Are You Feeling Lucky?!
      Our Komodor reps are waiting for you! Swing by Komodor’s booth to enter our live raffle tomorrow (28 April) at 2:15pm. Have a chat about K8s troubleshooting, claim your ticket, and stand a chance to take home one of TWO amazing prizes!