Chaos & Order: Breaking and Fixing Things in K8s Environments

with Gremlin & Komodor
Komodor <> Epsagon | May 2021

Epic | February 2021
Why is it hard to troubleshoot?
Issues happen on an hourly basis, and it is almost impossible to understand what causes them. 85% of incidents can be traced to system changes:
• Blind spot: changes are unaudited or hidden
• Fragmented data: events are scattered across hundreds of different tools
• Butterfly effect: distributed systems make it harder to understand the effect of a single change

Introduction
Troubleshooting Today
• Explore relevant exceptions
• Understand who changed what
• Check the CI pipeline
• Check pod status
• Check the current alert
• Review the alert’s metrics
• Check account activity
• Review the latest code changes

Komodor tracks changes across tools & teams, understands their ripple effect, and gives users the context they need to troubleshoot efficiently.
• We track down cross-service cascading failures
• We are service-centric, showing the full activity timeline per service
• We help you find the root cause across all systems

How does it work?
• Collect cross-system events
• Provide a complete overview of all services and their relations in a single place
• For each service, we build a comprehensive timeline: deploys, config changes, alerts and more
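The per-service timeline idea can be illustrated with a minimal Python sketch. This is not Komodor's implementation: the `Event` shape, the source labels, and the sample data are all hypothetical, chosen only to show how events collected from different tools might be grouped by service and ordered in time.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One change or alert reported by some tool (hypothetical shape)."""
    service: str
    source: str          # e.g. "deploy", "config", "alert"
    timestamp: datetime
    summary: str


def build_timelines(events):
    """Group events by service and sort each group chronologically."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event.service].append(event)
    for service_events in timelines.values():
        service_events.sort(key=lambda e: e.timestamp)
    return dict(timelines)


# Made-up events arriving out of order from different tools.
events = [
    Event("checkout", "alert", datetime(2021, 5, 1, 12, 5), "High error rate"),
    Event("checkout", "deploy", datetime(2021, 5, 1, 12, 0), "Rolled out new image"),
    Event("payments", "config", datetime(2021, 5, 1, 11, 55), "ConfigMap updated"),
]

timelines = build_timelines(events)
for event in timelines["checkout"]:
    print(event.timestamp.isoformat(), event.source, event.summary)
```

Sorting by timestamp puts the deploy ahead of the alert that followed it, which is the kind of ordering a troubleshooter looks for first.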

Installation and integration
• Komodor takes about 5 minutes to install.
• K8s agent documentation can be found here: https://github.com/komodorio/helm-charts/tree/master/charts/k8s-watcher
• Komodor integrates with all of your favorite DevOps tools

Service Explorer
We collect data from Kubernetes and enrich it with observability, code repository, CI/CD and alerting tools. The data is organized in a comprehensive way, ready for a drill-down from the big picture to the details.

Related Services
Troubleshooting microservices requires a deep understanding of connections and dependencies. In one click, you can add more services to the service view, so they are correlated on one timeline.

Events View
The ‘Events’ feature offers a panoramic view of all occurrences across your entire K8s environment. With this system-wide visibility, Komodor Events makes it easier to troubleshoot elusive issues, particularly those that aren’t traced to any one specific service or cluster.

Pod Status and Logs
‘Pods Status and Logs’ enables you to quickly drill down into the pods of an unhealthy service. This offers quick access to all of the pod-level data you’ll need for troubleshooting, including:
• Overview of all pods running the service
• Pod details, similar to what you would get with kubectl describe
• Live view of all events
• Pod containers’ logs

Workflows
OOMKilled
1. Detect Kubernetes issues (e.g., health events, schedulable resources, etc.)
2. Correlate the information with data from external sources (e.g., cloud providers, source code and feature flags)
3. Run sequences of checks that quickly pinpoint the exact root cause
4. Use all of the information acquired to deliver made-to-measure instructions for remediation
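As a rough illustration of the detect and remediate steps for an OOMKilled case, the sketch below inspects a pod object shaped like the JSON returned by `kubectl get pod <name> -o json`. It is not Komodor's workflow engine: the function names, the sample pod, and the wording of the hint are assumptions made for this example. The fields it reads (`status.containerStatuses[].state` and `lastState`, `terminated.reason`, `spec.containers[].resources.limits.memory`) are standard Kubernetes pod fields.

```python
def find_oomkilled_containers(pod):
    """Return names of containers whose current or last state is an OOMKilled termination."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    killed = []
    for status in statuses:
        for state_key in ("state", "lastState"):
            terminated = status.get(state_key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                killed.append(status["name"])
                break  # count each container once
    return killed


def remediation_hint(pod, container_name):
    """Build a simple, hypothetical hint from the container's memory limit."""
    for container in pod.get("spec", {}).get("containers", []):
        if container["name"] == container_name:
            limit = container.get("resources", {}).get("limits", {}).get("memory")
            if limit is None:
                return f"{container_name}: no memory limit set; check node memory pressure"
            return f"{container_name}: hit its {limit} memory limit; review usage or raise the limit"
    return f"{container_name}: container not found in pod spec"


# Hypothetical pod, trimmed to the fields used above.
sample_pod = {
    "spec": {
        "containers": [
            {"name": "api", "resources": {"limits": {"memory": "256Mi"}}},
            {"name": "sidecar"},
        ]
    },
    "status": {
        "containerStatuses": [
            {
                "name": "api",
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {"name": "sidecar", "state": {"running": {}}, "lastState": {}},
        ]
    },
}

for name in find_oomkilled_containers(sample_pod):
    print(remediation_hint(sample_pod, name))
```

A real workflow would also correlate the result with recent deploys and external data, as the steps above describe; this sketch covers only the Kubernetes side.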

Failures are inherent to complex systems and will cause downtime unless tested for.

What is Chaos Engineering?

Thoughtful, planned experiments designed to reveal weakness in our systems.

Start Small & Increase the Blast Radius

Development → Staging → Production
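One way to picture "start small and increase the blast radius" is a loop that widens the set of targets step by step and halts as soon as a health check fails. The Python sketch below is an illustration only and does not use Gremlin's API: `inject`, `is_healthy`, the host names, and the fractions are hypothetical stand-ins supplied by the caller.

```python
import random

STAGES = ["development", "staging", "production"]  # order taken from the slide


def pick_targets(hosts, fraction, rng):
    """Choose a random subset of hosts; always at least one, never more than all."""
    count = min(len(hosts), max(1, round(len(hosts) * fraction)))
    return rng.sample(hosts, count)


def run_escalating_experiment(hosts_by_stage, fractions, inject, is_healthy, rng=None):
    """Widen the blast radius step by step; stop at the first unhealthy result.

    Returns (completed_steps, halted_at), where halted_at is None if every
    step passed, or the (stage, fraction) pair where the experiment stopped.
    """
    rng = rng or random.Random()
    completed = []
    for stage in STAGES:
        for fraction in fractions:
            targets = pick_targets(hosts_by_stage[stage], fraction, rng)
            inject(stage, targets)           # hypothetical fault injection hook
            if not is_healthy(stage):        # hypothetical health check hook
                return completed, (stage, fraction)
            completed.append((stage, fraction))
    return completed, None


# Stub hooks for demonstration: production is treated as failing its check.
hosts_by_stage = {stage: [f"{stage}-{i}" for i in range(10)] for stage in STAGES}
injected = []


def inject(stage, targets):
    injected.append((stage, len(targets)))


def is_healthy(stage):
    return stage != "production"


completed, halted_at = run_escalating_experiment(
    hosts_by_stage, [0.1, 0.5, 1.0], inject, is_healthy, random.Random(0)
)
print("completed:", completed)
print("halted at:", halted_at)
```

Halting on the first failed check keeps the impact limited to the smallest radius that exposed the weakness.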

Chaos Engineering
01 Resource failure
02 Service failure
03 Dependency failure
04 Application failure
05 Continuous Chaos
