Chaos & Order: Breaking and Fixing Things in K8s Environments

Komodor
April 24, 2022

You can’t build a CI/CD pipeline and support fast-paced development cycles without considering continuous reliability. On the one hand, this means being rehearsed and prepared for every scenario. On the other, it calls for a contingency plan for when something inevitably goes wrong.

Join this live event and see how DevOps tools can help you plan for the best and prepare for the worst, as Julie from Gremlin injects chaos into the Bank of Anthos’ system and Rona from Komodor troubleshoots things back into order.

Transcript

  1. Chaos & Order: Breaking and Fixing Things in K8s Environments

    With Gremlin & Komodor
  2. Why is it hard to troubleshoot?

    Issues happen on an hourly basis and it’s almost impossible to understand what causes them. 85% of incidents can be traced to system changes:
    • Blind spot: changes are unaudited or hidden
    • Fragmented data: events are scattered between hundreds of different tools
    • Butterfly effect: distributed systems make it harder to understand the effect of a single change
  3. Troubleshooting Today

    • Understand who changed what
    • Check the CI pipeline
    • Check pods status
    • Check current alert
    • Explore relevant exceptions
    • Review the alert’s metrics
    • Check account activity
    • Review the latest code changes
  4. Introduction

    Komodor tracks changes across tools & teams, understands their ripple effect and gives users the context they need to troubleshoot efficiently.
    • We track down cross-services cascading failures
    • We are service-centric, showing the full activity timeline per service
    • We help you find the root cause across all systems
  5. How does it work?

    • Collect cross-system events
    • Provide a complete overview of all services and their relations in a single place
    • For each service, we build a comprehensive timeline: deploys, config changes, alerts and more
  6. Installation and integration

    • Komodor takes about 5 minutes to install.
    • K8s agent documentation can be found here: https://github.com/komodorio/helm-charts/tree/master/charts/k8s-watcher
    • Komodor integrates with all of your favorite DevOps tools
  7. Service Explorer

    We collect data from Kubernetes and enrich it with observability, code repository, CI/CD and alerting tools. The data is organized in a comprehensive way, ready for a drill down from the big picture to its details.
  8. Related Services

    Troubleshooting microservices requires a deep understanding of connections and dependencies. In one click, you can add more services to the service view, so they are correlated on one timeline.
  9. Events View

    The ‘Events’ feature offers a panoramic view of all occurrences across your entire K8s environment. With this system-wide visibility, Komodor Events makes it easier to troubleshoot elusive issues, particularly those that aren’t traced to any one specific service or cluster.
  10. Pod Status and Logs

    ‘Pod Status and Logs’ enables you to quickly drill down into the pods of an unhealthy service. This offers quick access to all of the pod-level data you’ll need for troubleshooting, including:
    • Overview of all pods running the service
    • Pod details, similar to what you would get with kubectl describe
    • Live view of all events
    • Pod containers’ logs
  11. Workflows: OOMKilled

    1. Detect Kubernetes issues (e.g., health events, schedulable resources, etc.)
    2. Correlate the information with data from external sources (e.g., cloud providers, source code and feature flags)
    3. Run sequences of checks that quickly pinpoint the exact root cause
    4. Use all of the information acquired to deliver made-to-measure instructions for remediation
  12. Chaos Engineering

    01 Resource failure
    02 Service failure
    03 Dependency failure
    04 Application failure
    05 Continuous Chaos
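
The detect-and-correlate steps described for the OOMKilled workflow on slide 11 can be illustrated with a small Python sketch. It is not Komodor's implementation and does not call any Komodor or Gremlin API. It assumes a pod object shaped like the JSON returned by `kubectl get pod -o json` (the `status.containerStatuses[].lastState.terminated` and `spec.containers[].resources.limits` fields), plus a hypothetical list of change events with `at` and `summary` keys standing in for the per-service timeline from slide 5. The function names are illustrative only.

```python
from __future__ import annotations

from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp such as '2022-04-24T10:15:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_oom_kills(pod: dict) -> list[dict]:
    """Detect: list containers whose last termination reason was OOMKilled."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            findings.append({
                "pod": pod["metadata"]["name"],
                "container": status["name"],
                "finished_at": parse_ts(terminated["finishedAt"]),
            })
    return findings


def latest_change_before(changes: list[dict], moment: datetime) -> dict | None:
    """Correlate: most recent change event at or before the given moment."""
    earlier = [c for c in changes if parse_ts(c["at"]) <= moment]
    return max(earlier, key=lambda c: parse_ts(c["at"]), default=None)


def memory_limit(pod: dict, container_name: str) -> str | None:
    """Check: look up the memory limit configured for a container, if any."""
    for container in pod.get("spec", {}).get("containers", []):
        if container["name"] == container_name:
            return container.get("resources", {}).get("limits", {}).get("memory")
    return None


def diagnose(pod: dict, changes: list[dict]) -> list[str]:
    """Combine the steps into one human-readable message per OOMKilled container."""
    messages = []
    for finding in find_oom_kills(pod):
        limit = memory_limit(pod, finding["container"])
        change = latest_change_before(changes, finding["finished_at"])
        message = f"{finding['pod']}/{finding['container']} was OOMKilled"
        message += f" (memory limit: {limit})" if limit else " (no memory limit set)"
        if change:
            message += f"; last change before it: {change['summary']}"
        messages.append(message)
    return messages


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample_pod = {
        "metadata": {"name": "frontend-abc"},
        "spec": {"containers": [
            {"name": "server", "resources": {"limits": {"memory": "128Mi"}}},
        ]},
        "status": {"containerStatuses": [
            {"name": "server", "lastState": {"terminated": {
                "reason": "OOMKilled", "exitCode": 137,
                "finishedAt": "2022-04-24T10:15:00Z"}}},
        ]},
    }
    sample_changes = [
        {"at": "2022-04-24T10:05:00Z", "summary": "config change: memory limit lowered"},
    ]
    for line in diagnose(sample_pod, sample_changes):
        print(line)
```

Keeping detection, correlation and the limit check as separate functions mirrors the numbered steps on the slide; a real workflow would draw change events from CI/CD, source control and alerting integrations rather than a hand-written list.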