Anatomy of a Production Kubernetes Outage – Kubecon EU 2018

Dive into a production Kubernetes outage that Monzo experienced a few months ago, its causes and effects, and the architectural and operational lessons learned.

Oliver Beattie

May 02, 2018

Transcript

  1. IMPACT: 1 hour, 21 mins of cluster downtime. Vast majority of payments succeeded throughout.
  2. ROOT CAUSES: Bug in gRPC client library affecting etcd. Incompatibility between Kubernetes + Linkerd.
  3. ROOT CAUSES: Bug in gRPC client library affecting etcd. Incompatibility between Kubernetes + Linkerd. Human error.
  4. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”