Anatomy of a Production Kubernetes Outage – Kubecon EU 2018

Dive into a production Kubernetes outage that Monzo experienced a few months ago, its causes and effects, and the architectural and operational lessons learned.

Oliver Beattie

May 02, 2018

Transcript

  5. 20.

    IMPACT: 1 hour, 21 mins of cluster downtime. Vast majority of payments succeeded throughout.
  6. 21.

    ROOT CAUSES: Bug in gRPC client library affecting etcd; incompatibility between Kubernetes + Linkerd
  7. 24.

    ROOT CAUSES: Bug in gRPC client library affecting etcd; incompatibility between Kubernetes + Linkerd; human error
  8. 29.

    “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”