Anatomy of a Production Kubernetes Outage – Kubecon EU 2018

Dive into a production Kubernetes outage that Monzo experienced a few months ago, its causes and effects, and the architectural and operational lessons learned.

Oliver Beattie

May 02, 2018

Transcript

  1. Anatomy of a Production Kubernetes Outage. Oliver Beattie, Head of Engineering, Monzo Bank
  2. None
  3. None
  4. None
  5. None
  6. MONZO (now): You've spent £35.50 today. £10 at Tiger. Wednesday, 2 May, 5:10
  7. > 500 microservices. Built on open source software

  8. Story of an outage

  9. CAST OF CHARACTERS: Kubernetes, etcd, Linkerd, Humans

  10. etcd upgrade ⏰ 2 WEEKS BEFORE THE OUTAGE

  11. Deployment of faulty service; scaled to zero replicas ⏰ 1 DAY BEFORE THE OUTAGE
  12. Ledger change deployed ⏰ START OF PARTIAL OUTAGE

  13. Ledger change rolled back ⏰ 2 MINS INTO THE OUTAGE

  14. Linkerd identified as unhealthy ⏰ 6 MINS INTO THE OUTAGE

  15. Begin restarting Linkerd pods ⏰ 16 MINS INTO THE OUTAGE

  16. New Linkerd pods cannot start; Kubernetes apiserver restarted ⏰ 27 MINS INTO THE OUTAGE
  17. Finish restarting Linkerd pods ⏰ 1 HR 3 MINS INTO THE OUTAGE (ESCALATED TO TOTAL OUTAGE)
  18. Linkerd NullPointerException observed on startup ⏰ 1 HR 17 MINS INTO THE OUTAGE
  19. Linkerd/k8s incompatibility found; empty services deleted ⏰ END OF OUTAGE, 1 HR 21 MINS
  20. IMPACT: 1 hour, 21 mins of cluster downtime. Vast majority of payments succeeded throughout
  21. ROOT CAUSES: Bug in gRPC client library affecting etcd; incompatibility between Kubernetes + Linkerd
  22. "endpoints": [] (K8S < 1.6)

  23. "endpoints": [] (K8S < 1.6) VS. "endpoints": null (K8S 1.6+)

  24. ROOT CAUSES: Bug in gRPC client library affecting etcd; incompatibility between Kubernetes + Linkerd; human error
  25. LESSONS: Defence in depth

  26. [Diagram] Mastercard Banknet; Mastercard processor; Mastercard proxy; Ledger (labels: AWS, PHYSICAL DATA CENTRES)
  27. [Diagram] Mastercard Banknet; Mastercard processor; Mastercard proxy; Ledger (labels: AWS, PHYSICAL DATA CENTRES)
  28. LESSONS: Chaos engineering

  29. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
  30. LESSONS: More monitoring, more visible

  31. LESSONS: Be transparent; embrace the community

  32. None
  33. None
  34. monzo.com/careers

  35. @obeattie
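
Slides 22 and 23 contrast `"endpoints": []` (K8S < 1.6) with `"endpoints": null` (K8S 1.6+), and slide 18 records a NullPointerException in Linkerd on startup. The following is a minimal sketch of that class of bug, not Linkerd's actual code: the function names `naive_count` and `endpoint_addresses` are hypothetical, the JSON documents are reduced to the single key shown on the slides, and Python's `TypeError` stands in for the JVM's NullPointerException.

```python
import json

# Payloads reduced to the one key shown on the slides (an assumption;
# real API responses carry many more fields).
OLD_STYLE = '{"endpoints": []}'    # K8S < 1.6, per slide 23
NEW_STYLE = '{"endpoints": null}'  # K8S 1.6+, per slide 23


def naive_count(payload: str) -> int:
    """Hypothetical consumer that assumes "endpoints" is always a list."""
    doc = json.loads(payload)
    # len(None) raises TypeError, the Python analogue of a null dereference.
    return len(doc["endpoints"])


def endpoint_addresses(payload: str) -> list:
    """Hypothetical defensive consumer: null or missing means no endpoints."""
    doc = json.loads(payload)
    endpoints = doc.get("endpoints")
    if endpoints is None:
        return []
    return list(endpoints)


if __name__ == "__main__":
    print(endpoint_addresses(OLD_STYLE))
    print(endpoint_addresses(NEW_STYLE))
    try:
        naive_count(NEW_STYLE)
    except TypeError as exc:
        print("naive consumer failed:", exc)
```

The defensive version treats an empty list, an explicit null, and a missing key identically, so a change in how the producer serialises "no endpoints" does not reach the consumer as a crash.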