Anatomy of a Production Kubernetes Outage – KubeCon EU 2018

Dive into a production Kubernetes outage that Monzo experienced a few months ago, its causes and effects, and the architectural and operational lessons learned.

Oliver Beattie

May 02, 2018

Transcript

  1. Anatomy of a Production Kubernetes Outage
    Oliver Beattie
    Head of Engineering, Monzo Bank

  6. You've spent £35.50 today
    £10 at Tiger
    now
    MONZO
    Wednesday, 2 May
    5:10

  7. > 500 microservices
    Built on open source software

  8. Story of an outage

  9. CAST OF CHARACTERS
    Kubernetes
    etcd
    Linkerd
    Humans

  10. etcd upgrade
    ⏰ 2 WEEKS BEFORE THE OUTAGE

  11. Deployment of faulty service
    Scaled to zero replicas
    ⏰ 1 DAY BEFORE THE OUTAGE

  12. ⏰ START OF PARTIAL OUTAGE
    Ledger change deployed

  13. Ledger change rolled back
    ⏰ 2 MINS INTO THE OUTAGE

  14. Linkerd identified as unhealthy
    ⏰ 6 MINS INTO THE OUTAGE

  15. Begin restarting Linkerd pods
    ⏰ 16 MINS INTO THE OUTAGE

  16. New Linkerd pods cannot start
    Kubernetes apiserver restarted
    ⏰ 27 MINS INTO THE OUTAGE

  17. Finish restarting Linkerd pods
    ⏰ ESCALATED TO TOTAL OUTAGE
    1 HR 3 MINS INTO THE OUTAGE

  18. Linkerd NullPointerException observed on start up
    ⏰ 1 HR 17 MINS INTO THE OUTAGE

  19. Linkerd/k8s incompatibility found
    Empty services deleted
    ⏰ END OF OUTAGE
    1 HR 21 MINS

  20. IMPACT
    1 hour, 21 mins of cluster downtime
    Vast majority of payments succeeded throughout

  21. ROOT CAUSES
    Bug in gRPC client library affecting etcd
    Incompatibility between Kubernetes + Linkerd

  22. K8S < 1.6: "endpoints": []

  23. K8S < 1.6: "endpoints": []
    VS.
    K8S 1.6+: "endpoints": null

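The difference between the two payloads above matters to any client that assumes the field is always a list. Below is a minimal sketch, in Python for illustration only, of a parser that treats `null` and a missing key the same as `[]`. The key name "endpoints" is copied from the slide and may be abbreviated relative to the real API object; the function name `endpoint_list` is hypothetical, and this is not Linkerd's actual code.

```python
import json


def endpoint_list(payload: str) -> list:
    """Return the endpoints in a JSON payload as a list.

    Treats JSON null and a missing "endpoints" key like an empty list,
    so callers can iterate without a null check.
    """
    doc = json.loads(payload)
    # JSON null parses to None; `or []` maps None onto an empty list.
    return doc.get("endpoints") or []


# Both response styles shown on the slide normalise to the same value.
old_style = '{"endpoints": []}'      # style labelled K8S < 1.6
new_style = '{"endpoints": null}'    # style labelled K8S 1.6+
assert endpoint_list(old_style) == endpoint_list(new_style) == []
```

Normalising at the parsing boundary keeps the rest of the client free of null checks, which is one way to avoid the kind of NullPointerException described earlier in the timeline.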

  24. ROOT CAUSES
    Bug in gRPC client library affecting etcd
    Incompatibility between Kubernetes + Linkerd
    Human error

  25. LESSONS
    Defence in depth

  26. Mastercard Banknet
    Mastercard processor
    AWS
    PHYSICAL DATA CENTRES
    Mastercard proxy
    Ledger

  27. Mastercard Banknet
    Mastercard processor
    AWS
    PHYSICAL DATA CENTRES
    Mastercard proxy
    Ledger

  28. LESSONS
    Chaos engineering

  29. “Chaos Engineering is the discipline of
    experimenting on a distributed system in
    order to build confidence in the system’s
    capability to withstand turbulent
    conditions in production.”

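As a rough illustration of the definition quoted above, the sketch below injects faults into a dependency and measures how often a caller with retries still succeeds. Everything in it is hypothetical (the names `FlakyDependency`, `call_with_retries`, `run_experiment`, the failure rate, and the retry count); it runs in-process rather than against production, and it does not describe Monzo's own tooling.

```python
import random


class FlakyDependency:
    """Wraps a callable and fails a fraction of calls to simulate turbulence."""

    def __init__(self, func, failure_rate, seed=0):
        self._func = func
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so experiments are repeatable

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def call_with_retries(func, attempts=5):
    """Call func up to `attempts` times, re-raising the last error if all fail."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as err:
            last_error = err
    raise last_error


def run_experiment(trials=1000, failure_rate=0.3):
    """Return the fraction of trials in which the caller withstood the faults."""
    flaky = FlakyDependency(lambda: "ok", failure_rate, seed=42)
    successes = 0
    for _ in range(trials):
        try:
            if call_with_retries(flaky) == "ok":
                successes += 1
        except ConnectionError:
            pass  # the caller did not withstand this trial
    return successes / trials


if __name__ == "__main__":
    print(f"success rate under injected faults: {run_experiment():.3f}")
```

The hypothesis under test here is that retries mask a 30% failure rate; raising `failure_rate` or lowering `attempts` shows where that confidence stops being justified.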

  30. LESSONS
    More monitoring, more visible

  31. LESSONS
    Be transparent; embrace the community

  34. monzo.com/careers

  35. @obeattie
