Keeping a modern bank online

Monzo is a bank serving over 2 million customers in the UK. Unlike other banks, Monzo is 100% online, so it's critically important that our systems are too. Hear what it takes to keep the bank online and continue to provide world-class service levels to its customers.

Christopher Evans

June 12, 2019

Transcript

  1. Keeping a modern bank online ⚡ Chris Evans, Platform / On-call Lead @evnsio
  2. None
  3. None
  4. None
  5. None
  6. @evnsio

  7. So how do you keep a modern bank online?

  8. Monitoring and Alerting, On-call, Incident Response @evnsio

  9. Monitoring and Alerting: Making sure we know what’s going on

  10. Monzo Platform @evnsio

  11. Monitor liberally; alert judiciously: Monetary Costs, Human Costs @evnsio

  12. Monitoring liberally: Physical DCs, Cloud Infrastructure, Kubernetes, Monzo Services, Social Media, Customer Queries, Alerts, Prometheus Monitoring @evnsio
  13. ~8000 unique scrape targets @evnsio

  14. 42 million active time series @evnsio

  15. 1.7 million samples per second @evnsio
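The three figures on slides 13 to 15 can be related with a little arithmetic. The sketch below is a back-of-the-envelope calculation only: it assumes steady-state ingestion and treats the numbers as averages across the whole monitoring estate, which the slides do not state. The constant and function names are illustrative.

```python
# Figures as quoted on the slides (approximate).
SCRAPE_TARGETS = 8_000
ACTIVE_SERIES = 42_000_000
SAMPLES_PER_SECOND = 1_700_000


def implied_sample_interval(active_series: int, samples_per_second: int) -> float:
    """Average seconds between two samples of the same series.

    Assumes every active series is sampled at a steady rate, which is a
    simplification: real scrape intervals can differ per job.
    """
    return active_series / samples_per_second


def series_per_target(active_series: int, targets: int) -> float:
    """Average number of active time series exposed by one scrape target."""
    return active_series / targets


interval = implied_sample_interval(ACTIVE_SERIES, SAMPLES_PER_SECOND)
per_target = series_per_target(ACTIVE_SERIES, SCRAPE_TARGETS)

print(f"~{interval:.1f} s between samples of a series")  # ~24.7 s
print(f"~{per_target:.0f} series per scrape target")     # ~5250
```

Under these assumptions, each series is sampled roughly every 25 seconds and each target exposes about five thousand series on average.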

  16. Shared Dashboards @evnsio

  17. Integrated Metrics @evnsio

  18. Aim for alerts that are sensitive and specific @evnsio

  19. Alerts. Car Alarm: ✅ Sensitive ❌ Specific. These go off all the time. When they go off, people typically ignore them. Building Alarm: ✅ Sensitive ✅ Specific. When there’s an issue, they go off quickly. When they go off, people pay attention and leave the building. @evnsio
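One way to make "sensitive" and "specific" measurable is to borrow the definitions used for diagnostic tests, which is an interpretation of the slide rather than something it spells out. The sketch below computes both from counts of alert outcomes; the class name and all counts are hypothetical and chosen only to mirror the car alarm and building alarm comparison.

```python
from dataclasses import dataclass


@dataclass
class AlertOutcomes:
    """Counts of what happened over some review period (hypothetical data)."""

    true_positives: int   # alert fired and there was a real issue
    false_negatives: int  # real issue, but the alert stayed silent
    true_negatives: int   # no issue and the alert stayed silent
    false_positives: int  # alert fired, but there was no issue

    def sensitivity(self) -> float:
        """Share of real issues that the alert caught."""
        return self.true_positives / (self.true_positives + self.false_negatives)

    def specificity(self) -> float:
        """Share of quiet periods in which the alert correctly stayed silent."""
        return self.true_negatives / (self.true_negatives + self.false_positives)


# Illustrative numbers only: both alarms catch 9 of 10 real issues,
# but the car alarm also fires in half of the periods with no issue.
car_alarm = AlertOutcomes(true_positives=9, false_negatives=1,
                          true_negatives=50, false_positives=50)
building_alarm = AlertOutcomes(true_positives=9, false_negatives=1,
                               true_negatives=99, false_positives=1)

print(car_alarm.sensitivity(), car_alarm.specificity())            # 0.9 0.5
print(building_alarm.sensitivity(), building_alarm.specificity())  # 0.9 0.99
```

With these made-up counts the two alarms are equally sensitive; the difference that makes people ignore the car alarm shows up entirely in specificity.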
  20. (Some of!) Our Alert Issues: Decay, Ownership, Classification, Routing, Globals @evnsio
  21. Alerts in Version Control: Fetch alerts + reload @evnsio
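The "fetch alerts + reload" step could look something like the sketch below. The slide does not describe the actual mechanism, so this is only one plausible shape: it assumes the alert rules live in a local git checkout that Prometheus reads, and that Prometheus was started with `--web.enable-lifecycle` so that its `/-/reload` endpoint accepts POST requests. The function names and the `run`/`send` parameters are illustrative; the parameters exist so the logic can be exercised without a real repository or server.

```python
import subprocess
import urllib.request


def fetch_alerts(repo_dir: str, run=subprocess.run) -> None:
    """Fast-forward the local checkout that holds the alert rule files."""
    run(["git", "-C", repo_dir, "pull", "--ff-only"], check=True)


def build_reload_request(prometheus_url: str) -> urllib.request.Request:
    """Build the POST request that asks Prometheus to re-read its config and rules."""
    return urllib.request.Request(prometheus_url.rstrip("/") + "/-/reload", method="POST")


def sync_alerts(repo_dir: str, prometheus_url: str,
                run=subprocess.run, send=urllib.request.urlopen) -> int:
    """Fetch the latest rules, then trigger a reload; returns the HTTP status."""
    fetch_alerts(repo_dir, run=run)
    with send(build_reload_request(prometheus_url)) as response:
        return response.status
```

A periodic job might call `sync_alerts("/etc/prometheus/alerts", "http://localhost:9090")`, where both the path and the URL are placeholders. Pulling with `--ff-only` means a diverged checkout fails loudly instead of silently merging.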

  22. Goal: Alert Zero @evnsio

  23. On-call: Building the team to support our systems

  24. @evnsio

  25. @evnsio

  26. Early Days @evnsio

  27. @evnsio

  28. Expanding the Pool @evnsio

  29. @evnsio

  30. Introducing Specialists @evnsio

  31. Shadow rotations to encourage learning. Runbooks to document the undocumented. People on-call when it makes sense for you. @evnsio
  32. Incident Response: Restoring service as fast as possible

  33. response.pagerduty.com @evnsio

  34. Incident Response Workflow

  35. Monzo Incident ⚡

  36. Bring as much as reasonably practical into the conversation @evnsio

  37. Make it so easy to do the right thing that nobody would have reason to do otherwise @evnsio
  38. @evnsio

  39. @evnsio

  40. @evnsio

  41. The Headline Post @evnsio

  42. The Incident Doc @evnsio

  43. The Comms Channel @evnsio

  44. The Comms Channel @evnsio

  45. @incident <command> ... @evnsio
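The `@incident <command> ...` pattern on this slide suggests a chat bot that parses a mention, a command word, and free-form arguments. The sketch below shows one minimal way such parsing and dispatch could work. It is not taken from the tool shown in the talk: the command names `lead` and `severity`, the reply strings, and every function name here are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

MENTION = "@incident"

# Registry mapping a command word to the function that handles it.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def parse_command(message: str) -> Optional[Tuple[str, str]]:
    """Split '@incident <command> ...' into (command, arguments).

    Returns None when the message is not addressed to the bot.
    """
    parts = message.strip().split(maxsplit=2)
    if len(parts) < 2 or parts[0] != MENTION:
        return None
    command = parts[1].lower()
    arguments = parts[2] if len(parts) > 2 else ""
    return command, arguments


def handler(name: str):
    """Decorator that registers a function as the handler for one command."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[name] = fn
        return fn
    return register


@handler("lead")  # hypothetical command
def set_lead(arguments: str) -> str:
    return f"Incident lead set to {arguments}"


@handler("severity")  # hypothetical command
def set_severity(arguments: str) -> str:
    return f"Severity set to {arguments}"


def dispatch(message: str) -> Optional[str]:
    """Route a chat message to its handler and return the bot's reply, if any."""
    parsed = parse_command(message)
    if parsed is None:
        return None
    command, arguments = parsed
    fn = HANDLERS.get(command)
    if fn is None:
        return f"Unknown command '{command}'"
    return fn(arguments)
```

For example, `dispatch("@incident severity major")` returns `"Severity set to major"`, while a message that does not start with the mention returns `None` and is ignored. Keeping handlers in a registry means a new command is one decorated function, which fits the earlier point about making the right thing easy to do.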

  46. @evnsio

  47. @evnsio

  48. @evnsio

  49. @evnsio

  50. @evnsio

  51. @evnsio

  52. @evnsio

  53. @evnsio

  54. Integrating with PagerDuty @evnsio

  55. None
  56. ...and that’s how we keep Monzo online. @evnsio github.com/monzo/response