QCon London 2017: Avoiding Alerts Overload From Microservices

Microservices can be a great way to work: the services are simple, you can use the right technology for the job, and deployments become smaller and less risky. Unfortunately, other things become more complex. You probably took some time to design a deployment pipeline and set up self-service provisioning, for example. But did the rest of your thinking about what “done” means catch up? Are you still setting up alerts, run books, and monitoring for each microservice as though it was a monolith?

Two years ago, a team at the FT started out building a microservices-based system from scratch. Their initial naive approach to monitoring meant that a single underlying network issue could result in 20 people each receiving 10,000 alert emails overnight. At that volume you can’t pick out the important stuff; in fact, your inbox is unusable unless you filter everything away where you’ll never see it. Meanwhile, you have information radiators all over the place, but there’s always something flashing or showing the wrong colour, and you can spend the whole day moving from one attention-grabbing screen to another.

That team now has over 150 microservices in production. So how did they get themselves out of that mess and regain control of their inboxes and their time? First, you have to work out what’s important, and then you have to focus ruthlessly on that. You need to see only the things you need to act on, presented in a way that tells you exactly what to do. Sarah shares how her team regained control and offers some tips and tricks.

Sarah Wells

March 07, 2017

Transcript

  1. Avoiding alerts overload from microservices Sarah Wells Principal Engineer, Financial Times @sarahjwells

  2. None
  3. None
  4. None
  5. @sarahjwells Knowing when there’s a problem isn’t enough

  6. You only want an alert when you need to take action

  7. @sarahjwells Hello

  8. None
  9. None
  10. 1

  11. 1 2

  12. 1 2 3

  13. 1 2 3 4

  14. @sarahjwells Monitoring this system…

  15. @sarahjwells Microservices make it worse

  16. “microservices (n,pl): an efficient device for transforming business problems into distributed transaction problems” @drsnooks

  17. @sarahjwells The services *themselves* are simple…

  18. @sarahjwells There’s a lot of complexity around them

  19. @sarahjwells Why do they make monitoring harder?

  20. @sarahjwells You have a lot more services

  21. @sarahjwells 99 functional microservices 350 running instances

  22. @sarahjwells 52 non functional services 218 running instances

  23. @sarahjwells That’s 568 separate services

  24. @sarahjwells If we checked each service every minute…

  25. @sarahjwells 817,920 checks per day

  26. @sarahjwells What about system checks?

  27. @sarahjwells 16,358,400 checks per day

  28. @sarahjwells “One-in-a-million” issues would hit us 16 times every day

  29. @sarahjwells Running containers on shared VMs reduces this to 92,160 system checks per day

  30. @sarahjwells For a total of 910,080 checks per day
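
The arithmetic behind those figures: 568 services checked once a minute is 568 × 1,440 minutes = 817,920 checks per day. The 16,358,400 figure is twenty times that (817,920 × 20), which suggests roughly twenty system-level checks for every service-level one. Running containers on shared VMs means system-level checks apply per VM rather than per instance, cutting them to 92,160 per day, and 817,920 + 92,160 = 910,080 checks per day overall.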

  31. @sarahjwells It’s a distributed system

  32. @sarahjwells Services are not independent

  33. None
  34. None
  35. None
  36. None
  37. http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

  38. @sarahjwells You have to change how you think about monitoring

  39. How can you make it better?

  40. @sarahjwells 1. Build a system you can support

  41. @sarahjwells The basic tools you need

  42. @sarahjwells Log aggregation

  43. None
  44. @sarahjwells Logs go missing or get delayed more now

  45. @sarahjwells Which means log based alerts may miss stuff

  46. @sarahjwells Monitoring

  47. None
  48. @sarahjwells Limitations of our nagios integration…

  49. @sarahjwells No ‘service-level’ view

  50. @sarahjwells Default checks included things we couldn’t fix

  51. @sarahjwells A new approach for our container stack

  52. @sarahjwells We care about each service

  53. None
  54. @sarahjwells We care about each VM

  55. None
  56. @sarahjwells We care about unhealthy instances

  57. @sarahjwells Monitoring needs aggregating somehow

  58. @sarahjwells SAWS

  59. Built by Silvano Dossan See our Engine room blog: http://bit.ly/1GATHLy

  60. @sarahjwells "I imagine most people do exactly what I do

    - create a google filter to send all Nagios emails straight to the bin"
  61. @sarahjwells "Our screens have a viewing angle of about 10

    degrees"
  62. @sarahjwells "It never seems to show the page I want"

  63. @sarahjwells Code at: https://github.com/muce/SAWS

  64. @sarahjwells Dashing

  65. None
  66. None
  67. @sarahjwells Graphing of metrics

  68. None
  69. None
  70. None
  71. None
  72. https://www.flickr.com/photos/davidmasters/2564786205/

  73. @sarahjwells The things that make those tools WORK

  74. @sarahjwells Effective log aggregation needs a way to find all related logs

  75. Transaction ids tie all microservices together

  76. @sarahjwells Make it easy for any language you use
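
As a sketch of what “make it easy” can look like in practice (illustrative only, not the FT’s actual code): a small servlet filter that pulls a transaction id off the incoming request, generates one if it is missing, and puts it into the logging context so every log line carries the id that ties related microservice calls together. The header name X-Request-Id, the MDC key and the class name are assumptions for this example.

    // Illustrative only: ensure every request has a transaction id and expose it
    // to the logging framework via SLF4J's MDC, so aggregated logs can be joined up.
    import java.io.IOException;
    import java.util.UUID;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;

    import org.slf4j.MDC;

    public class TransactionIdFilter implements Filter {

        // Header and MDC key names are assumptions for this sketch.
        private static final String HEADER = "X-Request-Id";
        private static final String MDC_KEY = "transaction_id";

        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            String transactionId = ((HttpServletRequest) request).getHeader(HEADER);
            if (transactionId == null || transactionId.isEmpty()) {
                transactionId = "tid_" + UUID.randomUUID();   // generate one at the edge if absent
            }
            MDC.put(MDC_KEY, transactionId);                   // logging pattern can include %X{transaction_id}
            try {
                chain.doFilter(request, response);
            } finally {
                MDC.remove(MDC_KEY);                           // don't leak ids across pooled threads
            }
        }

        @Override public void init(FilterConfig config) { }
        @Override public void destroy() { }
    }

The same id would then be forwarded as a header on any outbound calls the service makes, so that one id follows a piece of work through every service it touches.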

  77. @sarahjwells

  78. @sarahjwells Services need to report on their own health

  79. The FT healthcheck standard GET http://{service}/__health

  80. The FT healthcheck standard GET http://{service}/__health returns 200 if the service can run the healthcheck

  81. The FT healthcheck standard GET http://{service}/__health returns 200 if the service can run the healthcheck each check will return "ok": true or "ok": false
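
A rough illustration of what a response from GET http://{service}/__health might look like. The slides only pin down the endpoint, the 200 status and the per-check "ok" field; the surrounding shape shown here (name, severity, businessImpact, panicGuide and so on) is an assumption about the kind of metadata such a check might carry, and the service and check names are made up.

    {
      "schemaVersion": 1,
      "name": "content-ingester",
      "description": "Ingests published content into the platform",
      "checks": [
        {
          "name": "Can connect to message queue",
          "ok": true,
          "severity": 2,
          "businessImpact": "New content will not be ingested",
          "technicalSummary": "Consumer could not reach the queue proxy",
          "panicGuide": "https://runbooks.example.internal/content-ingester",
          "checkOutput": "OK"
        }
      ]
    }
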
  82. None
  83. None
  84. @sarahjwells Knowing about problems before your clients do

  85. Synthetic requests tell you about problems early https://www.flickr.com/photos/jted/5448635109
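
A minimal sketch of what a synthetic request might look like: on a schedule, request a known piece of content and flag a problem if the response is missing, wrong, or slow. The endpoint, thresholds and class name are placeholders for illustration, not FT code.

    // Illustrative synthetic check: probe a known endpoint every minute and
    // raise a problem if the response is not a fast 200.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SyntheticContentCheck {

        // Made-up URL for a known piece of test content.
        private static final URI PROBE = URI.create("https://api.example.internal/content/known-test-uuid");

        public static void main(String[] args) {
            HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

            scheduler.scheduleAtFixedRate(() -> {
                long start = System.nanoTime();
                try {
                    HttpRequest request = HttpRequest.newBuilder(PROBE).timeout(Duration.ofSeconds(5)).GET().build();
                    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                    long millis = (System.nanoTime() - start) / 1_000_000;
                    if (response.statusCode() != 200 || millis > 3_000) {
                        alert("Synthetic read failed: status=" + response.statusCode() + " time=" + millis + "ms");
                    }
                } catch (Exception e) {
                    alert("Synthetic read threw: " + e.getMessage());
                }
            }, 0, 1, TimeUnit.MINUTES);
        }

        // In a real system this would raise a proper alert via the monitoring stack, not just print.
        private static void alert(String message) {
            System.err.println(message);
        }
    }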

  86. @sarahjwells 2. Concentrate on the stuff that matters

  87. @sarahjwells It’s the business functionality you should care about

  88. None
  89. We care about whether content got published successfully

  90. None
  91. When people call our APIs, we care about speed

  92. … we also care about errors

  93. None
  94. But it's the end-to-end that matters https://www.flickr.com/photos/robef/16537786315/

  95. If you just want information, create a dashboard or report

  96. @sarahjwells Checking the services involved in a business flow

  97. /__health?categories=lists-publish

  98. None
  99. None
  100. @sarahjwells 3. Cultivate your alerts

  101. Make each alert great http://www.thestickerfactory.co.uk/

  102. @sarahjwells Splunk Alert: PROD - MethodeAPIResponseTime5MAlert Business Impact The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out. ...

  103. @sarahjwells Splunk Alert: PROD - MethodeAPIResponseTime5MAlert Business Impact The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out. ...

  104. @sarahjwells … Technical Impact The server is experiencing service degradation because of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.

  105. @sarahjwells Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time transaction_id uuid Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe

  106. @sarahjwells Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time transaction_id uuid Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe

  107. @sarahjwells Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time transaction_id uuid Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe

  108. @sarahjwells Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time transaction_id uuid Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe

  109. Make sure you can't miss an alert

  110. @sarahjwells ‘Ops Cops’ keep an eye on our systems

  111. @sarahjwells Use the right communication channel

  112. @sarahjwells It’s not email

  113. Slack integration
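
As a sketch of the kind of integration the slide points at: pushing an alert into a Slack channel through an incoming webhook, so it lands where the team is actually looking rather than in an email filter. The webhook URL and message wording are placeholders; this is not the FT's integration code.

    // Illustrative only: send an alert message to a Slack channel via an incoming webhook.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SlackAlerter {

        // Placeholder webhook URL; a real one comes from Slack's incoming-webhook setup.
        private static final URI WEBHOOK = URI.create("https://hooks.slack.com/services/T000/B000/XXXX");

        private final HttpClient client = HttpClient.newHttpClient();

        public void send(String text) throws Exception {
            String payload = "{\"text\": \"" + text.replace("\"", "\\\"") + "\"}";
            HttpRequest request = HttpRequest.newBuilder(WEBHOOK)
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new IllegalStateException("Slack webhook returned " + response.statusCode());
            }
        }
    }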

  114. None
  115. @sarahjwells Support isn’t just getting the system fixed

  116. None
  117. @sarahjwells ‘You build it, you run it’?

  118. @sarahjwells Review the alerts you get

  119. If it isn't helpful, make sure you don't get sent it again

  120. See if you can improve it www.workcompass.com/

  121. @sarahjwells When you didn't get an alert

  122. What would have told you about this?

  123. @sarahjwells

  124. @sarahjwells Setting up an alert is part of fixing the problem ✔ code ✔ test alerts

  125. System boundaries are more difficult Severin.stalder [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

  126. @sarahjwells Make sure you would know if an alert stopped working

  127. Add a unit test public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() { … }
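
The slide shows only the method signature, so here is a hedged sketch of what such a test might check: if the Splunk alert fires on specific trigger words in a log line, a unit test can pin those words down so a harmless rewording does not silently kill the alert. The helper method, message wording and the saved-search terms are assumptions for illustration; the uuid and transaction id values are taken from the alert example earlier in the deck.

    // Illustrative only: protect the trigger words a (hypothetical) Splunk saved search matches on.
    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    public class PublishFailureAlertTest {

        // Assumed helper standing in for wherever production code builds its publish-failure log message.
        private String publishFailureLogMessage(String uuid, String transactionId) {
            return "Methode publish failure for uuid=" + uuid + " transaction_id=" + transactionId;
        }

        @Test
        public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
            String logMessage = publishFailureLogMessage("a56a2698-6e90-11e5-8608-a0853fb4e1fe", "tid_pbueyqnsqe");

            // These are the words the alert's search is assumed to match on.
            assertTrue(logMessage.contains("publish failure"));
            assertTrue(logMessage.contains("uuid="));
            assertTrue(logMessage.contains("transaction_id="));
        }
    }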

  128. Deliberately break things

  129. Chaos snail

  130. @sarahjwells It’s going to change: deal with it

  131. @sarahjwells Out of date information can be worse than none

  132. @sarahjwells Automate updates where you can

  133. @sarahjwells Find ways to share what’s changing

  134. @sarahjwells In summary: to avoid alerts overload…

  135. 1 Build a system you can support

  136. 2 Concentrate on the stuff that matters

  137. 3 Cultivate your alerts

  138. @sarahjwells A microservice architecture lets you move fast…

  139. @sarahjwells But there’s an associated operational cost

  140. @sarahjwells Make sure it’s a cost you’re willing to pay

  141. @sarahjwells Thank you