Alert Overload: How to adopt a microservices architecture without being overwhelmed with noise

You’ve heard all about what microservices can do for you. You’re convinced. So you build some. Reasoning about your functionality is way easier: these services are so simple! Then you get to the point where you have 35 microservices, and all the monitoring and alerting tactics you used for your monoliths are a complete disaster. Something needs to change and this talk explains what and how.

Talk given at Continuous Lifecycle London, May 2016

Sarah Wells

May 03, 2016
Transcript

  1. Alerts Overload How to adopt a microservices architecture without being

    overwhelmed with noise Sarah Wells @sarahjwells
  2. None
  3. Microservices make it worse

  4. microservices (n,pl): an efficient device for transforming business problems into

    distributed transaction problems @drsnooks
  5. You have a lot more systems

  6. 45 microservices

  7. 45 microservices 3 environments

  8. 45 microservices 3 environments 2 instances for each service

  9. 45 microservices 3 environments 2 instances for each service 20

    checks per instance
  10. 45 microservices 3 environments 2 instances for each service 20

    checks per instance running every 5 minutes
  11. > 1,500,000 system checks per day

  12. Over 19,000 system monitoring alerts in 50 days

  13. Over 19,000 system monitoring alerts in 50 days An average

    of 380 per day
  14. Functional monitoring is also an issue

  15. 12,745 response time/error alerts in 50 days

  16. 12,745 response time/error alerts An average of 255 per day

  17. Why so many?

  18. None
  19. None
  20. None
  21. None
  22. http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

  23. How can you make it better?

  24. Quick starts: attack your problem See our EngineRoom blog for

    more: http://bit.ly/1PP7uQQ
  25. 1 2 3

  26. Think about monitoring from the start 1

  27. It's the business functionality you care about

  28. None
  29. None
  30. 1

  31. 2 1

  32. 3 1 2

  33. 4 1 2 3

  34. We care about whether published content made it to us

  35. When people call our APIs, we care about speed

  36. … we also care about errors

  37. But it's the end-to-end that matters https://www.flickr.com/photos/robef/16537786315/

  38. You only want an alert where you need to take

    action
  39. If you just want information, create a dashboard or report

  40. Turn off your staging environment overnight and at weekends

  41. Make sure you can't miss an alert

  42. Make the alert great http://www.thestickerfactory.co.uk/

  43. Build your system with support in mind

  44. Transaction ids tie all microservices together

  45. None
  46. Healthchecks tell you whether a service is OK GET http://{service}/__health

  47. Healthchecks tell you whether a service is OK GET http://{service}/__health

    returns 200 if the service can run the healthcheck
  48. Healthchecks tell you whether a service is OK GET http://{service}/__health

    returns 200 if the service can run the healthcheck each check will return "ok": true or "ok": false
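
    A minimal Go sketch of the healthcheck shape these slides describe: GET /__health answers 200 whenever the checks can be run, and each individual check reports "ok": true or false. The check names and probe functions below are illustrative assumptions, not the FT's actual code.

      package main

      import (
          "encoding/json"
          "log"
          "net/http"
      )

      // checkResult is one entry in the healthcheck response body.
      type checkResult struct {
          Name string `json:"name"`
          OK   bool   `json:"ok"`
      }

      // Placeholder probes standing in for real dependency checks.
      func canReachDatabase() bool { return true }
      func canReachQueue() bool    { return true }

      func healthHandler(w http.ResponseWriter, r *http.Request) {
          results := []checkResult{
              {Name: "database-connectivity", OK: canReachDatabase()},
              {Name: "message-queue-connectivity", OK: canReachQueue()},
          }
          // 200 means "the service could run its checks", not "everything is fine":
          // callers inspect the per-check "ok" fields.
          w.Header().Set("Content-Type", "application/json")
          json.NewEncoder(w).Encode(map[string]interface{}{"checks": results})
      }

      func main() {
          http.HandleFunc("/__health", healthHandler)
          log.Fatal(http.ListenAndServe(":8080", nil))
      }
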
  49. None
  50. None
  51. Synthetic requests tell you about problems early https://www.flickr.com/photos/jted/5448635109
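
    One way to get that early warning, sketched in Go below, is to poll a known endpoint on a schedule and log slow or failed responses. The URL and thresholds are placeholders rather than the FT's real synthetic checks.

      package main

      import (
          "log"
          "net/http"
          "time"
      )

      func main() {
          // Placeholder endpoint representing a piece of known test content.
          const endpoint = "https://api.example.com/content/known-test-uuid"
          for range time.Tick(1 * time.Minute) {
              start := time.Now()
              resp, err := http.Get(endpoint)
              elapsed := time.Since(start)
              if err != nil {
                  log.Printf("synthetic check failed: %v", err)
                  continue
              }
              resp.Body.Close()
              // Thresholds here are arbitrary examples.
              if resp.StatusCode != http.StatusOK || elapsed > 2*time.Second {
                  log.Printf("synthetic check degraded: status=%d elapsed=%s", resp.StatusCode, elapsed)
              }
          }
      }
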

  52. Use the right tools for the job 2

  53. There are basic tools you need

  54. Service monitoring (e.g. Nagios)

  55. Log aggregation (e.g. Splunk)

  56. FT Platform: An internal PaaS

  57. Graphing (e.g. Graphite/Grafana)

  58. metrics:
        reporters:
          - type: graphite
            frequency: 1 minute
            durationUnit: milliseconds
            rateUnit: seconds
            host: <%= @graphite.host %>
            port: 2003
            prefix: content.<%= @config_env %>.api-policy-component.<%= scope.lookupvar('::hostname') %>
  59. None
  60. None
  61. Real time error analysis (e.g. Sentry)

  62. Build other tools to support you

  63. SAWS Built by Silvano Dossan See our Engine room blog:

    http://bit.ly/1GATHLy
  64. "I imagine most people do exactly what I do -

    create a google filter to send all Nagios emails straight to the bin"
  65. "Our screens have a viewing angle of about 10 degrees"

  66. "Our screens have a viewing angle of about 10 degrees"

    "It never seems to show the page I want"
  67. Code at: https://github.com/muce/SAWS

  68. Dashing

  69. None
  70. Nagios chart Built by Simon Gibbs @simonjgibbs See our Engine

    Room blog: http://engineroom.ft.com/2015/12/10/alerting-for-brains/
  71. None
  72. None
  73. None
  74. None
  75. Use the right communication channel

  76. It's not email

  77. Slack integration
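
    A small Go sketch of the kind of integration meant here: post an alert message to a Slack incoming webhook. The webhook URL and message text are placeholders.

      package main

      import (
          "bytes"
          "encoding/json"
          "log"
          "net/http"
      )

      // postToSlack sends a plain-text message to a Slack incoming webhook.
      func postToSlack(webhookURL, message string) error {
          payload, err := json.Marshal(map[string]string{"text": message})
          if err != nil {
              return err
          }
          resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
          if err != nil {
              return err
          }
          defer resp.Body.Close()
          return nil
      }

      func main() {
          // Placeholder URL: a real one comes from Slack's incoming-webhook setup.
          err := postToSlack("https://hooks.slack.com/services/T000/B000/XXXX",
              "PROD alert: publish failures detected, see the run book")
          if err != nil {
              log.Fatal(err)
          }
      }
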

  78. None
  79. Radiators everywhere

  80. Cultivate your alerts 3

  81. Review the alerts you get

  82. If it isn't helpful, make sure you don't get sent

    it again
  83. See if you can improve it www.workcompass.com/

  84. Splunk Alert: PROD - MethodeAPIResponseTime5MAlert Business Impact The methode api

    server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out. ...
  85. Splunk Alert: PROD - MethodeAPIResponseTime5MAlert Business Impact The methode api

    server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out. ...
  86. … Technical Impact The server is experiencing service degradation because

    of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.
  87. Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert

    There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time transaction_id uuid Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe
  88. Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert

    There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time transaction_id uuid Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe
  89. Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert

    There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time transaction_id uuid Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe
  90. When you didn't get an alert

  91. What would have told you about this?

  92. None
  93. Setting up an alert is part of fixing the problem

    ✔ code ✔ test alerts
  94. System boundaries are more difficult Severin.stalder [CC BY-SA 3.0 (http://creativecommons.

    org/licenses/by-sa/3.0)], via Wikimedia Commons
  95. Make sure you would know if an alert stopped working

  96. Add a unit test public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() { … }
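
    The slide shows the Java test name; a Go sketch of the same idea, failing the build if the log line a Splunk alert searches for stops containing its trigger words, might look like this (the log format and trigger words are illustrative assumptions):

      package publish

      import (
          "strings"
          "testing"
      )

      // publishFailureLogLine is a stand-in for however the service builds the
      // log message that the Splunk saved search matches on.
      func publishFailureLogLine(uuid, tid string) string {
          return "monitoring_event=true event=PublishFailure uuid=" + uuid + " transaction_id=" + tid
      }

      func TestPublishFailureLogLineIncludesSplunkTriggerWords(t *testing.T) {
          line := publishFailureLogLine("a56a2698-6e90-11e5-8608-a0853fb4e1fe", "tid_pbueyqnsqe")
          for _, word := range []string{"monitoring_event=true", "PublishFailure"} {
              if !strings.Contains(line, word) {
                  t.Errorf("alert trigger word %q missing from log line %q", word, line)
              }
          }
      }
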

  97. Deliberately break things

  98. Chaos snail

  99. The thing that sends you alerts needs to be up

    and running https://www.flickr.com/photos/davidmasters/2564786205/
  100. What happened to our alerts?

  101. We turned off ALL emails from system monitoring

  102. Our most important alerts come in via a team 'production

    alert' slack channel
  103. We created dashboards for our read APIs in Grafana

  104. We also have dashboards for our key metrics - the

    business related ones
  105. None
  106. None
  107. We do synthetic publishes for content and images

  108. What happened when we started again?

  109. Docker CoreOS AWS Fleet

  110. We thought about programming languages

  111. Using Go rather than Java by default

  112. Support for metrics https://github.com/rcrowley/go-metrics

  113. Output metrics to Graphite: go graphite.Graphite(metrics.DefaultRegistry, 5*time.Second, graphitePrefix, graphiteTCPAddress)
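
    Putting the two slides together, a rough sketch of wiring go-metrics to Graphite: record timings against the default registry and ship the whole registry every five seconds. The metric name, prefix, Graphite address and the reporter's import path are assumptions.

      package main

      import (
          "net"
          "time"

          graphite "github.com/cyberdelia/go-metrics-graphite" // assumed reporter package
          metrics "github.com/rcrowley/go-metrics"
      )

      func main() {
          // Record request durations against the default registry.
          responseTimes := metrics.NewRegisteredTimer("api.read.response-time", metrics.DefaultRegistry)
          responseTimes.Time(func() {
              // ... handle a request ...
          })

          // Flush everything in the registry to Graphite every 5 seconds,
          // matching the slide's graphite.Graphite(...) call.
          addr, _ := net.ResolveTCPAddr("tcp", "graphite.example.com:2003")
          go graphite.Graphite(metrics.DefaultRegistry, 5*time.Second, "content.prod.my-service", addr)

          select {} // keep the process alive in this sketch
      }
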

  114. Support for transactionIDs

  115. + Easy to add to http access logging
       - Have to pass around the transactionId for other logging as a function parameter
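
    A Go sketch of what that trade-off looks like in practice: middleware reads a transaction id from an incoming header, or mints one, logs with it, and then passes it to downstream code as an ordinary parameter. The header name and id format are assumptions.

      package main

      import (
          "fmt"
          "log"
          "math/rand"
          "net/http"
      )

      const transactionIDHeader = "X-Request-Id" // assumed header name

      // withTransactionID wraps a handler, supplying it with a transaction id.
      func withTransactionID(next func(w http.ResponseWriter, r *http.Request, tid string)) http.HandlerFunc {
          return func(w http.ResponseWriter, r *http.Request) {
              tid := r.Header.Get(transactionIDHeader)
              if tid == "" {
                  tid = fmt.Sprintf("tid_%d", rand.Int63()) // mint one if the caller did not send it
              }
              log.Printf("transaction_id=%s method=%s path=%s", tid, r.Method, r.URL.Path)
              next(w, r, tid)
          }
      }

      func publishHandler(w http.ResponseWriter, r *http.Request, tid string) {
          // Downstream calls and any further logging take tid as a plain parameter,
          // which is the "have to pass it around" cost noted on the slide.
          log.Printf("transaction_id=%s msg=\"publishing content\"", tid)
          w.WriteHeader(http.StatusOK)
      }

      func main() {
          http.HandleFunc("/publish", withTransactionID(publishHandler))
          log.Fatal(http.ListenAndServe(":8080", nil))
      }
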
  116. Support for healthchecks

  117. Logging that meets our needs

  118. Service monitoring

  119. None
  120. None
  121. None
  122. None
  123. None
  124. Log aggregation

  125. Integration with Dashing

  126. None
  127. Using Graphite/Grafana

  128. None
  129. None
  130. None
  131. We may change the way we do it, but the

    things we do are the same
  132. To summarise...

  133. Build microservices

  134. 1 2 3

  135. About technology at the FT: Look us up on Stack

    Overflow http://bit.ly/1H3eXVe Read our blog http://engineroom.ft.com/
  136. The FT on github https://github.com/Financial-Times/ https://github.com/ftlabs

  137. Thank you