Alert Overload: How to adopt a microservices architecture without being overwhelmed with noise

You’ve heard all about what microservices can do for you. You’re convinced. So you build some. Reasoning about your functionality is way easier: these services are so simple! Then you get to the point where you have 35 microservices, and all the monitoring and alerting tactics you used for your monoliths are a complete disaster. Something needs to change and this talk explains what and how.

Talk given at Continuous Lifecycle London, May 2016

Sarah Wells

May 03, 2016

Transcript

  1. Alerts Overload: How to adopt a microservices architecture without being overwhelmed with noise. Sarah Wells @sarahjwells
  2. 45 microservices, 3 environments, 2 instances for each service, 20 checks per instance, running every 5 minutes
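
The numbers on this slide multiply out quickly. As a rough worked example, assuming every service has the same number of instances and checks (the slide does not say whether that holds exactly), the daily volume can be computed like this:

```python
# Figures taken from the slide; the uniformity across services is an assumption.
MICROSERVICES = 45
ENVIRONMENTS = 3
INSTANCES_PER_SERVICE = 2
CHECKS_PER_INSTANCE = 20
INTERVAL_MINUTES = 5


def checks_per_run():
    """Number of individual checks executed in one 5-minute cycle."""
    return MICROSERVICES * ENVIRONMENTS * INSTANCES_PER_SERVICE * CHECKS_PER_INSTANCE


def checks_per_day():
    """Number of individual checks executed over 24 hours."""
    runs_per_day = (24 * 60) // INTERVAL_MINUTES
    return checks_per_run() * runs_per_day


print(checks_per_run())  # 5400
print(checks_per_day())  # 1555200
```

Even a tiny false-positive rate on that volume produces a steady stream of alerts.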
  6. Healthchecks tell you whether a service is OK
     GET http://{service}/__health returns 200 if the service can run the healthcheck
     Each check will return "ok": true or "ok": false
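
A point worth spelling out: a 200 only says the healthcheck ran, so a caller still has to look at each check's "ok" flag. The sketch below shows one way to do that. The payload shape (a top-level "checks" list whose entries carry "name" and "ok") is an assumption for illustration; the slide only confirms the "ok": true/false field.

```python
import json


def failing_checks(payload):
    """Return the names of checks whose "ok" flag is not true.

    Assumes a hypothetical payload shape: {"checks": [{"name": ..., "ok": ...}]}.
    """
    return [
        check.get("name", "<unnamed>")
        for check in payload.get("checks", [])
        if check.get("ok") is not True
    ]


def service_is_healthy(status_code, body):
    """A 200 means the healthcheck could run; the checks decide the outcome."""
    if status_code != 200:
        return False
    return not failing_checks(json.loads(body))


# Hypothetical response body, for illustration only.
sample = '{"checks": [{"name": "db", "ok": true}, {"name": "queue", "ok": false}]}'
print(failing_checks(json.loads(sample)))  # ['queue']
print(service_is_healthy(200, sample))     # False
```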
  7. metrics:
       reporters:
         - type: graphite
           frequency: 1 minute
           durationUnit: milliseconds
           rateUnit: seconds
           host: <%= @graphite.host %>
           port: 2003
           prefix: content.<%= @config_env %>.api-policy-component.<%= scope.lookupvar('::hostname') %>
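
Port 2003 in this configuration is Graphite's plaintext listener, which accepts lines of the form `<metric path> <value> <unix timestamp>`. As a minimal sketch of what a reporter sends, assuming a hypothetical metric name and an already-resolved prefix (the function names here are illustrative, not part of any library):

```python
import socket
import time


def graphite_line(prefix, name, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{prefix}.{name} {value} {ts}\n"


def send_to_graphite(host, line, port=2003, timeout=5.0):
    """Send a single plaintext line to a Graphite (Carbon) listener over TCP."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical environment, hostname and metric name, for illustration only.
prefix = "content.prod.api-policy-component.host1"
print(graphite_line(prefix, "requests.count", 42, timestamp=1444635834), end="")
```

The prefix places each host's metrics under a per-environment, per-service path, which keeps dashboards for different environments apart.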
  8. "I imagine most people do exactly what I do - create a google filter to send all Nagios emails straight to the bin"
  9. "Our screens have a viewing angle of about 10 degrees"
     "It never seems to show the page I want"
  10. Nagios chart. Built by Simon Gibbs @simonjgibbs. See our Engine Room blog: http://engineroom.ft.com/2015/12/10/alerting-for-brains/
  11. Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
      Business Impact: The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out. ...
  13. … Technical Impact: The server is experiencing service degradation because of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.
  14. Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert
      There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information.
      _time                     transaction_id  uuid
      Mon Oct 12 07:43:54 2015  tid_pbueyqnsqe  a56a2698-6e90-11e5-8608-a0853fb4e1fe
  17. The thing that sends you alerts needs to be up and running https://www.flickr.com/photos/davidmasters/2564786205/
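
One common way to notice that the alerting system itself has gone quiet is a heartbeat (sometimes called a dead man's switch): the alerter records a timestamp regularly, and an independent watcher complains when that timestamp gets too old. The slide does not say how the FT does this; the sketch below is only an illustration, and the function name and threshold are assumptions.

```python
import time


def alerting_is_stale(last_heartbeat, max_silence_seconds, now=None):
    """Return True if the alerter has not reported in for too long.

    last_heartbeat and now are Unix timestamps in seconds.
    """
    current = time.time() if now is None else now
    return (current - last_heartbeat) > max_silence_seconds


# Hypothetical values: last heartbeat 15 minutes ago, 10 minutes of silence allowed.
print(alerting_is_stale(last_heartbeat=1000.0, max_silence_seconds=600, now=1900.0))  # True
```

The watcher has to run somewhere that does not share the alerter's failure modes, otherwise both can go down together.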
  18. + Easy to add to HTTP access logging
      - Have to pass around the transactionId for other logging as a function parameter
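
The trade-off on this slide can be sketched as follows: the transaction ID is read from (or created for) each incoming request, then handed explicitly to every function that logs. The header name `X-Request-Id`, the function names, and the ID format (modelled on the `tid_pbueyqnsqe` value seen in the alert above) are assumptions for illustration.

```python
import logging
import random
import string

logger = logging.getLogger("publish")


def new_transaction_id():
    """Create an ID shaped like the example in the alert: "tid_" plus 10 letters."""
    suffix = "".join(random.choices(string.ascii_lowercase, k=10))
    return "tid_" + suffix


def transaction_id_for(headers, header_name="X-Request-Id"):
    """Reuse the caller's ID if one was sent, otherwise start a new one."""
    return headers.get(header_name) or new_transaction_id()


def format_log(transaction_id, message):
    """Prefix a log message so it can be searched by transaction ID."""
    return f"transaction_id={transaction_id} {message}"


def publish(uuid, transaction_id):
    """Every function that logs needs the ID passed in as a parameter."""
    logger.info(format_log(transaction_id, f"publishing uuid={uuid}"))


tid = transaction_id_for({"X-Request-Id": "tid_pbueyqnsqe"})
print(format_log(tid, "publish failed"))  # transaction_id=tid_pbueyqnsqe publish failed
```

Passing the ID explicitly is noisy but keeps the dependency visible; the same ID must also be forwarded on outgoing requests so that downstream services log it too.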
  19. We may change the way we do it, but the things we do are the same
  20. About technology at the FT:
      Look us up on Stack Overflow: http://bit.ly/1H3eXVe
      Read our blog: http://engineroom.ft.com/