
Alert Overload: How to adopt a microservices architecture without being overwhelmed with noise

You’ve heard all about what microservices can do for you. You’re convinced. So you build some. Reasoning about your functionality is way easier: these services are so simple! Then you get to the point where you have 35 microservices, and all the monitoring and alerting tactics you used for your monoliths are a complete disaster. Something needs to change and this talk explains what and how.

Talk given at Continuous Lifecycle London, May 2016

Sarah Wells

May 03, 2016

Transcript

  1. Alerts Overload
    How to adopt a microservices
    architecture without being
    overwhelmed with noise
    Sarah Wells
    @sarahjwells

  2. (image slide)

  3. Microservices make it worse


  4. microservices (n,pl): an efficient device for
    transforming business problems into distributed
    transaction problems
    @drsnooks


  5. You have a lot more systems


  6. 45 microservices


  7. 45 microservices
    3 environments


  8. 45 microservices
    3 environments
    2 instances for each service


  9. 45 microservices
    3 environments
    2 instances for each service
    20 checks per instance


  10. 45 microservices
    3 environments
    2 instances for each service
    20 checks per instance
    running every 5 minutes


  11. > 1,500,000 system checks
    per day
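    (Worked through from the previous slides: 45 services × 3 environments × 2 instances × 20 checks ≈ 5,400 checks; run every 5 minutes, that is 288 times a day, or roughly 1,555,000 checks per day.)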


  12. Over 19,000 system
    monitoring alerts in 50 days


  13. Over 19,000 system
    monitoring alerts in 50 days
    An average of 380 per day


  14. Functional monitoring is also an issue


  15. 12,745 response time/error
    alerts in 50 days


  16. 12,745 response time/error
    alerts
    An average of 255 per day


  17. Why so many?

  18. (image slide)

  19. (image slide)

  20. (image slide)

  21. (image slide)

  22. http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts


  23. How can you make it better?


  24. Quick starts: attack your problem
    See our EngineRoom blog for more:
    http://bit.ly/1PP7uQQ


  25. 1 2 3


  26. Think about monitoring from the start
    1


  27. It's the business functionality you care about

  28. (image slide)

  29. (image slide)

  30. 1


  31. 2
    1


  32. 3
    1
    2


  33. 4
    1
    2
    3


  34. We care about whether published content made it to us


  35. When people call our APIs, we care about speed


  36. … we also care about errors


  37. But it's the end-to-end that matters
    https://www.flickr.com/photos/robef/16537786315/


  38. You only want an alert where you need to take
    action


  39. If you just want information, create a dashboard or report


  40. Turn off your staging
    environment overnight and at
    weekends


  41. Make sure you can't miss an alert


  42. Make the alert great
    http://www.thestickerfactory.co.uk/


  43. Build your system with support in mind


  44. Transaction ids tie all microservices together

  45. (image slide)

  46. Healthchecks tell you whether a service is OK
    GET http://{service}/__health


  47. Healthchecks tell you whether a service is OK
    GET http://{service}/__health
    returns 200 if the service can run the healthcheck


  48. Healthchecks tell you whether a service is OK
    GET http://{service}/__health
    returns 200 if the service can run the healthcheck
    each check will return "ok": true or "ok": false
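    As a rough sketch of what such an endpoint can look like in Go (the JSON shape and check names here are illustrative assumptions, not the FT's published healthcheck format):

      package main

      import (
          "encoding/json"
          "net/http"
      )

      type check struct {
          Name string `json:"name"`
          OK   bool   `json:"ok"`
      }

      func health(w http.ResponseWriter, r *http.Request) {
          checks := []check{
              {Name: "can-read-from-content-store", OK: contentStoreReachable()},
              {Name: "message-queue-reachable", OK: queueReachable()},
          }
          w.Header().Set("Content-Type", "application/json")
          // A 200 only means the healthcheck itself could run;
          // callers inspect each check's "ok" value.
          json.NewEncoder(w).Encode(map[string]interface{}{"checks": checks})
      }

      // Stand-ins for real dependency probes.
      func contentStoreReachable() bool { return true }
      func queueReachable() bool        { return true }

      func main() {
          http.HandleFunc("/__health", health)
          http.ListenAndServe(":8080", nil)
      }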

  49. (image slide)

  50. (image slide)

  51. Synthetic requests tell you about problems early
    https://www.flickr.com/photos/jted/5448635109
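    A minimal sketch of the idea in Go: fire a known-good request on a timer and flag it when it fails. The URL and interval are placeholders, not the FT's actual setup:

      package main

      import (
          "log"
          "net/http"
          "time"
      )

      func main() {
          for range time.Tick(5 * time.Minute) {
              start := time.Now()
              resp, err := http.Get("https://api.example.com/content/synthetic-check")
              if err != nil {
                  log.Printf("synthetic check FAILED: %v", err)
                  continue
              }
              resp.Body.Close()
              if resp.StatusCode != http.StatusOK {
                  log.Printf("synthetic check FAILED: status %d", resp.StatusCode)
                  continue
              }
              log.Printf("synthetic check ok in %v", time.Since(start))
          }
      }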


  52. Use the right tools for the job
    2


  53. There are basic tools you need


  54. Service monitoring (e.g. Nagios)


  55. Log aggregation (e.g. Splunk)


  56. FT Platform: An internal PaaS


  57. Graphing (e.g. Graphite/Grafana)


  58. metrics:
        reporters:
          - type: graphite
            frequency: 1 minute
            durationUnit: milliseconds
            rateUnit: seconds
            host:
            port: 2003
            prefix: content..api-policy-component.<%= scope.lookupvar('::hostname') %>

  59. (image slide)

  60. (image slide)

  61. Real time error analysis (e.g. Sentry)


  62. Build other tools to support you


  63. SAWS
    Built by Silvano Dossan
    See our Engine room blog: http://bit.ly/1GATHLy


  64. "I imagine most people do exactly
    what I do - create a google filter to
    send all Nagios emails straight to the
    bin"


  65. "Our screens have a viewing angle of
    about 10 degrees"


  66. "Our screens have a viewing angle of
    about 10 degrees"
    "It never seems to show the page I
    want"


  67. Code at: https://github.com/muce/SAWS


  68. Dashing

  69. (image slide)

  70. Nagios chart
    Built by Simon Gibbs
    @simonjgibbs
    See our Engine Room blog: http://engineroom.ft.com/2015/12/10/alerting-for-brains/

  71. (image slide)

  72. (image slide)

  73. (image slide)

  74. (image slide)

  75. Use the right communication channel


  76. It's not email


  77. Slack integration
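    One minimal way to push an alert into a channel is Slack's incoming webhooks; a rough Go sketch, where the webhook URL and message text are placeholders:

      package main

      import (
          "bytes"
          "encoding/json"
          "fmt"
          "net/http"
      )

      // postToSlack sends a plain-text message to a Slack incoming webhook.
      func postToSlack(webhookURL, text string) error {
          payload, err := json.Marshal(map[string]string{"text": text})
          if err != nil {
              return err
          }
          resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
          if err != nil {
              return err
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              return fmt.Errorf("slack returned %d", resp.StatusCode)
          }
          return nil
      }

      func main() {
          postToSlack("https://hooks.slack.com/services/XXX/YYY/ZZZ",
              "PROD publish failure - see the run book")
      }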

  78. (image slide)

  79. Radiators everywhere


  80. Cultivate your alerts
    3


  81. Review the alerts you get


  82. If it isn't helpful, make sure you don't get sent it again


  83. See if you can improve it
    www.workcompass.com/


  84. Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
    Business Impact
    The methode api server is slow responding to requests.
    This might result in articles not getting published to the new
    content platform or publishing requests timing out.
    ...


  85. Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
    Business Impact
    The methode api server is slow responding to requests.
    This might result in articles not getting published to the new
    content platform or publishing requests timing out.
    ...



  86. Technical Impact
    The server is experiencing service degradation because of
    network latency, high publishing load, high bandwidth
    utilization, excessive memory or cpu usage on the VM. This
    might result in failure to publish articles to the new content
    platform.


  87. Splunk Alert: PROD Content Platform Ingester Methode
    Publish Failures Alert
    There has been one or more publish failures to the
    Universal Publishing Platform. The UUIDs are listed below.
    Please see the run book for more information.
    _time transaction_id uuid
    Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe


  88. Splunk Alert: PROD Content Platform Ingester Methode
    Publish Failures Alert
    There has been one or more publish failures to the
    Universal Publishing Platform. The UUIDs are listed below.
    Please see the run book for more information.
    _time transaction_id uuid
    Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe


  89. Splunk Alert: PROD Content Platform Ingester Methode
    Publish Failures Alert
    There has been one or more publish failures to the
    Universal Publishing Platform. The UUIDs are listed below.
    Please see the run book for more information.
    _time transaction_id uuid
    Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe


  90. When you didn't get an alert


  91. What would have told you about this?

  92. (image slide)

  93. Setting up an alert is part of fixing the problem
    ✔ code
    ✔ test
    alerts


  94. System boundaries are more difficult
    Severin.stalder [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons


  95. Make sure you would know if an alert stopped
    working


  96. Add a unit test
    @Test
    public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
        // e.g. assert that the log line written on a publish failure contains
        // the exact trigger words the Splunk alert searches for
    }


  97. Deliberately break things


  98. Chaos snail


  99. The thing that sends you alerts needs to be up and running
    https://www.flickr.com/photos/davidmasters/2564786205/


  100. What happened to our alerts?


  101. We turned off ALL emails from
    system monitoring


  102. Our most important alerts
    come in via a team 'production
    alert' slack channel


  103. We created dashboards for
    our read APIs in Grafana


  104. We also have dashboards for
    our key metrics - the business
    related ones

  105. (image slide)

  106. (image slide)

  107. We do synthetic publishes for
    content and images


  108. What happened when we started again?


  109. Docker
    CoreOS
    AWS
    Fleet


  110. We thought about programming languages


  111. Using Go rather than Java by
    default


  112. Support for metrics
    https://github.com/rcrowley/go-metrics


  113. Output metrics to Graphite:
    go graphite.Graphite(metrics.DefaultRegistry, 5*time.Second,
    graphitePrefix, graphiteTCPAddress)
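    Filled out as a small self-contained sketch: the registry and timer come from the go-metrics library named on the previous slide, while the graphite reporter package path, host and prefix value here are assumptions for illustration (github.com/cyberdelia/go-metrics-graphite is one common home for the reporter):

      package main

      import (
          "net"
          "time"

          graphite "github.com/cyberdelia/go-metrics-graphite"
          metrics "github.com/rcrowley/go-metrics"
      )

      func main() {
          addr, _ := net.ResolveTCPAddr("tcp", "graphite.example.com:2003")

          // Flush everything in the default registry to Graphite every 5 seconds.
          go graphite.Graphite(metrics.DefaultRegistry, 5*time.Second, "content.example.my-service", addr)

          // Example metric: time a piece of work.
          timer := metrics.GetOrRegisterTimer("api.lookup", metrics.DefaultRegistry)
          for {
              timer.Time(func() { time.Sleep(10 * time.Millisecond) })
          }
      }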


  114. Support for transactionIDs


  115. + Easy to add to http access logging
    - Have to pass around the
    transactionId for other logging as a
    function parameter
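    A rough sketch of that pattern: read a transaction id from an incoming header (the header name and id format here are assumptions), write it to the access log, and hand it to the handler as an explicit parameter, as the slide describes:

      package main

      import (
          "crypto/rand"
          "fmt"
          "log"
          "net/http"
      )

      // withTransactionID pulls a transaction id from the request (or makes one up)
      // and passes it on explicitly for any further logging.
      func withTransactionID(next func(http.ResponseWriter, *http.Request, string)) http.HandlerFunc {
          return func(w http.ResponseWriter, r *http.Request) {
              tid := r.Header.Get("X-Request-Id") // assumed header name
              if tid == "" {
                  b := make([]byte, 8)
                  rand.Read(b)
                  tid = fmt.Sprintf("tid_%x", b)
              }
              w.Header().Set("X-Request-Id", tid)
              log.Printf("transaction_id=%s method=%s path=%s", tid, r.Method, r.URL.Path)
              next(w, r, tid)
          }
      }

      func main() {
          http.HandleFunc("/content", withTransactionID(func(w http.ResponseWriter, r *http.Request, tid string) {
              log.Printf("transaction_id=%s event=lookup", tid)
              fmt.Fprintln(w, "ok")
          }))
          log.Fatal(http.ListenAndServe(":8080", nil))
      }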


  116. Support for healthchecks


  117. Logging that meets our needs


  118. Service monitoring

  119. (image slide)

  120. (image slide)

  121. (image slide)

  122. (image slide)

  123. (image slide)

  124. Log aggregation


  125. Integration with Dashing

  126. (image slide)

  127. Using Graphite/Grafana

  128. (image slide)

  129. (image slide)

  130. (image slide)

  131. We may change the way we
    do it, but the things we do are
    the same


  132. To summarise...


  133. Build microservices


  134. 1 2 3


  135. About technology at the FT:
    Look us up on Stack Overflow
    http://bit.ly/1H3eXVe
    Read our blog
    http://engineroom.ft.com/


  136. The FT on github
    https://github.com/Financial-Times/
    https://github.com/ftlabs


  137. Thank you
