QCon London 2017: Avoiding Alerts Overload From Microservices

Microservices can be a great way to work: the services are simple, you can use the right technology for the job, and deployments become smaller and less risky. Unfortunately, other things become more complex. You probably took some time to design a deployment pipeline and set up self-service provisioning, for example. But did the rest of your thinking about what “done” means catch up? Are you still setting up alerts, run books, and monitoring for each microservice as though it was a monolith?

Two years ago, a team at the FT started out building a microservices-based system from scratch. Their initial naive approach to monitoring meant that an underlying network issue could mean 20 people each receiving 10,000 alert emails overnight. With that volume, you can’t pick out the important stuff. In fact, your inbox is unusable unless you have everything filtered away where you’ll never see it. Furthermore, you have information radiators all over the place, but there’s always something flashing or the wrong colour. You can spend the whole day moving from one attention-grabbing screen to another.

That team now has over 150 microservices in production. So how did they get themselves out of that mess and regain control of their inboxes and their time? First, you have to work out what’s important, and then you have to focus ruthlessly on it. You need to see only the things you need to act on, presented in a way that tells you exactly what to do. Sarah shares how her team regained control and offers some tips and tricks.

Sarah Wells

March 07, 2017

Transcript

  1. Avoiding alerts overload
    from microservices
    Sarah Wells
    Principal Engineer, Financial Times
    @sarahjwells

  2. @sarahjwells
    Knowing when there’s a problem isn’t enough

  3. You only want an alert when you need
    to take action

  4. @sarahjwells
    Hello

  5. @sarahjwells
    Monitoring this system…

  6. @sarahjwells
    Microservices make it worse

  7. “microservices (n,pl): an efficient device
    for transforming business problems
    into distributed transaction problems”
    @drsnooks

  8. @sarahjwells
    The services *themselves* are simple…

  9. @sarahjwells
    There’s a lot of complexity around them

  10. @sarahjwells
    Why do they make monitoring harder?

  11. @sarahjwells
    You have a lot more services

  12. @sarahjwells
    99 functional microservices
    350 running instances

  13. @sarahjwells
    52 non-functional services
    218 running instances

  14. @sarahjwells
    That’s 568 separate running instances

  15. @sarahjwells
    If we checked each running instance every minute…

  16. @sarahjwells
    817,920 checks per day

  17. @sarahjwells
    What about system checks?

  18. @sarahjwells
    16,358,400 checks per day

  19. @sarahjwells
    “One-in-a-million” issues would hit us 16 times every
    day

  20. @sarahjwells
    Running containers on shared VMs reduces this to
    92,160 system checks per day

  21. @sarahjwells
    For a total of 910,080 checks per day
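
    Working those figures through (the per-instance and per-VM multipliers are back-calculated from the slide totals rather than stated explicitly):

      568 running instances × 1,440 minutes/day                  =    817,920 service checks/day
      568 instances × 20 system-level checks × 1,440 minutes/day = 16,358,400 system checks/day
      64 shared VMs × 1,440 minutes/day                          =     92,160 system checks/day (containers on shared VMs)
      817,920 + 92,160                                           =    910,080 checks/day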

  22. @sarahjwells
    It’s a distributed system

  23. @sarahjwells
    Services are not independent

  24. http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

  25. @sarahjwells
    You have to change how you think about monitoring

  26. How can you make it better?

  27. @sarahjwells
    1. Build a system you can support

  28. @sarahjwells
    The basic tools you need

  29. @sarahjwells
    Log aggregation

  30. @sarahjwells
    Logs go missing or get delayed more now

  31. @sarahjwells
    Which means log-based alerts may miss stuff

  32. @sarahjwells
    Monitoring

  33. @sarahjwells
    Limitations of our Nagios integration…

  34. @sarahjwells
    No ‘service-level’ view

  35. @sarahjwells
    Default checks included things we couldn’t fix

  36. @sarahjwells
    A new approach for our container stack

  37. @sarahjwells
    We care about each service

  38. @sarahjwells
    We care about each VM

  39. @sarahjwells
    We care about unhealthy instances

  40. @sarahjwells
    Monitoring needs aggregating somehow

  41. @sarahjwells
    SAWS

  42. Built by Silvano Dossan
    See our Engine room blog: http://bit.ly/1GATHLy

  43. @sarahjwells
    "I imagine most people do exactly what I do - create
    a google filter to send all Nagios emails straight to
    the bin"

  44. @sarahjwells
    "Our screens have a viewing angle of about 10
    degrees"

  45. @sarahjwells
    "It never seems to show the page I want"

  46. @sarahjwells
    Code at: https://github.com/muce/SAWS

  47. @sarahjwells
    Dashing

  48. @sarahjwells
    Graphing of metrics

  49. https://www.flickr.com/photos/davidmasters/2564786205/

  50. @sarahjwells
    The things that make those tools WORK

  51. @sarahjwells
    Effective log aggregation needs a way to find all
    related logs

  52. Transaction ids tie all microservices together

  53. @sarahjwells
    Make it easy for any language you use
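
    To make the transaction-id idea concrete, here is a minimal Java sketch: a servlet filter takes the id from an incoming header (or generates one) and puts it in the logging context so every log line can carry it. The header name, the tid_ prefix and the MDC key are illustrative assumptions, not the FT's actual standard.

      import java.io.IOException;
      import java.util.UUID;
      import javax.servlet.*;
      import javax.servlet.http.HttpServletRequest;
      import org.slf4j.MDC;

      // Illustrative only: attach a transaction id to every request and expose it to the logger.
      public class TransactionIdFilter implements Filter {
          private static final String HEADER = "X-Request-Id";   // assumed header name

          @Override
          public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                  throws IOException, ServletException {
              String tid = (req instanceof HttpServletRequest)
                      ? ((HttpServletRequest) req).getHeader(HEADER)
                      : null;
              if (tid == null || tid.isEmpty()) {
                  tid = "tid_" + UUID.randomUUID();               // generate one at the edge if absent
              }
              MDC.put("transaction_id", tid);                     // log pattern can include %X{transaction_id}
              try {
                  chain.doFilter(req, res);                       // downstream calls should forward the same header
              } finally {
                  MDC.remove("transaction_id");                   // don't leak ids across pooled threads
              }
          }

          @Override public void init(FilterConfig config) {}
          @Override public void destroy() {}
      }

    Whatever the language, the pattern is the same: pick the id up at the edge, pass it on every outgoing call, and log it on every line so log aggregation can stitch a whole publish together.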

  54. @sarahjwells

  55. @sarahjwells
    Services need to report on their own health

  56. The FT healthcheck standard
    GET http://{service}/__health

  57. The FT healthcheck standard
    GET http://{service}/__health
    returns 200 if the service can run the healthcheck

  58. The FT healthcheck standard
    GET http://{service}/__health
    returns 200 if the service can run the healthcheck
    each check will return "ok": true or "ok": false
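
    A minimal sketch of a service exposing that standard, using only the JDK's built-in HTTP server; the JSON shape beyond the 200 status and the per-check "ok" flag is assumed for illustration rather than taken from the FT specification.

      import com.sun.net.httpserver.HttpServer;
      import java.io.OutputStream;
      import java.net.InetSocketAddress;
      import java.nio.charset.StandardCharsets;

      public class HealthCheckServer {
          public static void main(String[] args) throws Exception {
              HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
              // /__health returns 200 whenever the service can run its checks;
              // individual problems are reported via "ok": false on each check.
              server.createContext("/__health", exchange -> {
                  boolean dbOk = pingDatabase();                  // hypothetical dependency check
                  String body = "{ \"checks\": [ { \"name\": \"database\", \"ok\": " + dbOk + " } ] }";
                  byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
                  exchange.getResponseHeaders().add("Content-Type", "application/json");
                  exchange.sendResponseHeaders(200, bytes.length);
                  try (OutputStream out = exchange.getResponseBody()) {
                      out.write(bytes);
                  }
              });
              server.start();
          }

          private static boolean pingDatabase() {
              return true;                                        // stand-in for a real dependency check
          }
      }

    Slide 69 later filters the same endpoint by category (/__health?categories=lists-publish); a fuller implementation would read that query parameter and run only the checks for that business flow.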

  59. @sarahjwells
    Knowing about problems before your clients do

  60. Synthetic requests tell you about problems early
    https://www.flickr.com/photos/jted/5448635109
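
    One way to get that early warning, sketched here in Java: poll a key endpoint on a schedule and raise an alert when it errors or responds slowly. The URL, thresholds and one-minute interval are placeholder assumptions; a real synthetic check would exercise a whole user journey, such as publishing a test piece of content and reading it back.

      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;
      import java.time.Duration;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;

      public class SyntheticCheck {
          public static void main(String[] args) {
              HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(5)).build();
              HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/content/some-id"))
                      .timeout(Duration.ofSeconds(5))
                      .GET()
                      .build();

              // Fire the same request every minute and alert on errors or slow responses.
              Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
                  long start = System.nanoTime();
                  try {
                      HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
                      long millis = (System.nanoTime() - start) / 1_000_000;
                      if (response.statusCode() != 200 || millis > 2_000) {
                          raiseAlert("synthetic check: status=" + response.statusCode() + ", time=" + millis + "ms");
                      }
                  } catch (Exception e) {
                      raiseAlert("synthetic check failed: " + e.getMessage());
                  }
              }, 0, 1, TimeUnit.MINUTES);
          }

          private static void raiseAlert(String message) {
              System.err.println(message);                        // stand-in for a real alerting channel
          }
      }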

  61. @sarahjwells
    2. Concentrate on the stuff that matters

  62. @sarahjwells
    It’s the business functionality you should care
    about

  63. We care about whether content got published successfully

  64. When people call our APIs, we care about speed

  65. … we also care about errors

  66. But it's the end-to-end that matters
    https://www.flickr.com/photos/robef/16537786315/

  67. If you just want information, create a dashboard or report

  68. @sarahjwells
    Checking the services involved in a business flow

  69. /__health?categories=lists-publish

  70. @sarahjwells
    3. Cultivate your alerts

  71. Make each alert great
    http://www.thestickerfactory.co.uk/

  72. @sarahjwells
    Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
    Business Impact
    The methode api server is slow responding to requests.
    This might result in articles not getting published to the
    new content platform or publishing requests timing out.
    ...

  73. @sarahjwells
    Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
    Business Impact
    The methode api server is slow responding to requests.
    This might result in articles not getting published to the
    new content platform or publishing requests timing out.
    ...

  74. @sarahjwells

    Technical Impact
    The server is experiencing service degradation because
    of network latency, high publishing load, high
    bandwidth utilization, excessive memory or cpu usage
    on the VM. This might result in failure to publish
    articles to the new content platform.

  75. @sarahjwells
    Splunk Alert: PROD Content Platform Ingester Methode
    Publish Failures Alert
    There has been one or more publish failures to the
    Universal Publishing Platform. The UUIDs are listed
    below.
    Please see the run book for more information.
    _time transaction_id uuid
    Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe

  76. @sarahjwells
    Splunk Alert: PROD Content Platform Ingester Methode
    Publish Failures Alert
    There has been one or more publish failures to the
    Universal Publishing Platform. The UUIDs are listed
    below.
    Please see the run book for more information.
    _time transaction_id uuid
    Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe

  77. @sarahjwells
    Splunk Alert: PROD Content Platform Ingester Methode
    Publish Failures Alert
    There has been one or more publish failures to the
    Universal Publishing Platform. The UUIDs are listed
    below.
    Please see the run book for more information.
    _time transaction_id uuid
    Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe

  78. @sarahjwells
    Splunk Alert: PROD Content Platform Ingester Methode
    Publish Failures Alert
    There has been one or more publish failures to the
    Universal Publishing Platform. The UUIDs are listed
    below.
    Please see the run book for more information.
    _time transaction_id uuid
    Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe

  79. Make sure you can't miss an alert

  80. @sarahjwells
    ‘Ops Cops’ keep an eye on our systems

  81. @sarahjwells
    Use the right communication channel

  82. @sarahjwells
    It’s not email

  83. Slack integration
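
    Slack's incoming webhooks make that integration a few lines of code. This sketch posts an alert message with the JDK HTTP client; the webhook URL is a placeholder issued per channel by Slack, and any real message should carry the business impact and run book link shown earlier.

      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;

      public class SlackAlert {
          private static final String WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX";  // placeholder

          public static void send(String message) throws Exception {
              // Incoming webhooks accept a JSON payload with a "text" field.
              String payload = "{\"text\": \"" + message.replace("\"", "\\\"") + "\"}";
              HttpRequest request = HttpRequest.newBuilder(URI.create(WEBHOOK_URL))
                      .header("Content-Type", "application/json")
                      .POST(HttpRequest.BodyPublishers.ofString(payload))
                      .build();
              HttpResponse<String> response = HttpClient.newHttpClient()
                      .send(request, HttpResponse.BodyHandlers.ofString());
              if (response.statusCode() != 200) {
                  System.err.println("Failed to post alert to Slack: " + response.statusCode());
              }
          }

          public static void main(String[] args) throws Exception {
              send("PROD Content Platform Ingester Methode Publish Failures Alert (see the run book)");
          }
      }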

  84. @sarahjwells
    Support isn’t just getting the system fixed

  85. @sarahjwells
    ‘You build it, you run it’?

  86. @sarahjwells
    Review the alerts you get

  87. If it isn't helpful, make sure it doesn't get sent again

  88. See if you can improve it
    www.workcompass.com/

  89. @sarahjwells
    When you didn't get an alert

  90. What would have told you about this?

  91. @sarahjwells

  92. @sarahjwells
    Setting up an alert is part of fixing the problem
    ✔ code
    ✔ test
    alerts

  93. System boundaries are more difficult
    Severin.stalder [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via
    Wikimedia Commons

  94. @sarahjwells
    Make sure you would know if an alert stopped
    working

  95. Add a unit test
    public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
        // e.g. assert that the failure log line contains the exact trigger
        // phrase the Splunk alert searches for (the assumed intent of this test)
    }

  96. Deliberately break things

  97. @sarahjwells
    It’s going to change: deal with it

  98. @sarahjwells
    Out-of-date information can be worse than none

  99. @sarahjwells
    Automate updates where you can

  100. @sarahjwells
    Find ways to share what’s changing

  101. @sarahjwells
    In summary: to avoid alerts overload…

  102. 1
    Build a system you can support

  103. 2
    Concentrate on the stuff that matters

  104. 3
    Cultivate your alerts

  105. @sarahjwells
    A microservice architecture lets you move fast…

  106. @sarahjwells
    But there’s an associated operational cost

  107. @sarahjwells
    Make sure it’s a cost you’re willing to pay

  108. @sarahjwells
    Thank you
