
QCon London 2017: Avoiding Alerts Overload From Microservices

Microservices can be a great way to work: the services are simple, you can use the right technology for the job, and deployments become smaller and less risky. Unfortunately, other things become more complex. You probably took some time to design a deployment pipeline and set up self-service provisioning, for example. But did the rest of your thinking about what “done” means catch up? Are you still setting up alerts, run books, and monitoring for each microservice as though it was a monolith?

Two years ago, a team at the FT started out building a microservices-based system from scratch. Their initial naive approach to monitoring meant that an underlying network issue could mean 20 people each receiving 10,000 alert emails overnight. With that volume, you can’t pick out the important stuff. In fact, your inbox is unusable unless you have everything filtered away where you’ll never see it. Furthermore, you have information radiators all over the place, but there’s always something flashing or the wrong colour. You can spend the whole day moving from one attention-grabbing screen to another.

That team now has over 150 microservices in production. So how did they get themselves out of that mess and regain control of their inboxes and their time? First, you have to work out what’s important, and then you have to focus ruthlessly on that. You need to be able to see only the things you need to take action on, presented in a way that tells you exactly what you need to do. Sarah shares how her team regained control and offers some tips and tricks.

Sarah Wells

March 07, 2017

Transcript

  1. 1

  2. 1 2

  3. @sarahjwells "I imagine most people do exactly what I do - create a google filter to send all Nagios emails straight to the bin"
  4. The FT healthcheck standard: GET http://{service}/__health returns 200 if the service can run the healthcheck; each check will return "ok": true or "ok": false
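
Slide 4 only sketches the shape of that endpoint. As an illustration (not the FT's actual implementation), a minimal Go service exposing a healthcheck of that form might look like the following; the check names and the pingDatabase/pingDownstream helpers are hypothetical:

    package main

    import (
        "encoding/json"
        "net/http"
    )

    // check is one healthcheck result; the JSON shape follows the slide:
    // each check carries "ok": true or "ok": false.
    type check struct {
        Name string `json:"name"`
        OK   bool   `json:"ok"`
    }

    // healthHandler runs every check and returns 200 as long as the checks
    // could be run -- a failing check is reported in the body, not the status.
    func healthHandler(w http.ResponseWriter, r *http.Request) {
        checks := []check{
            {Name: "database reachable", OK: pingDatabase()},
            {Name: "downstream API reachable", OK: pingDownstream()},
        }
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(map[string]interface{}{"checks": checks})
    }

    // pingDatabase and pingDownstream are hypothetical stand-ins for real checks.
    func pingDatabase() bool   { return true }
    func pingDownstream() bool { return true }

    func main() {
        http.HandleFunc("/__health", healthHandler)
        http.ListenAndServe(":8080", nil)
    }

The point of this shape is that a 200 status only means the healthcheck ran; whether each dependency is healthy is read from the "ok" flags in the body.
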
  5. @sarahjwells Splunk Alert: PROD - MethodeAPIResponseTime5MAlert Business Impact The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out. ...
  6. @sarahjwells Splunk Alert: PROD - MethodeAPIResponseTime5MAlert Business Impact The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out. ...
  7. @sarahjwells … Technical Impact The server is experiencing service degradation because of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.
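
Slides 5-7 show the same alert carrying both a business impact and a technical impact statement. As a rough sketch of that alert shape in Go (my own illustration, not FT code), the fields involved might look like this:

    package alerts

    // alert captures the ingredients the slides show a good alert containing:
    // what it is, why the business should care, what has technically degraded,
    // and where to look next.
    type alert struct {
        Name            string // e.g. "PROD - MethodeAPIResponseTime5MAlert" (from the slide)
        BusinessImpact  string // e.g. articles may not reach the new content platform
        TechnicalImpact string // e.g. degradation due to latency, load, memory or CPU
        RunbookURL      string // hypothetical field: link to the run book with remediation steps
    }
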
  8. @sarahjwells Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time: Mon Oct 12 07:43:54 2015 | transaction_id: tid_pbueyqnsqe | uuid: a56a2698-6e90-11e5-8608-a0853fb4e1fe
  9. @sarahjwells Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time: Mon Oct 12 07:43:54 2015 | transaction_id: tid_pbueyqnsqe | uuid: a56a2698-6e90-11e5-8608-a0853fb4e1fe
  10. @sarahjwells Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time: Mon Oct 12 07:43:54 2015 | transaction_id: tid_pbueyqnsqe | uuid: a56a2698-6e90-11e5-8608-a0853fb4e1fe
  11. @sarahjwells Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information. _time: Mon Oct 12 07:43:54 2015 | transaction_id: tid_pbueyqnsqe | uuid: a56a2698-6e90-11e5-8608-a0853fb4e1fe
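
An alert like the one on slides 8-11 can only list the affected transaction_id and uuid values because the services log them on every failure. A minimal sketch, assuming Go and a key=value log format that an aggregator such as Splunk could index and alert on (the event name and field layout are assumptions; the example values are taken from the slide):

    package main

    import (
        "errors"
        "log"
        "time"
    )

    // logPublishFailure emits one event per failed publish, carrying the
    // identifiers an alert needs to list exactly which articles failed.
    func logPublishFailure(transactionID, uuid string, err error) {
        log.Printf("event=PublishFailed time=%s transaction_id=%s uuid=%s error=%q",
            time.Now().UTC().Format(time.RFC3339), transactionID, uuid, err)
    }

    func main() {
        // Example values taken from the slide above.
        logPublishFailure("tid_pbueyqnsqe", "a56a2698-6e90-11e5-8608-a0853fb4e1fe",
            errors.New("Universal Publishing Platform returned 503"))
    }
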