Crafting performance alerting tools

From Velocity 2015, I bring you... Crafting performance alerting tools!

The performance team at Etsy has recently developed a number of new alerting tools to help us discover and dig into performance regressions across the site. We built these new tools on top of existing technology (Nagios, Nagios Herald, and Graphite) to bring added context to site slowdowns and help us fix regressions more quickly.

These new alerts change the conversation with our coworkers. Before, we’d ask people about regressions long after they occurred. Now, we discover regressions almost immediately, and are able to share context such as graphs and recent site changes as we work with other teams to track down and fix regressions.

In this talk, Allison McKnight, performance engineer at Etsy, will cover:

- How we created alerts for backend performance slowdowns
- How we iterated on adding context to those alerts, including experiment ramp-ups, the week-over-week state of our most popular and slowest pages, and better graphs
- How we built a dashboard for these alerts that highlights what’s currently an issue, allowing users to play with settings to dig into what is affected by the regression and to compare related pages
- How we built an IRC command to help us do this work alongside our daily chatter, which helps us ask our teammates in real-time for more context
- How good tools end up being contagious around a company
- The future of our alerts: alerting on performance wins and native app metrics, and automatically including other teams in alerts about their pages

Open-source tools mentioned in this talk:

github.com/etsy/logster

github.com/etsy/nagios_tools

github.com/etsy/nagios-herald


Allison McKnight

May 29, 2015

Transcript

  1. Velocity Santa Clara – May 28, 2015
    @aemcknig
    Allison McKnight
    Crafting Performance Alerting Tools

  2. Agenda
    Here’s how it was
    Adding monitoring
    Iterating on alerting & tools
    Here’s how it is
    What’s next?
    Questions

  3. Performance at Etsy
    Kristyn, Allison, Lara, Natalya

  4. Graph everything.

  5. Graphing performance
    web logs
    phptime: 320 ms

  6. Graphing performance
    Logster (github.com/etsy/logster)
    web logs
    phptime:320
    phptime:520
    phptime:600
    phptime:410
    phptime:380
    phptime:804
    count:      6
    average:    505
    median:     410
    95th perc:  804
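
The aggregation on the slide above can be sketched roughly as follows. This is a minimal illustration, not Logster's actual parser: the `phptime:<ms>` field format is taken from the slide, while the function names, the truncated integer average, and the nearest-rank percentile rule are assumptions chosen because they reproduce the slide's numbers.

```python
import re

# Matches fields shaped like the "phptime:320" lines shown on the slide.
PHPTIME_RE = re.compile(r"phptime:(\d+)")


def nearest_rank(sorted_values, percent):
    """Return the nearest-rank percentile of an ascending, non-empty list."""
    rank = (percent * len(sorted_values) + 99) // 100  # integer ceiling
    return sorted_values[max(rank, 1) - 1]


def summarize(log_lines):
    """Aggregate phptime values (ms) found in an iterable of log lines."""
    times = sorted(
        int(match.group(1))
        for line in log_lines
        for match in PHPTIME_RE.finditer(line)
    )
    if not times:
        return None
    return {
        "count": len(times),
        "average": sum(times) // len(times),  # assumption: truncate to whole ms
        "median": nearest_rank(times, 50),    # assumption: nearest-rank median
        "95th_perc": nearest_rank(times, 95),
    }


sample = ["phptime:320", "phptime:520", "phptime:600",
          "phptime:410", "phptime:380", "phptime:804"]
print(summarize(sample))
# {'count': 6, 'average': 505, 'median': 410, '95th_perc': 804}
```

Each of these summary values would then be sent to Graphite as its own metric, which is what makes the backend-time graphs possible.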

  7. Graphing performance
    backend time (ms)

  8. We needed monitoring

  11. Regression report
    • Didn’t catch small or slow-creep regressions
    • Difficult to tune
    • Additional investigation was required to verify and understand regressions
    • Alert fatigue

  12. What could we do here?
    • Enforce better graph-watching during deploys
    • Change alerting mechanism
    • Create tools to help investigate regressions
    • Change alert format

  14. Changing the alerting mechanism

  15. Monitoring page performance with Nagios

  16. Nagios
    fast and fine-tuned alerting

  17. Nagios
    check graphite data script
    github.com/etsy/nagios_tools

  18. Nagios
    Individual check for each service you’d like to monitor

  19. Nagios
    Individual thresholds for each page you’d like to monitor

  20. How do you choose thresholds for 40 pages?

  21. Creating tools

  23. recommended thresholds
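
The slides do not say how the recommended thresholds are computed, so the following is only one plausible approach, sketched under stated assumptions: take a high percentile of a page's recent backend times as the baseline, then add fixed headroom for the warning and critical levels. The function name, the 95th-percentile baseline, and the 10% and 25% margins are all hypothetical.

```python
def recommend_thresholds(history_ms, percent=95, warn_pct=10, crit_pct=25):
    """Suggest (warn, crit) thresholds in ms from recent timing samples.

    Hypothetical rule: baseline = nearest-rank `percent`th percentile of the
    history; warn and crit sit `warn_pct` and `crit_pct` percent above it.
    """
    if not history_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(history_ms)
    rank = max(1, -(-percent * len(ordered) // 100))  # integer ceiling
    baseline = ordered[rank - 1]
    warn = baseline * (100 + warn_pct) // 100
    crit = baseline * (100 + crit_pct) // 100
    return warn, crit


# Made-up samples for a single page, in ms.
recent = [500, 510, 495, 520, 505, 515, 530, 490, 525, 540]
print(recommend_thresholds(recent))  # (594, 675)
```

Running something like this per page turns the "thresholds for 40 pages" problem into reviewing a list of suggestions rather than guessing each value by hand.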

  25. define service {
        use                     graphite-service
        host_name               Performance
        service_description     Test shop_policy Performance
        check_command           check_graphite_data!600!650!300!'http://...'
        check_interval          1
        retry_interval          5
        max_check_attempts      10
        notification_interval   1440
        contact_groups          performance
      }

  30. Creating a tool to visualize our performance alerts helped us develop well-tuned alerts.

  33. Current value: 768.5, warn threshold: 750.0, critical threshold: 800.0
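
The status line above is the kind of output a threshold check produces. As a rough sketch (not the actual check_graphite_data script), the comparison could look like this. The exit codes 0 through 3 are the standard Nagios plugin return codes; the function name and the choice of `>=` for the comparisons are assumptions.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(current, warn, crit):
    """Return (exit_code, status_line) for a metric where higher is worse."""
    if current is None:
        # No datapoint came back from the metrics store.
        return UNKNOWN, "UNKNOWN: no data returned"
    if current >= crit:      # assumption: thresholds are inclusive
        code = CRITICAL
    elif current >= warn:
        code = WARNING
    else:
        code = OK
    line = (f"{LABELS[code]}: Current value: {current}, "
            f"warn threshold: {warn}, critical threshold: {crit}")
    return code, line


# Values taken from the slide; a real plugin would print the line and
# exit with the code so that Nagios can record the state.
code, line = evaluate(768.5, 750.0, 800.0)
print(code, line)
# 1 WARNING: Current value: 768.5, warn threshold: 750.0, critical threshold: 800.0
```

The line states what crossed which threshold but nothing about why, which is the gap the following slides address.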

  34. We needed alerts that helped us understand the problem.

  35. Changing the alert format

  36. Nagios Herald
    a tool for adding context to Nagios alerts
    github.com/etsy/nagios-herald

  43. With dependencies, we receive only actionable alerts
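
The transcript does not record which checks depended on which, so the sketch below only illustrates the general idea behind check dependencies: a failing check is not reported while the check it depends on is itself failing. The function and the check names (`graphite_data_fresh`, `shop_policy_performance`, `listing_performance`) are hypothetical.

```python
OK = 0  # Nagios state for a passing check


def actionable_alerts(states, depends_on):
    """Return the failing checks worth notifying about, sorted by name.

    states:     {check_name: nagios_state}
    depends_on: {check_name: parent_check_name}
    A failing check is suppressed while its parent is also failing.
    """
    alerts = []
    for name, state in states.items():
        if state == OK:
            continue
        parent = depends_on.get(name)
        if parent is not None and states.get(parent, OK) != OK:
            continue  # the parent failure is the thing to fix first
        alerts.append(name)
    return sorted(alerts)


# Hypothetical example: the data-freshness check is down, so the page check
# that relies on that data is muted; the independent check still alerts.
states = {
    "graphite_data_fresh": 2,
    "shop_policy_performance": 2,
    "listing_performance": 1,
}
depends_on = {"shop_policy_performance": "graphite_data_fresh"}
print(actionable_alerts(states, depends_on))
# ['graphite_data_fresh', 'listing_performance']
```

In Nagios itself this relationship is declared in configuration rather than code, but the filtering effect is the same: fewer alerts, each pointing at something that can be acted on.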

  44. Improving sleuthing tools

  48. Improved context and alerting tools led to easier and faster investigation

  49. #performance

  52. Improved context and alerting tools improved cross-team collaboration

  53. What’s next?

  54. Alerting on improvements

  55. Adding more context (teams, recent commits, alert history)

  56. Better alert integration (IRC alerts, alert deployers)

  57. More comprehensive alerting (front-end, mobile, API)

  58. Use context to improve your tools

  59. Questions?
    www.etsy.com/shop/cateanevski
    Resources
    Open-source tools:
    github.com/etsy/logster
    github.com/etsy/nagios_tools
    github.com/etsy/nagios-herald
    Icons: www.endlessicons.com
    @aemcknig
    Allison McKnight