Crafting performance alerting tools

From Velocity 2015, I bring you... Crafting performance alerting tools!

The performance team at Etsy has recently developed a number of new alerting tools to help us discover and dig into performance regressions across the site. We built these new tools on top of existing technology (Nagios, Nagios Herald, and Graphite) to bring added context to site slowdowns and help us fix regressions more quickly.

These new alerts change the conversation with our coworkers. Before, we’d ask people about regressions long after they occurred. Now, we discover regressions almost immediately, and are able to share context such as graphs and recent site changes as we work with other teams to track down and fix regressions.

In this talk, Allison McKnight, performance engineer at Etsy, will cover:

- How we created alerts for backend performance slowdowns
- How we iterated on adding context to those alerts, including experiment ramp-ups, the week-over-week state of our most popular and slowest pages, and better graphs
- How we built a dashboard for these alerts that highlights what’s currently an issue, allowing users to play with settings to dig into what is affected by the regression and to compare related pages
- How we built an IRC command to help us do this work alongside our daily chatter, which lets us ask our teammates in real time for more context
- How good tools end up being contagious around a company
- The future of our alerts: alerting on performance wins and native app metrics; automatically including other teams in alerts about their pages

Open-source tools mentioned in this talk:

github.com/etsy/logster

github.com/etsy/nagios_tools

github.com/etsy/nagios-herald


Allison McKnight

May 29, 2015

Transcript

  1. 2.

    Agenda
    • Here’s how it was
    • Adding monitoring
    • Iterating on alerting & tools
    • Here’s how it is
    • What’s next?
    • Questions
  2. 3.
  3. 6.

    Allison McKnight | @aemcknig
    Graphing performance with Logster (github.com/etsy/logster)

    Web logs: phptime:320 phptime:520 phptime:600 phptime:410 phptime:380 phptime:804
    count: 6
    average: 505
    median: 410
    95th perc: 804
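The aggregation on this slide can be sketched in a few lines of Python. This is not Logster's parser API; the function name `summarize` and the regex are hypothetical, and the rounding conventions (integer-truncated average, lower median, nearest-rank 95th percentile) are assumptions chosen because they reproduce the numbers shown on the slide.

```python
import math
import re

# Hypothetical pattern: matches the "phptime:<ms>" fields shown on the slide.
PHPTIME = re.compile(r"phptime:(\d+)")


def summarize(log_lines):
    """Aggregate phptime values from log lines into summary metrics.

    Assumed conventions (not confirmed by the talk): truncated average,
    lower median, nearest-rank 95th percentile.
    """
    times = sorted(
        int(match.group(1))
        for line in log_lines
        for match in PHPTIME.finditer(line)
    )
    if not times:
        return None  # nothing to report for this interval
    count = len(times)
    return {
        "count": count,
        "average": sum(times) // count,
        "median": times[(count - 1) // 2],
        "p95": times[math.ceil(0.95 * count) - 1],
    }


if __name__ == "__main__":
    lines = ["phptime:320", "phptime:520", "phptime:600",
             "phptime:410", "phptime:380", "phptime:804"]
    print(summarize(lines))
```

In a setup like the one described, metrics of this kind would be computed per interval and sent on to Graphite for graphing.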
  4. 8.

  5. 9.

  6. 13.

    Regression report
    • Didn’t catch small or slow-creep regressions
    • Difficult to tune
    • Additional investigation was required to verify and understand regressions
    • Alert fatigue
  7. 14.

    What could we do here?
    • Enforce better graph-watching during deploys
    • Change alerting mechanism
    • Create tools to help investigate regressions
    • Change alert format
  8. 15.

  9. 27.

    define service {
        use                     graphite-service
        host_name               Performance
        service_description     Test shop_policy Performance
        check_command           check_graphite_data!600!650!300!'http://...'
        check_interval          1
        retry_interval          5
        max_check_attempts      10
        notification_interval   1440
        contact_groups          performance
    }
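To make the check in this service definition concrete, here is a minimal Python sketch of a threshold check over Graphite data. It is not the actual `check_graphite_data` plugin from github.com/etsy/nagios_tools. The reading of the arguments is an assumption, since the slide does not show the command definition: 600 is treated as the warning threshold, 650 as the critical threshold, and 300 as an averaging window in seconds. The function name `check_threshold` is hypothetical. The input shape follows Graphite's JSON render output, where each datapoint is a `[value, timestamp]` pair and the value may be null.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(datapoints, warn, crit, window_seconds, now=None):
    """Average the recent datapoints and compare against thresholds.

    datapoints: list of [value, timestamp] pairs; value may be None.
    Returns (status_code, message). Argument meanings are assumptions.
    """
    now = time.time() if now is None else now
    cutoff = now - window_seconds
    recent = [value for value, ts in datapoints
              if value is not None and ts >= cutoff]
    if not recent:
        return UNKNOWN, "no datapoints in the last %d seconds" % window_seconds
    average = sum(recent) / len(recent)
    if average >= crit:
        return CRITICAL, "average %.1f >= %s" % (average, crit)
    if average >= warn:
        return WARNING, "average %.1f >= %s" % (average, warn)
    return OK, "average %.1f" % average


if __name__ == "__main__":
    sample = [[700, 1000], [500, 1100], [None, 1150], [640, 1200]]
    status, message = check_threshold(sample, 600, 650, 300, now=1200)
    print(status, message)
```

A real Nagios plugin would print the message and exit with the status code; the sketch returns both so the logic can be exercised directly.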
  10. 28.

  11. 29.

  12. 31.

  13. 32.

    Creating a tool to visualize our performance alerts helped us develop well-tuned alerts.
  14. 38.

    Nagios Herald: a tool for adding context to Nagios alerts (github.com/etsy/nagios-herald)
  15. 50.
  16. 53.
  17. 55.
  18. 56.