Velocity NY - Crafting Performance Alerting Tools

The performance team at Etsy has recently developed a number of new alerting tools to help us discover and dig into performance regressions across the site. We built these new tools on top of existing technology (Nagios and Nagios Herald) to bring added context to site slowdowns and help us fix regressions more quickly.

These new alerts change the conversation with our coworkers. Before, we’d ask people about regressions long after they occurred. Now we discover regressions almost immediately and can share context, such as graphs and recent site changes, as we work with other teams to track them down and fix them.

Allison McKnight

October 13, 2015

Transcript

  2. 3.

    Allison McKnight | @aemcknig | #performance

    Performance at Etsy

    Lara, Natalya, Kristyn, Allison, Mike
  3. 6.

    Graphing performance

    Logster: github.com/etsy/logster

    Web logs: phptime:320 phptime:520 phptime:600 phptime:410 phptime:380 phptime:804

    count: 6
    average: 505
    median: 410
    95th perc: 804
  6. 13.

    Regression report

    • Didn’t catch small or slow-creep regressions
    • Difficult to tune
    • Additional investigation was required to verify and understand regressions
    • Alert fatigue
  7. 14.

    What could we do here?

    • Enforce better graph-watching during deploys
    • Change alerting mechanism
    • Create tools to help investigate regressions
    • Change alert format
  9. 26.

    define service {
        use                      graphite-service
        host_name                Performance
        service_description      Test home Performance
        check_command            check_graphite_data!700!900!300!'http://...'
        check_interval           1
        retry_interval           5
        max_check_attempts       10
        notification_interval    1440
        contact_groups           performance
    }
  13. 31.

    Creating a tool to visualize our performance alerts helped us develop well-tuned alerts.
  14. 36.

    Nagios Herald: a tool for adding context to Nagios alerts

    github.com/etsy/nagios-herald
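
The summary figures on the Logster slide (count, average, median, 95th percentile) can be reproduced from the six sample phptime values. The Python sketch below is illustrative only and is not Logster's code. The function name `summarize` is hypothetical, and the truncated mean, lower median, and nearest-rank percentile are assumptions chosen because they match the numbers shown on the slide.

```python
import math


def summarize(samples):
    """Return count, average, median and 95th percentile for a list of timings.

    Assumptions (not taken from Logster): the average is truncated to an
    integer, the median is the lower of the two middle values, and the
    percentile uses the nearest-rank method.
    """
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample")
    rank_95 = math.ceil(95 * n / 100)  # 1-based nearest rank
    return {
        "count": n,
        "average": sum(ordered) // n,
        "median": ordered[(n - 1) // 2],
        "p95": ordered[rank_95 - 1],
    }


# The phptime values from the web-log sample on the slide.
phptimes = [320, 520, 600, 410, 380, 804]
print(summarize(phptimes))
```

With only six samples the 95th percentile is simply the slowest request, which is one reason percentile graphs need a reasonable volume of traffic per interval to be meaningful.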
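
The `check_command` in the service definition passes several `!`-separated arguments to `check_graphite_data`. The transcript does not say what each argument means, so the sketch below rests on an assumption: that 700 and 900 are the warning and critical thresholds. The meaning of 300 is left open. Under that assumption, a threshold check might map a measured value to the standard Nagios plugin exit codes roughly as follows. The function name `classify`, the inclusive comparisons, and the sample values are hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warn, crit):
    """Map a measurement to a Nagios state, treating higher values as worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


# Assumed thresholds, read from check_graphite_data!700!900!...
for sample in (505, 750, 950):
    print(sample, classify(sample, warn=700, crit=900))
```

Keeping a warning band below the critical threshold is one way to surface slow-creep regressions before they become severe.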