Crafting performance alerting tools

From Velocity 2015, I bring you... Crafting performance alerting tools!

The performance team at Etsy has recently developed a number of new alerting tools to help us discover and dig into performance regressions across the site. We built these new tools on top of existing technology (Nagios, Nagios Herald, and Graphite) to bring added context to site slowdowns and help us fix regressions more quickly.

These new alerts change the conversation with our coworkers. Before, we’d ask people about regressions long after they occurred. Now, we discover regressions almost immediately, and are able to share context such as graphs and recent site changes as we work with other teams to track down and fix regressions.

In this talk, Allison McKnight, performance engineer at Etsy, will cover:

- How we created alerts for backend performance slowdowns
- How we iterated on adding context to those alerts, including experiment ramp-ups, the week-over-week state of our most popular and slowest pages, and better graphs
- How we built a dashboard for these alerts that highlights what’s currently an issue, allowing users to play with settings to dig into what is affected by the regression and to compare related pages
- How we built an IRC command to help us do this work alongside our daily chatter, which helps us ask our teammates in real-time for more context
- How good tools end up being contagious around a company
- The future of our alerts: alerting on performance wins and native app metrics; automatically including other teams in alerts about their pages

Open-source tools mentioned in this talk:

github.com/etsy/logster

github.com/etsy/nagios_tools

github.com/etsy/nagios-herald


Allison McKnight

May 29, 2015

Transcript

  1. Velocity Santa Clara – May 28, 2015
    @aemcknig
    Allison McKnight
    Crafting Performance Alerting Tools

  2. Agenda
    Here’s how it was
    Adding monitoring
    Iterating on alerting & tools
    Here’s how it is
    What’s next?
    Questions

  3. Performance at Etsy
    Kristyn, Allison, Lara, Natalya

  4. Graph everything.

  5. Graphing performance
    web logs
    phptime: 320 ms

  6. Graphing performance
    Logster (github.com/etsy/logster)
    web logs:
      phptime:320
      phptime:520
      phptime:600
      phptime:410
      phptime:380
      phptime:804
    count: 6
    average: 505
    median: 410
    95th perc: 804
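The summary figures on this slide are consistent with a truncated mean and nearest-rank percentiles over the six phptime samples. Below is a minimal Python sketch under that assumption. The names `nearest_rank` and `summarize` are hypothetical, and this is not Logster's actual parser code.

```python
import math

def nearest_rank(sorted_values, pct):
    """Return the nearest-rank percentile (pct in 0-100) of a sorted list."""
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]

def summarize(phptimes_ms):
    """Summarize one batch of backend timings (hypothetical helper)."""
    ordered = sorted(phptimes_ms)
    return {
        "count": len(ordered),
        "average": sum(ordered) // len(ordered),  # truncated to whole ms (assumption)
        "median": nearest_rank(ordered, 50),
        "95th_perc": nearest_rank(ordered, 95),
    }

samples = [320, 520, 600, 410, 380, 804]  # phptime values from the slide
summary = summarize(samples)
print(summary)
```

With the six samples from the slide, this reproduces the slide's count, average, median, and 95th percentile. Other percentile definitions (for example, interpolating between the two middle values) would give a different median.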

  7. Graphing performance
    backend time (ms)

  10. We needed monitoring

  13. Regression report
    • Didn’t catch small or slow-creep regressions
    • Difficult to tune
    • Additional investigation was required to verify and understand regressions
    • Alert fatigue

  14. What could we do here?
    • Enforce better graph-watching during deploys
    • Change alerting mechanism
    • Create tools to help investigate regressions
    • Change alert format

  16. Changing the alerting mechanism

  17. Monitoring page performance with Nagios

  18. Nagios: fast and fine-tuned alerting

  19. Nagios: check graphite data script
    github.com/etsy/nagios_tools

  20. Nagios: Individual check for each service you’d like to monitor

  21. Nagios: Individual thresholds for each page you’d like to monitor

  22. How do you choose thresholds for 40 pages?

  23. Creating tools

  25. Recommended thresholds

  27. Nagios service definition:
    define service {
       use                    graphite-service
       host_name              Performance
       service_description    Test shop_policy Performance
       check_command          check_graphite_data!600!650!300!'http://...'
       check_interval         1
       retry_interval         5
       max_check_attempts     10
       notification_interval  1440
       contact_groups         performance
    }
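As a rough reading of the timing settings in this service definition, the sketch below estimates how long a sustained regression takes to produce a notification. It assumes Nagios's default `interval_length` of 60 seconds, so the intervals are in minutes. It also assumes the usual soft/hard state behavior: the first non-OK result counts as attempt 1, and rechecks then run every `retry_interval` until `max_check_attempts` is reached. The function name is hypothetical.

```python
def minutes_to_first_notification(retry_interval, max_check_attempts):
    """Minutes from the first non-OK result to the hard state that notifies."""
    # The first failing check is attempt 1; the remaining attempts are retries.
    retries = max_check_attempts - 1
    return retries * retry_interval

# Values taken from the service definition on the slide.
delay = minutes_to_first_notification(retry_interval=5, max_check_attempts=10)
print(delay)  # 45

# notification_interval is also in minutes: 1440 minutes is one day,
# so an unresolved problem re-notifies at most once per day.
renotify_hours = 1440 / 60
print(renotify_hours)  # 24.0
```

Lowering `max_check_attempts` or `retry_interval` shortens the delay, but makes the alert more sensitive to brief spikes.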

  32. Creating a tool to visualize our performance alerts helped us develop well-tuned alerts.

  35. Current value: 768.5, warn threshold: 750.0, critical threshold: 800.0
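The alert text on this slide reports a value that sits between the two thresholds. Below is a minimal Python sketch of how a threshold check could map such a value to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function name, the inclusive comparisons, and the message format are illustrative assumptions, not the actual check_graphite_data script.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes

def classify(current, warn, crit):
    """Map a backend timing to a Nagios state (higher values are worse)."""
    if current >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif current >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    message = (f"{label} - Current value: {current}, "
               f"warn threshold: {warn}, critical threshold: {crit}")
    return state, message

# Values from the slide: above the warn threshold, below the critical one.
state, message = classify(768.5, 750.0, 800.0)
print(state, message)
```

A bare message like this tells the reader what state the check is in, but not which change caused it or how the page normally behaves.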

  36. We needed alerts that helped us understand the problem.

  37. Changing the alert format

  38. Nagios Herald: a tool for adding context to Nagios alerts
    github.com/etsy/nagios-herald

  45. With dependencies, we receive only actionable alerts

  46. Improving sleuthing tools

  50. Improved context and alerting tools led to easier and faster investigation

  51. #performance

  56. #payments

  57. Improved context and alerting tools improved cross-team collaboration

  58. What’s next?

  59. Alerting on improvements

  60. Adding more context (teams, recent commits, alert history)

  61. Better alert integration (IRC alerts, alert deployers)

  62. More comprehensive alerting (front-end, mobile, API)

  63. Use context to improve your tools

  64. Questions?
    Allison McKnight, @aemcknig
    www.etsy.com/shop/cateanevski
    Resources
    Open-source tools:
    github.com/etsy/logster
    github.com/etsy/nagios_tools
    github.com/etsy/nagios-herald
    Icons: www.endlessicons.com