Crafting performance alerting tools

From Velocity 2015, I bring you... Crafting performance alerting tools!

The performance team at Etsy has recently developed a number of new alerting tools to help us discover and dig into performance regressions across the site. We built these new tools on top of existing technology (Nagios, Nagios Herald, and Graphite) to bring added context to site slowdowns and help us fix regressions more quickly.

These new alerts change the conversation with our coworkers. Before, we’d ask people about regressions long after they occurred. Now, we discover regressions almost immediately, and are able to share context such as graphs and recent site changes as we work with other teams to track down and fix regressions.

In this talk, Allison McKnight, performance engineer at Etsy, will cover:

- How we created alerts for backend performance slowdowns
- How we iterated on adding context to those alerts, including experiment ramp-ups, the week-over-week state of our most popular and slowest pages, and better graphs
- How we built a dashboard for these alerts that highlights what’s currently an issue, allowing users to play with settings to dig into what is affected by the regression and to compare related pages
- How we built an IRC command to help us do this work alongside our daily chatter, which helps us ask our teammates in real time for more context
- How good tools end up being contagious around a company
- The future of our alerts: alerting on performance wins and native app metrics; automatically including other teams in alerts about their pages
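
To make the alerting mechanism concrete: one slide in the transcript shows a check reporting "Current value: 768.5, warn threshold: 750.0, critical threshold: 800.0". The following is a minimal Python sketch of that kind of threshold comparison, not the actual check script from github.com/etsy/nagios_tools. The function names, the averaging of recent datapoints, and the use of `>=` for the comparison are assumptions made for illustration; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
from typing import Optional, Sequence, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def average(datapoints: Sequence[Optional[float]]) -> Optional[float]:
    """Average the non-null datapoints; Graphite can return nulls for gaps."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return None
    return sum(values) / len(values)


def check_threshold(
    datapoints: Sequence[Optional[float]], warn: float, crit: float
) -> Tuple[int, str]:
    """Hypothetical check: compare the average backend time to two thresholds."""
    current = average(datapoints)
    if current is None:
        return UNKNOWN, "No datapoints returned"
    message = (
        f"Current value: {current}, warn threshold: {warn}, "
        f"critical threshold: {crit}"
    )
    # Assumption: a value equal to a threshold already trips it.
    if current >= crit:
        return CRITICAL, message
    if current >= warn:
        return WARNING, message
    return OK, message


if __name__ == "__main__":
    # Made-up datapoints whose average matches the value shown on the slide.
    status, text = check_threshold([760.0, None, 777.0], warn=750.0, crit=800.0)
    print(status, text)
```

With these made-up datapoints the average is 768.5, which falls between the two thresholds, so the sketch returns the WARNING status along with a message in the same shape as the one on the slide.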

Open-source tools mentioned in this talk:

github.com/etsy/logster

github.com/etsy/nagios_tools

github.com/etsy/nagios-herald


Allison McKnight

May 29, 2015

Transcript

  1. Velocity Santa Clara – May 28, 2015 @aemcknig Allison McKnight

    Crafting Performance Alerting Tools
  2. Agenda: Here’s how it was; Adding monitoring; Iterating on alerting & tools; Here’s how it is; What’s next?; Questions
  3. Performance at Etsy: Kristyn, Allison, Lara, Natalya
  4. Graph everything.

  5. Graphing performance: web logs, phptime: 320 ms
  6. Graphing performance with Logster (github.com/etsy/logster)

    Web logs: phptime:320 phptime:520 phptime:600 phptime:410 phptime:380 phptime:804
    Logster output: count: 6, average: 505, median: 410, 95th perc: 804
  7. Graphing performance: backend time (ms)

  10. We needed monitoring

  13. Regression report

    • Didn’t catch small or slow-creep regressions
    • Difficult to tune
    • Additional investigation was required to verify and understand regressions
    • Alert fatigue
  14. What could we do here?

    • Enforce better graph-watching during deploys
    • Change alerting mechanism
    • Create tools to help investigate regressions
    • Change alert format
  16. Changing the alerting mechanism

  17. Monitoring page performance with Nagios

  18. Nagios: fast and fine-tuned alerting

  19. Nagios: check graphite data script (github.com/etsy/nagios_tools)
  20. Nagios: individual check for each service you’d like to monitor
  21. Nagios: individual thresholds for each page you’d like to monitor
  22. How do you choose thresholds for 40 pages?
  23. Creating tools

  25. Recommended thresholds

  27. define service {
        use                    graphite-service
        host_name              Performance
        service_description    Test shop_policy Performance
        check_command          check_graphite_data!600!650!300!'http://...'
        check_interval         1
        retry_interval         5
        max_check_attempts     10
        notification_interval  1440
        contact_groups         performance
    }
  32. Creating a tool to visualize our performance alerts helped us develop well-tuned alerts.
  35. Current value: 768.5, warn threshold: 750.0, critical threshold: 800.0
  36. We needed alerts that helped us understand the problem.
  37. Changing the alert format

  38. Nagios Herald: a tool for adding context to Nagios alerts (github.com/etsy/nagios-herald)
  45. With dependencies, we receive only actionable alerts
  46. Improving sleuthing tools

  50. Improved context and alerting tools led to easier and faster investigation
  51. #performance

  52. #performance

  54. #performance

  56. #payments

  57. Improved context and alerting tools improved cross-team collaboration
  58. What’s next?

  59. Alerting on improvements

  60. Adding more context (teams, recent commits, alert history)
  61. Better alert integration (IRC alerts, alert deployers)
  62. More comprehensive alerting (front-end, mobile, API)
  63. Use context to improve your tools

  64. Questions?

    www.etsy.com/shop/cateanevski
    Resources
    Open-source tools: github.com/etsy/logster, github.com/etsy/nagios_tools, github.com/etsy/nagios-herald
    Icons: www.endlessicons.com
    @aemcknig, Allison McKnight
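
As an illustration of the aggregation shown on slide 6, here is a minimal Python sketch that pulls `phptime` values out of log lines and reduces them to a count, average, median, and 95th percentile. It is not Logster's parser API: the `phptime:<ms>` line format, the function names, the truncated average, and the nearest-rank percentile method are all assumptions, chosen because they reproduce the figures on the slide.

```python
import re
from typing import Dict, Iterable, List

# Assumed log format: each line carries a token such as "phptime:320".
PHPTIME_RE = re.compile(r"phptime:(\d+)")


def extract_phptimes(lines: Iterable[str]) -> List[int]:
    """Collect every phptime value (in ms) found in the given log lines."""
    return [int(m.group(1)) for line in lines for m in PHPTIME_RE.finditer(line)]


def nearest_rank(sorted_values: List[int], percent: int) -> int:
    """Nearest-rank percentile: the value at rank ceil(percent/100 * n)."""
    rank = (percent * len(sorted_values) + 99) // 100  # integer ceiling
    return sorted_values[max(rank, 1) - 1]


def summarize(values: List[int]) -> Dict[str, int]:
    """Reduce raw timings to the four numbers shown on the slide."""
    if not values:
        raise ValueError("no phptime values to summarize")
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "average": sum(ordered) // len(ordered),  # truncated to whole ms
        "median": nearest_rank(ordered, 50),
        "95th_perc": nearest_rank(ordered, 95),
    }


if __name__ == "__main__":
    # Hypothetical log lines carrying the six values from the slide.
    log_lines = [
        "GET /listing phptime:320",
        "GET /shop phptime:520",
        "GET /search phptime:600",
        "GET /listing phptime:410",
        "GET /cart phptime:380",
        "GET /shop phptime:804",
    ]
    print(summarize(extract_phptimes(log_lines)))
```

With the six values from the slide, this yields a count of 6, an average of 505, a median of 410, and a 95th percentile of 804. A conventional median (the mean of the two middle values) would give 465 instead, which is why the nearest-rank method is assumed here.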