Velocity NY - Crafting Performance Alerting Tools

The performance team at Etsy has recently developed a number of new alerting tools to help us discover and dig into performance regressions across the site. We built these new tools on top of existing technology (Nagios and Nagios Herald) to bring added context to site slowdowns and help us fix regressions more quickly.

These new alerts change the conversation with our coworkers. Before, we’d ask people about regressions long after they occurred. Now we discover regressions almost immediately and can share context, such as graphs and recent site changes, as we work with other teams to track down and fix them.

Allison McKnight

October 13, 2015

Transcript

  1. Velocity New York – October 13, 2015 @aemcknig Allison McKnight

    Crafting Performance Alerting Tools
  2. Agenda • Before monitoring • Adding monitoring • Iterating on alerting & tools • Here’s how it is • What’s next? • Questions
  3. Performance at Etsy (#performance): Lara, Natalya, Kristyn, Allison, Mike
  4. Graph everything.

  5. Graphing performance • web logs • phptime: 320 ms
  6. Graphing performance • web logs • Logster (github.com/etsy/logster)
    phptime:320 phptime:520 phptime:600 phptime:410 phptime:380 phptime:804
    count: 6
    average: 505
    median: 410
    95th perc: 804
  7. Graphing performance • backend time (ms)
  10. We needed monitoring

  13. Regression report
    • Didn’t catch small or slow-creep regressions
    • Difficult to tune
    • Additional investigation was required to verify and understand regressions
    • Alert fatigue
  14. What could we do here?
    • Enforce better graph-watching during deploys
    • Change alerting mechanism
    • Create tools to help investigate regressions
    • Change alert format
  16. Changing the alerting mechanism

  17. Monitoring page performance with Nagios
  18. Nagios: fast and fine-tuned alerting
  19. Nagios: check graphite data script (github.com/etsy/nagios_tools)
  20. Nagios: Individual check for each service you’d like to monitor
  21. Nagios: Individual thresholds for each page you’d like to monitor
  22. How do you choose thresholds for 40 pages?
  23. Creating tools
  25. recommended thresholds

  26. define service {
        use                    graphite-service
        host_name              Performance
        service_description    Test home Performance
        check_command          check_graphite_data!700!900!300!'http://...'
        check_interval         1
        retry_interval         5
        max_check_attempts     10
        notification_interval  1440
        contact_groups         performance
      }
  31. Creating a tool to visualize our performance alerts helped us develop well-tuned alerts.
  33. Current value: 768.5, warn threshold: 700.0, critical threshold: 1000.0
  34. We needed alerts that helped us understand the problem.
  35. Changing the alert format

  36. Nagios Herald: a tool for adding context to Nagios alerts (github.com/etsy/nagios-herald)
  42. With dependencies, we receive only actionable alerts
  43. Improving sleuthing tools

  44. top • actual thresholds
  45. hello
  50. Improved context and alerting tools led to easier and faster investigation
  51. #performance

  52. #performance

  53. #performance

  54. #performance

  55. None
  56. #performance

  57. None
  58. #performance

  59. None
  60. #payments

  61. Improved context and alerting tools improved cross-team collaboration
  62. What’s next?
  63. Alerting on improvements
  64. Adding more context (teams, alert history)
  65. Better alert integration (alert other teams, alert recent deployers)
  66. More comprehensive alerting (front-end, mobile, API)
  67. Use context to improve your tools

  68. Questions? @aemcknig Allison McKnight
    www.etsy.com/shop/cateanevski
    Resources
    Open-source tools:
    github.com/etsy/logster
    github.com/etsy/nagios_tools
    github.com/etsy/nagios-herald
    Icons: www.endlessicons.com
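
The summary on slide 6 and the threshold comparison on slide 33 can be sketched together in a few lines of Python. This is only an illustration, not the actual code of Logster or the check graphite data script. The log line layout and the function names (parse_phptimes, summarize, classify) are hypothetical. The truncated mean, lower median, and nearest-rank percentile are also assumptions, chosen because they reproduce the numbers on the slide; a conventional median of those six values would be 465.

```python
import math
import re
import statistics

# Hypothetical log lines: only the phptime:<ms> field mirrors slide 6,
# the rest of each line is made up for illustration.
LOG_LINES = [
    "GET /home phptime:320",
    "GET /home phptime:520",
    "GET /home phptime:600",
    "GET /home phptime:410",
    "GET /home phptime:380",
    "GET /home phptime:804",
]

PHPTIME_RE = re.compile(r"phptime:(\d+)")


def parse_phptimes(lines):
    """Extract backend times (ms) from log lines, skipping lines without one."""
    times = []
    for line in lines:
        match = PHPTIME_RE.search(line)
        if match:
            times.append(int(match.group(1)))
    return times


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(times):
    """Summarize timings the way slide 6 displays them (assumed conventions)."""
    return {
        "count": len(times),
        "average": sum(times) // len(times),      # truncated mean
        "median": statistics.median_low(times),   # lower of the two middle values
        "p95": nearest_rank_percentile(times, 95),
    }


def classify(value, warn, crit):
    """Map a metric value to a Nagios-style state name."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    stats = summarize(parse_phptimes(LOG_LINES))
    print(stats)  # {'count': 6, 'average': 505, 'median': 410, 'p95': 804}
    # Values from slide 33: current 768.5, warn 700.0, critical 1000.0
    print(classify(768.5, warn=700.0, crit=1000.0))  # WARNING
```

With the slide 33 values, the current value sits between the warn and critical thresholds, so the sketch reports a warning rather than a critical alert.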