Velocity NY - Crafting Performance Alerting Tools

Velocity New York – October 13, 2015 @aemcknig Allison McKnight
Crafting Performance Alerting Tools

Agenda
• Before monitoring
• Adding monitoring
• Iterating on alerting & tools
• Here’s how it is
• What’s next?
• Questions

Performance at Etsy (#performance): Lara, Natalya, Kristyn, Allison, Mike

Graph everything.

Graphing performance: web logs, phptime: 320 ms

Graphing performance with Logster (github.com/etsy/logster)
web logs: phptime:320 phptime:520 phptime:600 phptime:410 phptime:380 phptime:804
count: 6, average: 505, median: 410, 95th perc: 804
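The reduction on this slide, from raw phptime values to a handful of summary statistics, can be sketched roughly as follows. This is not Logster's actual parser code: the regular expression, the function name `summarize`, and the rounding choices are assumptions picked so that the result matches the numbers on the slide. In particular, the sketch takes the lower of the two middle values as the median (410); the conventional median of these six samples, which averages the two middle values, would be 465.

```python
import math
import re
import statistics

# Hypothetical pattern: the slide only shows tokens of the form "phptime:<ms>".
PHPTIME = re.compile(r"phptime:(\d+)")


def summarize(log_lines):
    """Collect phptime values from log lines and reduce them to summary stats."""
    values = sorted(
        int(match.group(1))
        for line in log_lines
        for match in PHPTIME.finditer(line)
    )
    if not values:
        return None
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * len(values))
    return {
        "count": len(values),
        "average": sum(values) // len(values),    # truncated to whole milliseconds
        "median": statistics.median_low(values),  # lower of the two middle values
        "p95": values[rank - 1],
    }


# The six samples shown on the slide, one per (made-up) log line.
lines = [
    "GET /a phptime:320",
    "GET /b phptime:520",
    "GET /c phptime:600",
    "GET /d phptime:410",
    "GET /e phptime:380",
    "GET /f phptime:804",
]
stats = summarize(lines)
print(stats)  # {'count': 6, 'average': 505, 'median': 410, 'p95': 804}
```

Summaries like these, emitted once per interval, are what end up as points on a graph.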

Graphing performance
[Graph: backend time (ms)]

We needed monitoring


Regression report
• Didn’t catch small or slow-creep regressions
• Difficult to tune
• Additional investigation was required to verify and understand regressions
• Alert fatigue

What could we do here?
• Enforce better graph-watching during deploys
• Change alerting mechanism
• Create tools to help investigate regressions
• Change alert format


Changing the alerting mechanism

Monitoring page performance with Nagios

Nagios: fast and fine-tuned alerting

Nagios: check graphite data script (github.com/etsy/nagios_tools)
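To make the idea of a Graphite-backed Nagios check concrete, here is a minimal sketch of the core logic: take the newest datapoint from a Graphite render response and map it to a Nagios plugin exit code. It is not the script from github.com/etsy/nagios_tools. The function names, the metric name `example.phptime`, and the assumption that higher values are worse are all illustrative; the JSON shape is the one Graphite's render API returns with `format=json`, and the message format mirrors the alert text shown later in the deck.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(render_json):
    """Return the newest non-null datapoint from a Graphite render (format=json) body."""
    series = json.loads(render_json)
    datapoints = series[0]["datapoints"] if series else []
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def check(value, warn, crit):
    """Compare a value to thresholds, assuming higher is worse."""
    if value is None:
        return UNKNOWN, "No data returned from Graphite"
    message = (
        f"Current value: {value}, warn threshold: {warn}, "
        f"critical threshold: {crit}"
    )
    if value >= crit:
        return CRITICAL, message
    if value >= warn:
        return WARNING, message
    return OK, message


# Hypothetical response body; a real check would fetch this over HTTP.
body = (
    '[{"target": "example.phptime", "datapoints": '
    "[[512.0, 1444700000], [768.5, 1444700060], [null, 1444700120]]}]"
)
code, text = check(latest_value(body), warn=700.0, crit=1000.0)
print(code, text)
```

A real plugin would print the message and exit with `code`, which is how Nagios learns the state of the service.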

Nagios: an individual check for each service you’d like to monitor

Nagios: individual thresholds for each page you’d like to monitor

How do you choose thresholds for 40 pages?

Creating tools

Recommended thresholds

define service {
    use                   graphite-service
    host_name             Performance
    service_description   Test home Performance
    check_command         check_graphite_data!700!900!300!'http://...'
    check_interval        1
    retry_interval        5
    max_check_attempts    10
    notification_interval 1440
    contact_groups        performance
}
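With one service definition per page and dozens of pages, writing each block by hand gets tedious. The sketch below shows one way such definitions could be generated from a table of per-page check arguments. The deck does not say whether or how Etsy generated its configuration, so the function names and the second page entry are hypothetical. The check arguments are passed through as opaque strings because the slide does not label them, and the URL stays elided as it is on the slide.

```python
# Field values copied from the service definition on the slide.
SERVICE_TEMPLATE = """define service {{
    use                   graphite-service
    host_name             Performance
    service_description   {description}
    check_command         check_graphite_data!{args}
    check_interval        1
    retry_interval        5
    max_check_attempts    10
    notification_interval 1440
    contact_groups        performance
}}"""


def render_service(description, check_args):
    """Render one service block; check_args are joined with '!' untouched."""
    return SERVICE_TEMPLATE.format(
        description=description, args="!".join(check_args)
    )


def render_all(pages):
    """Render one service block per page, separated by blank lines."""
    return "\n\n".join(
        render_service(name, args) for name, args in pages.items()
    )


# The first entry mirrors the slide; the second is made up for illustration.
PAGES = {
    "Test home Performance": ["700", "900", "300", "'http://...'"],
    "Example search Performance": ["800", "1100", "300", "'http://...'"],
}

config = render_all(PAGES)
print(config)
```

Keeping the per-page values in one table also gives a single place to review and tune thresholds.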

Creating a tool to visualize our performance alerts helped us develop well-tuned alerts.

Current value: 768.5, warn threshold: 700.0, critical threshold: 1000.0

We needed alerts that helped us understand the problem.

Changing the alert format

Nagios Herald: a tool for adding context to Nagios alerts (github.com/etsy/nagios-herald)



With dependencies, we receive only actionable alerts.
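The effect of dependencies can be illustrated with a small sketch: a failing check produces an alert only when everything it depends on is healthy, so one upstream failure does not fan out into a page-by-page flood. Nagios expresses this declaratively through service dependency objects rather than code like this, and the slide does not say what the performance checks depend on, so the service names and the `depends_on` mapping below are hypothetical.

```python
FAILURE_STATES = {"WARNING", "CRITICAL", "UNKNOWN"}


def actionable_alerts(states, depends_on):
    """Return failing services whose dependencies are all healthy."""
    alerts = []
    for service, state in states.items():
        if state not in FAILURE_STATES:
            continue
        parents = depends_on.get(service, [])
        # A failing (or unreported) parent suppresses the dependent alert.
        if any(states.get(p, "UNKNOWN") in FAILURE_STATES for p in parents):
            continue
        alerts.append(service)
    return sorted(alerts)


# Hypothetical setup: page checks depend on a health check of the metrics source.
depends_on = {
    "Home page performance": ["Metrics source health"],
    "Search page performance": ["Metrics source health"],
}

outage = {
    "Metrics source health": "CRITICAL",
    "Home page performance": "CRITICAL",
    "Search page performance": "WARNING",
}
regression = {
    "Metrics source health": "OK",
    "Home page performance": "CRITICAL",
    "Search page performance": "OK",
}

print(actionable_alerts(outage, depends_on))      # ['Metrics source health']
print(actionable_alerts(regression, depends_on))  # ['Home page performance']
```

In the first case only the upstream failure is reported; in the second, the page regression is the actionable alert.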

Improving sleuthing tools

Actual thresholds


Improved context and alerting tools led to easier and faster investigation

#performance

#payments

Improved context and alerting tools improved cross-team collaboration

What’s next?

Alerting on improvements

Adding more context (teams, alert history)

Better alert integration (alert other teams, alert recent deployers)

More comprehensive alerting (front-end, mobile, API)

Use context to improve your tools

Questions?
www.etsy.com/shop/cateanevski
@aemcknig Allison McKnight

Resources
Open-source tools:
• github.com/etsy/logster
• github.com/etsy/nagios_tools
• github.com/etsy/nagios-herald
Icons: www.endlessicons.com
