Crafting performance alerting tools

Allison McKnight (@aemcknig)
Velocity Santa Clara – May 28, 2015

Agenda
• Here’s how it was
• Adding monitoring
• Iterating on alerting & tools
• Here’s how it is
• What’s next?
• Questions

Performance at Etsy: Kristyn, Allison, Lara, Natalya

Graph everything.

Graphing performance: web logs record phptime (320 ms)

Graphing performance: Logster (github.com/etsy/logster)
web logs: phptime:320 phptime:520 phptime:600 phptime:410 phptime:380 phptime:804
Logster output: count: 6, average: 505, median: 410, 95th perc: 804
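The aggregation on this slide can be sketched in a few lines of Python. This is not Logster's code: the function names and the `phptime:<ms>` pattern are assumptions for illustration, and the nearest-rank percentile with a truncated mean is simply one convention that happens to reproduce the figures shown.

```python
import math
import re

# Assumed log format: each line carries a token such as "phptime:320" (ms).
PHPTIME_RE = re.compile(r"phptime:(\d+)")


def nearest_rank(sorted_values, pct):
    """Return the nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(log_lines):
    """Aggregate phptime values from log lines into summary metrics."""
    values = sorted(
        int(match.group(1))
        for line in log_lines
        for match in PHPTIME_RE.finditer(line)
    )
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "average": sum(values) // len(values),  # truncated to whole ms
        "median": nearest_rank(values, 50),
        "p95": nearest_rank(values, 95),
    }


sample = ["phptime:320", "phptime:520", "phptime:600",
          "phptime:410", "phptime:380", "phptime:804"]
print(summarize(sample))
```

For the six sample values this yields count 6, average 505, median 410 and a 95th percentile of 804. An interpolating median would give 465 instead, so the choice of convention matters when comparing graphs.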

Graphing performance [graph: backend time (ms)]

We needed monitoring

Regression report
• Didn’t catch small or slow-creep regressions
• Difficult to tune
• Additional investigation was required to verify and understand regressions
• Alert fatigue

What could we do here?
• Enforce better graph-watching during deploys
• Change alerting mechanism
• Create tools to help investigate regressions
• Change alert format

Changing the alerting mechanism

Monitoring page performance with Nagios

Nagios: fast and fine-tuned alerting

Nagios: check graphite data script (github.com/etsy/nagios_tools)
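To illustrate the idea of a Nagios check driven by Graphite data, here is a minimal sketch. It is not the script from nagios_tools: the function names and the sample metric name are hypothetical, and the input is assumed to be already-parsed JSON in the shape Graphite's render endpoint returns (a list of series, each with `[value, timestamp]` datapoints). The exit codes are the standard Nagios plugin codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(render_json):
    """Return the most recent non-null value from Graphite render JSON."""
    for series in render_json:
        for value, _timestamp in reversed(series["datapoints"]):
            if value is not None:
                return value
    return None


def check(render_json, warn, crit):
    """Map the latest datapoint to an (exit_code, message) pair."""
    value = latest_value(render_json)
    if value is None:
        return UNKNOWN, "UNKNOWN: no datapoints returned"
    detail = (f"Current value: {value}, warn threshold: {warn}, "
              f"critical threshold: {crit}")
    if value >= crit:
        return CRITICAL, "CRITICAL: " + detail
    if value >= warn:
        return WARNING, "WARNING: " + detail
    return OK, "OK: " + detail


# Hypothetical response; the metric name and timestamps are made up.
response = [{
    "target": "hypothetical.page.phptime.p95",
    "datapoints": [[702.0, 1432800000], [768.5, 1432800060], [None, 1432800120]],
}]
code, message = check(response, warn=750.0, crit=800.0)
print(code, message)
```

A real plugin would fetch the JSON over HTTP, print the message and call `sys.exit(code)` so that Nagios can read the state from the process exit status.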

Nagios: individual check for each service you’d like to monitor

Nagios: individual thresholds for each page you’d like to monitor

How do you choose thresholds for 40 pages?

Creating tools

Recommended thresholds
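The deck does not say how the recommended thresholds were derived. As one possible approach, the sketch below suggests warn and critical values from a page's recent history by taking a high percentile and adding headroom. The function names, the choice of the 95th percentile and the 10% and 20% headroom factors are all assumptions for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_thresholds(history, warn_headroom=1.10, crit_headroom=1.20):
    """Suggest (warn, crit) as the historical p95 plus assumed headroom."""
    baseline = percentile(history, 95)
    warn = round(baseline * warn_headroom, 1)
    crit = round(baseline * crit_headroom, 1)
    return warn, crit


# Hypothetical history of one page's timing metric, in ms.
history = [480, 495, 500, 505, 510, 512, 520, 530, 498, 507]
warn, crit = recommend_thresholds(history)
print(f"warn={warn} crit={crit}")
```

Running the same function over every monitored page gives a starting point that a person can then adjust, which scales better than hand-picking thresholds for 40 pages.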

define service {
    use                   graphite-service
    host_name             Performance
    service_description   Test shop_policy Performance
    check_command         check_graphite_data!600!650!300!'http://...'
    check_interval        1
    retry_interval        5
    max_check_attempts    10
    notification_interval 1440
    contact_groups        performance
}
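With one service definition per page, writing the blocks by hand gets tedious. The following sketch renders a definition like the one above from a per-page table. The template mirrors the fields on the slide; the page names other than shop_policy, the argument values for them, and the helper name `render_service` are hypothetical. The check arguments are passed through untouched because the deck does not state what each position means, and the URL is left elided as in the original.

```python
SERVICE_TEMPLATE = """define service {{
    use                   graphite-service
    host_name             Performance
    service_description   {page} Performance
    check_command         check_graphite_data!{args}
    check_interval        1
    retry_interval        5
    max_check_attempts    10
    notification_interval 1440
    contact_groups        performance
}}"""


def render_service(page, check_args):
    """Render one service block; check_args are joined with '!' as-is."""
    joined = "!".join(str(arg) for arg in check_args)
    return SERVICE_TEMPLATE.format(page=page, args=joined)


# Hypothetical table of pages; only the shop_policy values come from the slide.
pages = {
    "shop_policy": [600, 650, 300, "'http://...'"],
    "listing": [400, 450, 300, "'http://...'"],
}
config = "\n\n".join(render_service(page, args) for page, args in pages.items())
print(config)
```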

Creating a tool to visualize our performance alerts helped us develop well-tuned alerts.

Current value: 768.5, warn threshold: 750.0, critical threshold: 800.0

We needed alerts that helped us understand the problem.

Changing the alert format

Nagios Herald: a tool for adding context to Nagios alerts (github.com/etsy/nagios-herald)

With dependencies, we receive only actionable alerts
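The effect of dependencies can be sketched as a filter: a failing check only produces an alert when none of the checks it depends on is failing too. Nagios has its own configuration for service dependencies; the Python below only illustrates the logic, and the check names and the `actionable_alerts` helper are hypothetical.

```python
def actionable_alerts(states, depends_on):
    """Return failing checks whose direct dependencies are all healthy."""
    failing = {name for name, healthy in states.items() if not healthy}
    return sorted(
        name for name in failing
        if not any(parent in failing for parent in depends_on.get(name, []))
    )


# Hypothetical checks: page alerts depend on the metrics pipeline being fresh.
states = {
    "graphite_data_fresh": False,
    "shop_policy_perf": False,
    "listing_perf": False,
    "search_perf": True,
}
depends_on = {
    "shop_policy_perf": ["graphite_data_fresh"],
    "listing_perf": ["graphite_data_fresh"],
    "search_perf": ["graphite_data_fresh"],
}
print(actionable_alerts(states, depends_on))
```

Here only the pipeline check is reported, because the two failing page checks are explained by their failing parent. The sketch looks at direct dependencies only; chains of dependencies would need a recursive walk.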

Improving sleuthing tools

Improved context and alerting tools led to easier and faster investigation

#performance

#payments

Improved context and alerting tools improved cross-team collaboration

What’s next?

Alerting on improvements

Adding more context (teams, recent commits, alert history)

Better alert integration (IRC alerts, alert deployers)

More comprehensive alerting (front-end, mobile, API)

Use context to improve your tools

Questions?
Allison McKnight (@aemcknig)
www.etsy.com/shop/cateanevski

Resources
Open-source tools:
• github.com/etsy/logster
• github.com/etsy/nagios_tools
• github.com/etsy/nagios-herald
Icons: www.endlessicons.com
