Slide 1

Slide 1 text

Crafting Performance Alerting Tools
Allison McKnight (@aemcknig)
Velocity New York – October 13, 2015

Slide 2

Slide 2 text

Agenda
• Before monitoring
• Adding monitoring
• Iterating on alerting & tools
• Here’s how it is
• What’s next?
• Questions

Slide 3

Slide 3 text

Performance at Etsy
Lara, Natalya, Kristyn, Allison, Mike

Slide 4

Slide 4 text

Graph everything.

Slide 5

Slide 5 text

Graphing performance
web logs
phptime: 320 ms

Slide 6

Slide 6 text

Graphing performance
Logster: github.com/etsy/logster
web logs: phptime:320 phptime:520 phptime:600 phptime:410 phptime:380 phptime:804
count: 6
average: 505
median: 410
95th perc: 804
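The aggregation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Logster's actual parser code: the log line layout, the function names, and the choice of nearest-rank percentiles with integer averaging are all assumptions, picked because they reproduce the numbers shown above.

```python
import math
import re

# Hypothetical log lines; only the "phptime:<ms>" token comes from the slide.
LOG_LINES = [
    "GET /home phptime:320",
    "GET /home phptime:520",
    "GET /home phptime:600",
    "GET /home phptime:410",
    "GET /home phptime:380",
    "GET /home phptime:804",
]

PHPTIME_RE = re.compile(r"phptime:(\d+)")


def extract_phptimes(lines):
    """Pull the phptime value (in ms) out of each line that has one."""
    times = []
    for line in lines:
        match = PHPTIME_RE.search(line)
        if match:
            times.append(int(match.group(1)))
    return times


def nearest_rank_percentile(sorted_values, percent):
    """Nearest-rank percentile (an assumed method, not confirmed by the slides)."""
    rank = math.ceil(percent * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(times):
    """Compute the four statistics listed on the slide."""
    if not times:
        raise ValueError("no phptime values found")
    ordered = sorted(times)
    return {
        "count": len(ordered),
        "average": sum(ordered) // len(ordered),  # integer ms, rounded down
        "median": nearest_rank_percentile(ordered, 50),
        "95th_perc": nearest_rank_percentile(ordered, 95),
    }


summary = summarize(extract_phptimes(LOG_LINES))
print(summary)
```

With only six samples the 95th percentile is simply the slowest request, which is one reason these statistics are computed over every request in the log interval rather than a handful.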

Slide 7

Slide 7 text

Graphing performance
backend time (ms)

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

We needed monitoring

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Regression report
• Didn’t catch small or slow-creep regressions
• Difficult to tune
• Additional investigation was required to verify and understand regressions
• Alert fatigue

Slide 14

Slide 14 text

What could we do here?
• Enforce better graph-watching during deploys
• Change alerting mechanism
• Create tools to help investigate regressions
• Change alert format

Slide 15

Slide 15 text

What could we do here?
• Enforce better graph-watching during deploys
• Change alerting mechanism
• Create tools to help investigate regressions
• Change alert format

Slide 16

Slide 16 text

Changing the alerting mechanism

Slide 17

Slide 17 text

Monitoring page performance with Nagios

Slide 18

Slide 18 text

Nagios
fast and fine-tuned alerting

Slide 19

Slide 19 text

Nagios
check graphite data script
github.com/etsy/nagios_tools

Slide 20

Slide 20 text

Nagios
Individual check for each service you’d like to monitor

Slide 21

Slide 21 text

Nagios
Individual thresholds for each page you’d like to monitor

Slide 22

Slide 22 text

How do you choose thresholds for 40 pages?

Slide 23

Slide 23 text

Creating tools

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

recommended thresholds

Slide 26

Slide 26 text

define service {
    use                     graphite-service
    host_name               Performance
    service_description     Test home Performance
    check_command           check_graphite_data!700!900!300!'http://...'
    check_interval          1
    retry_interval          5
    max_check_attempts      10
    notification_interval   1440
    contact_groups          performance
}

Slide 27

Slide 27 text

define service {
    use                     graphite-service
    host_name               Performance
    service_description     Test home Performance
    check_command           check_graphite_data!700!900!300!'http://...'
    check_interval          1
    retry_interval          5
    max_check_attempts      10
    notification_interval   1440
    contact_groups          performance
}

Slide 28

Slide 28 text

define service {
    use                     graphite-service
    host_name               Performance
    service_description     Test home Performance
    check_command           check_graphite_data!700!900!300!'http://...'
    check_interval          1
    retry_interval          5
    max_check_attempts      10
    notification_interval   1440
    contact_groups          performance
}

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

define service {
    use                     graphite-service
    host_name               Performance
    service_description     Test home Performance
    check_command           check_graphite_data!700!900!300!'http://...'
    check_interval          1
    retry_interval          5
    max_check_attempts      10
    notification_interval   1440
    contact_groups          performance
}

Slide 31

Slide 31 text

Creating a tool to visualize our performance alerts helped us develop well-tuned alerts.

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Current value: 768.5, warn threshold: 700.0, critical threshold: 1000.0
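To make the alert text above concrete, here is a minimal Python sketch of a warn/critical threshold check. It is not the check_graphite_data script from nagios_tools: the function name, the ">=" comparisons, and the status prefix in the message are assumptions. The exit codes (0 OK, 1 WARNING, 2 CRITICAL) are the standard Nagios plugin return codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_threshold(value, warn, crit):
    """Compare a metric value against warn/critical thresholds.

    Hypothetical helper: assumes higher values are worse and that a value
    equal to a threshold already triggers that state.
    """
    if value >= crit:
        status, label = CRITICAL, "CRITICAL"
    elif value >= warn:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    message = (
        f"{label} - Current value: {value}, "
        f"warn threshold: {warn}, critical threshold: {crit}"
    )
    return status, message


status, message = check_threshold(768.5, 700.0, 1000.0)
print(message)
```

An alert like this reports that a threshold was crossed but nothing about when the slowdown began or what changed, which is the gap the next slides address.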

Slide 34

Slide 34 text

We needed alerts that helped us understand the problem.

Slide 35

Slide 35 text

Changing the alert format

Slide 36

Slide 36 text

Nagios Herald
a tool for adding context to Nagios alerts
github.com/etsy/nagios-herald

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

With dependencies, we receive only actionable alerts

Slide 43

Slide 43 text

Improving sleuthing tools

Slide 44

Slide 44 text

top
actual thresholds

Slide 45

Slide 45 text

hello

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Improved context and alerting tools led to easier and faster investigation

Slide 51

Slide 51 text

#performance

Slide 52

Slide 52 text

#performance

Slide 53

Slide 53 text

#performance

Slide 54

Slide 54 text

#performance

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

#performance

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

#performance

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

#payments

Slide 61

Slide 61 text

Improved context and alerting tools improved cross-team collaboration

Slide 62

Slide 62 text

What’s next?

Slide 63

Slide 63 text

Alerting on improvements

Slide 64

Slide 64 text

Adding more context (teams, alert history)

Slide 65

Slide 65 text

Better alert integration (alert other teams, alert recent deployers)

Slide 66

Slide 66 text

More comprehensive alerting (front-end, mobile, API)

Slide 67

Slide 67 text

Use context to improve your tools

Slide 68

Slide 68 text

Questions?
Allison McKnight, @aemcknig
www.etsy.com/shop/cateanevski
Resources
Open-source tools:
 github.com/etsy/logster
 github.com/etsy/nagios_tools
 github.com/etsy/nagios-herald
Icons: www.endlessicons.com