Slide 1

Slide 1 text

It’s not in production unless it’s monitored. Wednesday, April 25, 2012

Slide 2

Slide 2 text

‣ @josephruscio
‣ Co-Founder/CTO Librato
‣ I <3 graphs

Slide 3

Slide 3 text


Slide 4

Slide 4 text

SaaS 2002
‣ Seed Round: $1.5M USD
‣ Infrastructure: CAPEX
‣ Dedicated Ops Team
‣ Custom Software Stack

Slide 5

Slide 5 text

SaaS 2012
‣ Seed Round: $20K USD
‣ Infrastructure: OPEX
‣ <=1 Ops Person
‣ OSS, External Services

Slide 6

Slide 6 text

‣ agile infrastructure
‣ ephemeral infrastructure
‣ more change, worse tools!

Slide 7

Slide 7 text

Continuous Deployment
‣ continuous integration
‣ one-click deploy
‣ feature-flagging
‣ monitoring
‣ alerting

Slide 8

Slide 8 text


Slide 9

Slide 9 text

Chunky Bacon!!

Slide 10

Slide 10 text

Graphite, StatsD, OpenTSDB, Cube, d3.js

Slide 11

Slide 11 text

Anti-Pattern

• Custom Stats
• MySQL threads
• VMstat
• ....
Storage

• CPU
• Interface
• Memory
• Ping
• Battery charge
• ....
Storage

• Ping
• CPU
• Memory
• Disks
• SNMP Service
• ....
Storage

...

Nagios, Ganglia, RRD/Cacti

Slide 12

Slide 12 text

#monitoringsucks

Slide 13

Slide 13 text

We need a better model

Slide 14

Slide 14 text

Metrics
‣ business drivers
‣ application performance
‣ system resources
‣ network

Slide 15

Slide 15 text

Collection Storage Aggregation Analysis

Slide 16

Slide 16 text

Separation of Concerns

Slide 17

Slide 17 text

Collection

Slide 18

Slide 18 text

Logging
‣ etsy/logster
‣ logstash/logstash
‣ Papertrail et al.

Slide 19

Slide 19 text

AS::Notifications
‣ pub/sub instrumentation
‣ mattmatt/lograge
‣ twinturbo/harness
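The pub/sub instrumentation idea can be sketched outside Rails as well. The Python below is only an analogy, not the ActiveSupport::Notifications API: `subscribe`, `instrument`, and the `render.page` event name are hypothetical names chosen for illustration. Application code publishes timed events, and any number of subscribers (loggers, metric reporters) consume them without the application knowing who is listening.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-process event bus; the names are illustrative, not a real library API.
_subscribers = defaultdict(list)

def subscribe(event_name, callback):
    """Register callback(name, duration_seconds, payload) for an event name."""
    _subscribers[event_name].append(callback)

@contextmanager
def instrument(event_name, payload=None):
    """Time the wrapped block, then publish the result to every subscriber."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration = time.perf_counter() - start
        for callback in _subscribers[event_name]:
            callback(event_name, duration, payload or {})

# A subscriber that simply records events; a real one might log or emit a metric.
events = []
subscribe("render.page", lambda name, duration, payload: events.append((name, duration, payload)))

with instrument("render.page", {"path": "/home"}):
    sum(range(1000))  # stand-in for real work
```

The instrumented code stays unaware of its consumers, which is the separation of concerns the talk argues for.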

Slide 20

Slide 20 text

eric/metriks
‣ Ruby instrumentation
‣ counters, meters, timers
‣ multiple reporters

Slide 21

Slide 21 text

Aggregation

Slide 22

Slide 22 text

etsy/statsd
‣ ~319 SLOC Node.js
‣ counters, timers, gauges
‣ UDP
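A minimal sketch of what a StatsD client puts on the wire, assuming the usual `<name>:<value>|<type>` text format with `c` for counters, `ms` for timers, and `g` for gauges. The metric names are made up, and a local socket stands in for the StatsD server so the example runs on its own.

```python
import socket

def format_metric(name, value, metric_type):
    """Build a StatsD payload: '<name>:<value>|<type>'."""
    return f"{name}:{value}|{metric_type}".encode("ascii")

# Stand-in receiver so the example runs without a StatsD server.
# A real deployment would send to the StatsD host, by default on UDP port 8125.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
receiver.settimeout(2.0)
address = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payloads = [
    format_metric("signups", 1, "c"),         # counter: increment by 1
    format_metric("render_time", 320, "ms"),  # timer: 320 milliseconds
    format_metric("queue_depth", 17, "g"),    # gauge: current value
]

received = []
for payload in payloads:
    sender.sendto(payload, address)  # fire-and-forget, no reply expected
    received.append(receiver.recvfrom(1024)[0])

sender.close()
receiver.close()
```

Because UDP is connectionless, the application never waits on the metrics server, so a slow or missing aggregator costs at most some lost samples.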

Slide 23

Slide 23 text

Nginx Unicorn StatsD Front-End

Slide 24

Slide 24 text

StatsD Clients
‣ zebrafishlabs/nginx-statsd
‣ github/rack-statsd
‣ shopify/statsd-instrument

Slide 25

Slide 25 text

StatsD Servers

Slide 26

Slide 26 text

Storage

Slide 27

Slide 27 text

RRDTool
‣ Round-Robin Database Tool
‣ constant storage size
‣ rollups
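The constant-size and rollup properties can be illustrated with a toy ring buffer. This is a sketch of the idea only, not RRDTool's file format or API; the class name `RoundRobinArchive` and its parameters are hypothetical. Each archive has a fixed number of slots, and a coarser archive averages several raw samples into one slot.

```python
class RoundRobinArchive:
    """Fixed-size ring of slots; each slot holds the average of `steps_per_slot` samples."""

    def __init__(self, slots, steps_per_slot):
        self.values = [None] * slots  # storage never grows beyond this
        self.steps_per_slot = steps_per_slot
        self.pending = []             # raw samples waiting to be rolled up
        self.next_slot = 0

    def update(self, sample):
        self.pending.append(sample)
        if len(self.pending) == self.steps_per_slot:
            # Roll up the pending samples and overwrite the oldest slot.
            self.values[self.next_slot] = sum(self.pending) / len(self.pending)
            self.next_slot = (self.next_slot + 1) % len(self.values)
            self.pending = []

    def ordered(self):
        """Return stored averages oldest first, skipping never-written slots."""
        ring = self.values[self.next_slot:] + self.values[:self.next_slot]
        return [v for v in ring if v is not None]

fine = RoundRobinArchive(slots=4, steps_per_slot=1)    # full resolution, short history
coarse = RoundRobinArchive(slots=4, steps_per_slot=2)  # half resolution, longer history

for sample in range(1, 11):  # samples 1..10
    fine.update(sample)
    coarse.update(sample)

print(fine.ordered())    # the last four raw samples
print(coarse.ordered())  # the last four two-sample averages
```

Both archives hold exactly four slots no matter how many samples arrive; the coarse one simply covers twice the time span at half the resolution.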

Slide 28

Slide 28 text

Graphite
‣ Whisper RRD
‣ flat.hierarchical.namespace
‣ HTTP queries
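As a rough illustration of the dotted namespace and HTTP queries, the sketch below builds a line in Graphite's plaintext format (`path value timestamp`) and a render URL with a wildcard target. The host `graphite.example.com` and the metric path are placeholders, and nothing is sent over the network.

```python
from urllib.parse import urlencode

def plaintext_line(path, value, timestamp):
    """One datapoint in Graphite's plaintext format: 'path value timestamp'."""
    return f"{path} {value} {timestamp}\n"

def render_url(base_url, target, start="-1h"):
    """Build a render query; the target may contain wildcards across the hierarchy."""
    query = urlencode({"target": target, "from": start, "format": "json"})
    return f"{base_url}/render?{query}"

# Placeholder metric path: environment, host, service, measurement.
line = plaintext_line("prod.web01.nginx.requests", 42, 1335312000)

# Query the same measurement across every host under 'prod'.
url = render_url("http://graphite.example.com", "prod.*.nginx.requests")
```

The hierarchy lives entirely in the metric name, so the naming scheme decides which wildcard queries are possible later.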

Slide 29

Slide 29 text

OpenTSDB
‣ HBase
‣ multiple dimensions
‣ HTTP queries
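"Multiple dimensions" here means tags attached to a metric rather than a hierarchy encoded in its name. The sketch below formats a datapoint in the style of OpenTSDB's text `put` command; the metric name and tag values are invented for illustration.

```python
def put_line(metric, timestamp, value, tags):
    """Format a datapoint as 'put <metric> <timestamp> <value> <tagk=tagv> ...'."""
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_text}"

# Host and environment are separate dimensions, not positions in a dotted path.
line = put_line("nginx.requests", 1335312000, 42, {"host": "web01", "env": "prod"})
print(line)
```

Since each tag is an independent dimension, a query can group or filter by host, by environment, or by both without depending on how the name was laid out.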

Slide 30

Slide 30 text

SaaS
‣ Librato Metrics et al.
‣ JSON over HTTP
‣ rollups
‣ interactive front-ends

Slide 31

Slide 31 text

Visualization

Slide 32

Slide 32 text

Correlation
‣ metrics
‣ annotations
‣ arbitrary combinations

Slide 33

Slide 33 text


Slide 34

Slide 34 text


Slide 35

Slide 35 text

Dashboards
‣ shared understanding
‣ aberration detection
‣ fire-fighting manual

Slide 36

Slide 36 text


Slide 37

Slide 37 text


Slide 38

Slide 38 text

Alerting

Slide 39

Slide 39 text

Tuning Alerts
‣ trigger threshold
‣ cancel threshold
‣ re-arm window
‣ function
‣ window
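One way these five knobs could fit together is sketched below. The slides do not define the semantics, so this is an interpretation: the function summarizes the last `window` samples, the alert fires when the summary reaches the trigger threshold, clears only once it falls to the lower cancel threshold, and re-notifies while still active once the re-arm window has elapsed. `ThresholdAlert` and its parameter names are hypothetical.

```python
from collections import deque

class ThresholdAlert:
    def __init__(self, trigger, cancel, window, function, rearm_seconds):
        self.trigger = trigger                # fire at or above this summary value
        self.cancel = cancel                  # clear at or below this summary value
        self.samples = deque(maxlen=window)   # sliding window of recent samples
        self.function = function              # summary over the window, e.g. mean or max
        self.rearm_seconds = rearm_seconds    # minimum gap between notifications
        self.active = False
        self.last_notified = None

    def observe(self, timestamp, value):
        """Feed one sample; return "trigger", "cancel", or None."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return None  # not enough data to fill the window yet
        summary = self.function(self.samples)
        if not self.active and summary >= self.trigger:
            self.active = True
            self.last_notified = timestamp
            return "trigger"
        if self.active and summary <= self.cancel:
            self.active = False
            return "cancel"
        if self.active and timestamp - self.last_notified >= self.rearm_seconds:
            self.last_notified = timestamp
            return "trigger"  # still bad after the re-arm window: notify again
        return None

def mean(samples):
    return sum(samples) / len(samples)

alert = ThresholdAlert(trigger=90, cancel=70, window=3, function=mean, rearm_seconds=120)
readings = [(0, 95), (60, 95), (120, 95), (180, 95), (240, 95), (300, 60), (360, 50), (420, 40)]
events = [alert.observe(timestamp, value) for timestamp, value in readings]
```

The gap between the trigger and cancel thresholds keeps a metric hovering near the limit from flapping, and averaging over a window ignores single-sample spikes.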

Slide 40

Slide 40 text


Slide 41

Slide 41 text

Aberrant Behavior

Slide 42

Slide 42 text

‣ separation of concerns
‣ monitoring == tests
‣ arbitrary correlations
‣ dashboards
‣ living alerts

Slide 43

Slide 43 text

fin