Slide 1

Slide 1 text

harnessing the power of the eye to help direct troubleshooting efforts across a distributed & service-oriented architecture
stopwatch
(will have a better name before open-sourcing. maybe Hermes or BatmanSonar or something)

Slide 2

Slide 2 text

Matthew Lyon
@mattly
AppFog
Platform as a Service by and for developers

Slide 3

Slide 3 text

PHP Fog
our first product
started as a prototype

Slide 4

Slide 4 text

symptoms of a problem
lots of things are randomly very slow or fail without a seeming connection to each other, except that it tends to happen at the same time

Slide 5

Slide 5 text

troubleshooting

statsd/graphite
+ easy to store data
+ stores right data
- hard to get at data
- complex API
- wasn’t helping me pinpoint the problem
- doesn’t highlight relationships

tailing log files
+ got a lot of raw data
+ helped understand the interaction between system components
- cost sanity and time

Slide 6

Slide 6 text

1 month finding root cause: gitolite on EBS
1 day writing code to replace gitolite, at my in-laws. in Yakima, Washington. on Thanksgiving.
1 month convincing people I was right before high-risk deploy

Slide 7

Slide 7 text

open-source Platform-as-a-Service toolkit created by VMware
launched with Ruby, Java, Node.js runtimes
we contributed PHP runtime support
it now also runs Python and Erlang (and if you wanted Perl, it wouldn’t be hard to add)
they built it to run on vSphere
we run it on AWS and others
cloudfoundry

Slide 8

Slide 8 text

cloudfoundry

Slide 9

Slide 9 text

we run five of those on three continents

Slide 10

Slide 10 text

Hey it’s really slow right now. can you take a look?
Sunday, 7:04am
*sigh* on it.

Slide 11

Slide 11 text

observe & measure
hey, there’s (one of the) problem(s)!

Slide 12

Slide 12 text

numbers are great, but you suck at stats
especially if you’re not aware you do
unless perhaps you’re German

Slide 13

Slide 13 text

averages lie
especially in comparisons
failures
times
what you want is the distribution
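
A minimal sketch of why the average misleads, assuming a handful of hypothetical request timings (the numbers and helper names below are illustrative, not taken from the talk). One hung request drags the mean far away from what most requests experienced, while percentiles keep the shape of the distribution visible.

```typescript
// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Hypothetical timings in ms: nine fast requests and one that hung.
const timings = [40, 42, 45, 47, 50, 52, 55, 58, 60, 5000];
const sorted = timings.slice().sort((a, b) => a - b);

const summary = {
  mean: mean(timings),           // pulled up by the single outlier
  p50: percentile(sorted, 50),   // what a typical request saw
  p90: percentile(sorted, 90),
  max: sorted[sorted.length - 1],
};
console.log(summary);
```

Comparing `mean` against `p50` and `max` shows the gap: the mean describes no request that actually happened.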

Slide 14

Slide 14 text

the human eye can quickly make sense of a lot of data
but percentiles don’t tell the whole story
and summaries lie too

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

break out by facets
appRegistry: resolve service via database
dispatcher: import service on Rackspace
(oh and guess what? this one failed)

Slide 17

Slide 17 text

Slide 18

Slide 18 text

we run the largest installations of cloudfoundry
some bugs only manifest at the edges
the only one run as a pay-for service on the public cloud (that is, AWS, Rackspace Cloud, etc)

Slide 19

Slide 19 text

if you can’t measure from the inside, then observe it from the outside
cf’s deploy mechanism had timeout issues, particularly with AWS/ELB and large apps

Slide 20

Slide 20 text

make it obvious that something is wrong from across the room
EBS... again

Slide 21

Slide 21 text

the site is unresponsive?
failures gonna propagate

Slide 22

Slide 22 text

cloudfoundry academy, lesson 1: uncaught exceptions will kill you

Slide 23

Slide 23 text

summarize into buckets to help find the pain points
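
A rough sketch of the bucketing idea, assuming fixed duration boundaries and per-operation counts. The boundaries, sample values, and function names are hypothetical; only the operation names `appRegistry` and `dispatcher` come from the slides.

```typescript
interface Sample {
  op: string; // which component/operation produced the timing
  ms: number; // duration in milliseconds
}

// Upper bounds (ms) of each bucket; slower samples land in a final overflow bucket.
const BOUNDS = [100, 500, 1000, 5000];

function bucketIndex(ms: number): number {
  for (let i = 0; i < BOUNDS.length; i++) {
    if (ms <= BOUNDS[i]) return i;
  }
  return BOUNDS.length;
}

// Count samples per operation per bucket, so the slow tail of each
// operation shows up as its own column instead of vanishing into a mean.
function summarize(samples: Sample[]): Record<string, number[]> {
  const out: Record<string, number[]> = {};
  for (const s of samples) {
    let counts = out[s.op];
    if (counts === undefined) {
      counts = BOUNDS.map(() => 0).concat([0]);
      out[s.op] = counts;
    }
    counts[bucketIndex(s.ms)] += 1;
  }
  return out;
}

// Hypothetical samples for illustration.
const samples: Sample[] = [
  { op: "appRegistry", ms: 80 },
  { op: "appRegistry", ms: 120 },
  { op: "appRegistry", ms: 90 },
  { op: "dispatcher", ms: 700 },
  { op: "dispatcher", ms: 6200 },
  { op: "dispatcher", ms: 450 },
];
const buckets = summarize(samples);
console.log(buckets);
```

Any operation with counts in the rightmost buckets is a candidate pain point worth drilling into.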

Slide 24

Slide 24 text

drill into the buckets to figure out what’s wrong
in this case, the culprit was a new edge case in creating java apps

Slide 25

Slide 25 text

staying on top of network problems

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

AWS US-East is having problems... again
with EBS... again
this is the AWS outage around Thanksgiving 2012 that took down half the internet

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

a quick tour of how I learned to draw

Slide 32

Slide 32 text

Tufte’s three principles of data density:
1. Above all else, show the data
2. Maximize the data-to-ink ratio
3. Erase non-data ink
basically, make every pixel mean something

Slide 33

Slide 33 text

d3
+ lots of good tools for simple data
+ almost an all-in-one solution
+ many prefab “layouts”
+ fluent interface (like jQuery)
? uses svg
- data “joins” are a little weird
- fluent interface isn’t always predictable
svg inserts a dom node per shape
if you’ve got >50k data points, consider...

Slide 34

Slide 34 text

canvas
+ single dom-node
- quickdraw-like API
- lack of comprehensive docs
- renders to pixels, no zooming
  also, blurry on Retina (first world problem, I know)
- doesn’t necessarily give performance gains

performance-tuning rendering
changing render contexts is expensive

Slide 35

Slide 35 text

invert data by style
use beginPath() and fill() sparingly
(yes, this is coffeescript)
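
The original slide's CoffeeScript is not preserved in this transcript. The following is an independent TypeScript sketch of the same idea, not the author's code: group points by fill style first, then issue one `beginPath()` and one `fill()` per style rather than per point. `RectContext`, `Point`, `groupByColor`, and `drawBatched` are hypothetical names, and `RectContext` is a narrowed stand-in for the few canvas 2D context members used here.

```typescript
// Stand-in for the parts of a canvas 2D context this sketch touches
// (fillStyle is narrowed to string for simplicity).
interface RectContext {
  fillStyle: string;
  beginPath(): void;
  rect(x: number, y: number, w: number, h: number): void;
  fill(): void;
}

interface Point {
  x: number;
  y: number;
  color: string;
}

// "Invert" the data: index points by their style instead of walking them in data order.
function groupByColor(points: Point[]): Record<string, Point[]> {
  const groups: Record<string, Point[]> = {};
  for (const p of points) {
    if (groups[p.color] === undefined) groups[p.color] = [];
    groups[p.color].push(p);
  }
  return groups;
}

// One style change, one beginPath() and one fill() per color,
// no matter how many points share that color.
function drawBatched(ctx: RectContext, points: Point[], size: number): void {
  const groups = groupByColor(points);
  for (const color of Object.keys(groups)) {
    ctx.fillStyle = color;
    ctx.beginPath();
    for (const p of groups[color]) {
      ctx.rect(p.x, p.y, size, size);
    }
    ctx.fill();
  }
}
```

With N points in K colors this makes K style changes and K fills instead of N of each, which matters when N is in the tens of thousands.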