Slide 1

Slide 1 text

Modern Radiology for Distributed Systems
Dietrich Featherston
@d2fn
Thursday, October 11, 12

Slide 2

Slide 2 text

This is a talk about monitoring

Slide 3

Slide 3 text

But not just any kind of monitoring
Non-invasive monitoring

Slide 4

Slide 4 text

non-invasive monitoring: measures taken to describe the state of a system with minimal changes to the system being monitored

Slide 5

Slide 5 text

[Figure: insight versus invasiveness, with radiographic imagery labeled]

Slide 6

Slide 6 text

preventative care: measures taken to prevent diseases or injuries rather than curing them or treating their symptoms

Slide 7

Slide 7 text

Non-invasive monitoring techniques focus primarily on host-based metrics
Why is this a problem?

Slide 8

Slide 8 text

Because applications are distributed

Slide 9

Slide 9 text

[Figure: information emitted versus network size]
Information emitted about nodes in the network: n
Information emitted about edges in the network: n²
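To make the n versus n² gap concrete, here is a minimal sketch that counts possible directed edges between nodes. It assumes every node may talk to every other node; the class and method names are hypothetical and not from the talk.

```java
// Hypothetical helper, not from the talk: compares node and edge counts.
final class NetworkGrowth {
    // Directed edges possible when every node may talk to every other node.
    // This grows on the order of n², while per-node information grows as n.
    static long possibleEdges(long nodes) {
        return nodes * (nodes - 1);
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 100, 1000}) {
            System.out.println(n + " nodes -> up to " + possibleEdges(n) + " directed edges");
        }
    }
}
```

Under that assumption, a tenfold increase in hosts yields roughly a hundredfold increase in edges to reason about.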

Slide 10

Slide 10 text

We analyze cell structure because we can’t envision the whole organism
We react to disease and injury because we lack preventative care

Slide 11

Slide 11 text

We lack preventative care for applications because our non-invasive monitoring techniques are growing less and less meaningful

Slide 12

Slide 12 text

Radiology is useful in illuminating non-invasive monitoring of distributed systems

Slide 13

Slide 13 text


Slide 14

Slide 14 text


Slide 15

Slide 15 text


Slide 16

Slide 16 text

Context is everything

Slide 17

Slide 17 text

How do we use context?

Slide 18

Slide 18 text

[Diagram labels: Context; Your Big Dumb Data; !!!]

Slide 19

Slide 19 text

[Diagram labels: Human brain + med school; Radiographic Imagery; Diagnoses]

Slide 20

Slide 20 text

[Diagram labels: Signal Processing; VLA Output; E.T.]

Slide 21

Slide 21 text

[Diagram labels: Network Data; Application Behavior; Application Topology; Signal Processing; Expert Brain]

Slide 22

Slide 22 text

dimensions (11)
- epoch seconds
- epoch minutes
- epoch hours
- node id
- source ip
- source port
- dest ip
- dest port
- interface
- country
- network/asn

measurements (8)
- egress packets
- egress octets
- ingress packets
- ingress octets
- retransmits
- errors
- app-rtt
- handshake-rtt
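One way to read the three time dimensions is as progressively coarser buckets of the same timestamp. The sketch below assumes epoch minutes and epoch hours are integer divisions of epoch seconds, which the slide does not confirm, and rolls a measurement such as egress octets up per minute. All names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical rollup helper; the bucket definitions are an assumption.
final class TimeRollup {
    static long epochMinute(long epochSeconds) {
        return Math.floorDiv(epochSeconds, 60L);
    }

    static long epochHour(long epochSeconds) {
        return Math.floorDiv(epochSeconds, 3600L);
    }

    // Sums one measurement (for example egress octets) per epoch minute.
    // Both arrays must have the same length: sample i pairs
    // epochSeconds[i] with octets[i].
    static Map<Long, Long> sumByMinute(long[] epochSeconds, long[] octets) {
        Map<Long, Long> totals = new TreeMap<>();
        for (int i = 0; i < epochSeconds.length; i++) {
            totals.merge(epochMinute(epochSeconds[i]), octets[i], Long::sum);
        }
        return totals;
    }
}
```

Storing the coarser buckets as dimensions in their own right lets a query group by minute or hour without recomputing them from the raw timestamp.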

Slide 23

Slide 23 text

Case Study #1
GC-Death of a distributed JVM application

Slide 24

Slide 24 text


Slide 25

Slide 25 text

Case Study #2
Symptoms:
- Latent Riak handoff
- Cluster throughput bottoming out

Slide 26

Slide 26 text


Slide 27

Slide 27 text

busy_dist_port

Slide 28

Slide 28 text

+zdbbl 8192

Slide 29

Slide 29 text


Slide 30

Slide 30 text

Case Study #3
Bringing a dead Riak node back online

Slide 31

Slide 31 text


Slide 32

Slide 32 text


Slide 33

Slide 33 text


Slide 34

Slide 34 text

Case Study #4
Retransmits 10% of total network throughput

Slide 35

Slide 35 text


Slide 36

Slide 36 text

var put: HttpPut = null
try {
  // ... put data
} catch {
  case e: Exception =>
    // ... handle exception
} finally {
  if (put != null) {
    put.abort()
  }
}

Slide 38

Slide 38 text

abort
public void abort()
Description copied from interface: HttpUriRequest
Aborts execution of the request.
Source: http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/client/methods/HttpRequestBase.html#abort()
THANKS

Slide 39

Slide 39 text

public void abort() {
    ClientConnectionRequest localRequest;
    ConnectionReleaseTrigger localTrigger;

    this.abortLock.lock();
    try {
        if (this.aborted) {
            return;
        }
        this.aborted = true;

        localRequest = connRequest;
        localTrigger = releaseTrigger;
    } finally {
        this.abortLock.unlock();
    }

    // Trigger the callbacks outside of the lock, to prevent
    // deadlocks in the scenario where the callbacks have
    // their own locks that may be used while calling
    // setReleaseTrigger or setConnectionRequest.
    if (localRequest != null) {
        localRequest.abortRequest();
    }
    if (localTrigger != null) {
        try {
            localTrigger.abortConnection();
        } catch (IOException ex) {
            // ignore
        }
    }
}
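The source above ends in abortConnection(), which HttpClient 4.x documents as a hard release that implies shutting the connection down. The slides do not spell out the conclusion, so treat the following as one plausible reading: calling abort() in finally on every request means no connection is ever reused. The sketch is a toy model with hypothetical types and no HttpClient dependency. It only counts how many connections each pattern has to open.

```java
// Toy model, not HttpClient: every name here is hypothetical.
final class AbortModel {
    // Tracks a single reusable connection slot and counts new connections.
    static final class Pool {
        int opened = 0;
        private boolean idleAvailable = false;

        void lease() {
            if (idleAvailable) {
                idleAvailable = false; // reuse the kept-alive connection
            } else {
                opened++;              // nothing idle: open a new connection
            }
        }

        void release() { idleAvailable = true; }  // graceful: keep alive for reuse
        void abort()   { idleAvailable = false; } // hard: connection is shut down
    }

    // Mirrors the slide: abort in finally on every request.
    static int alwaysAbort(int requests) {
        Pool pool = new Pool();
        for (int i = 0; i < requests; i++) {
            pool.lease();
            try {
                // ... put data
            } finally {
                pool.abort();
            }
        }
        return pool.opened;
    }

    // Alternative: abort only when the request failed, otherwise release.
    static int abortOnFailureOnly(int requests) {
        Pool pool = new Pool();
        for (int i = 0; i < requests; i++) {
            pool.lease();
            boolean ok = false;
            try {
                // ... put data
                ok = true;
            } finally {
                if (ok) pool.release(); else pool.abort();
            }
        }
        return pool.opened;
    }
}
```

In this model, 100 successful requests open 100 connections with the always-abort pattern and 1 connection when abort is reserved for failures.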

Slide 40

Slide 40 text


Slide 41

Slide 41 text

augmented intelligence precedes artificial intelligence

Slide 42

Slide 42 text

1895: Wilhelm Röntgen discovers X-rays. First medical use of X-rays in human imaging takes place one month later

Slide 43

Slide 43 text

1895: Wilhelm Röntgen discovers X-rays. First medical use of X-rays in human imaging takes place one month later
1905: First English text on chest radiography

Slide 44

Slide 44 text

1895: Wilhelm Röntgen discovers X-rays. First medical use of X-rays in human imaging takes place one month later
1905: First English text on chest radiography
1920: Society of Radiographers formed

Slide 45

Slide 45 text

Recognition of radiology as a formal medical discipline was a cultural problem, not a technology problem
http://www.bshr.org.uk/page13.html

Slide 46

Slide 46 text

If you want to talk about the query language used to ask questions of the network data we collect at Boundary, talk to me afterward or hit me up on Twitter.
@d2fn
github.com/dietrichf

Slide 47

Slide 47 text

Find 45 minutes of total traffic seen on meters 1, 2, 226, & 301
starting 18 hours ago
broken down by peer ip
retain top 10 by the ratio of retransmits to packets

get volume_1s_meter_ip [
  meter in {1, 2, 226, 301};
  epochMillis from -18h for 45m;
]
categorize
  sum(ingress) as ingress,
  sum(egress) as egress,
  sum(ingressPackets + egressPackets) as packets,
  sum(retransmits) as retransmits,
  mean(appRttUsec/1000) as appRttMs
by epochMillis, ip
retain top 10 per epochMillis on retransmits/packets
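The final "retain top 10 ... on retransmits/packets" step can be pictured as a sort and truncate over already-aggregated rows. This is a rough sketch of that idea for a single time bucket, not Boundary's implementation. The Row class, the retainTop method, and the choice to treat zero packets as a ratio of 0 are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a "retain top k" step; not Boundary's query engine.
final class RetainTop {
    // One aggregated row for a peer ip within a single time bucket.
    static final class Row {
        final String ip;
        final long packets;
        final long retransmits;

        Row(String ip, long packets, long retransmits) {
            this.ip = ip;
            this.packets = packets;
            this.retransmits = retransmits;
        }

        // Assumption: a row with no packets ranks as ratio 0.
        double ratio() {
            return packets == 0 ? 0.0 : (double) retransmits / packets;
        }
    }

    // Returns the k rows with the highest retransmit ratio, highest first.
    static List<Row> retainTop(List<Row> rows, int k) {
        List<Row> sorted = new ArrayList<>(rows);
        Comparator<Row> byRatio = Comparator.comparingDouble(Row::ratio);
        sorted.sort(byRatio.reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

Ranking on the ratio rather than the raw retransmit count keeps low-volume peers with a high loss rate from being hidden behind busy but healthy ones.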
