Slide 1

Slide 1 text

Disaster Porn ... and the importance of being a generalist

Slide 2

Slide 2 text

About Me
Scott Sanders
Senior Systems Administrator
RideCharge, Inc.
@scott_sanders
ssanders@taximagic.com

Slide 3

Slide 3 text

Surge Conference 2011
● Ben Fried's (Google CIO) keynote speech talks about the importance of being a generalist
● I think specializing is fine (and normal as your career advances), but it's VITAL to keep a generalist perspective
● Disaster porn!

Slide 4

Slide 4 text

Background
● Taxi Magic
  ○ Mobile applications to book/track/pay for taxis
  ○ Web booking integration for taxi fleets
  ○ In-car payment hardware (PIM)
● What's a PIM?
  ○ Passenger Information Monitor
  ○ 7" HD touchscreen
  ○ Credit card swipe
  ○ Wired into cab hardware and dispatch system
  ○ Uses cellular communication to talk to TM
  ○ Regular GPS events over UDP
  ○ Payment transactions over HTTPS

Slide 5

Slide 5 text

The problems begin... (June 5th)
● A handful of cab drivers in Los Angeles begin reporting failures when swiping CCs
● Embedded hardware team recalls a few cabs and investigates local log files
● They report problems during the SSL handshake to RideCharge servers
● Tech Ops team remaps httpd to the same libcrypto.so and libssl.so version as the PIM using libmap.conf(5)
● Problem vanishes! HOORAY!!! Beer!

Slide 6

Slide 6 text

Fast forward to June 12th...
● SHTF
● Widespread reports of failing CC swipes across the entire SoCal region
● Hardware team pulls more vehicles and notices the same SSL handshake problem
● Tech Ops team is unable to correlate this to a drop in traffic
● Furthermore, Tech Ops is still seeing regular GPS updates from ALL active cabs!

Slide 7

Slide 7 text

WTF?

Slide 8

Slide 8 text

Diving in...
● Our cellular ISP insists they aren't having any problems
● (Sound familiar to anyone?)
● I start running the standard toolkit looking for patterns
  ○ tcpdump
  ○ traceroute
  ○ NMAP
● NMAP is giving me some inconsistent results

Slide 9

Slide 9 text

Understanding how TCP/IP works
● How do you establish a TCP connection?
  ○ SYN (Hey, you there?)
  ○ SYN/ACK (Yeah, what's up?)
  ○ ACK (Cool, let's talk!)
● What happens if you connect to a port that doesn't have a service bound to it?
  ○ SYN (Hey, you there?)
  ○ RST (Leave me alone!)
● So why am I only getting a RST every now and then? Why do I see timeouts instead?
● This is starting to smell like a routing problem
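The difference between a refused connection (RST) and a silent timeout can also be observed without nmap. What follows is a minimal sketch, not something from the talk: it assumes bash (for its /dev/tcp redirection) and the GNU coreutils timeout command, and the function names probe_port and classify_probe_status are hypothetical. Any failure that is not a timeout is treated as a refusal, which is only an approximation, since errors such as an unreachable network land in the same bucket.

```shell
#!/usr/bin/env bash

# Map a probe's exit status onto labels modeled on the two outcomes
# seen in this deck (conn-refused, no-response), plus "open".
classify_probe_status () {
  case "$1" in
    0)   echo "open" ;;          # three-way handshake completed
    124) echo "no-response" ;;   # timeout(1) gave up: nothing came back
    *)   echo "conn-refused" ;;  # connect failed fast (approximation)
  esac
}

# Attempt a TCP connect to host:port, waiting at most wait_secs seconds.
probe_port () {
  local host=$1 port=$2 wait_secs=${3:-3}
  timeout "$wait_secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
  classify_probe_status "$?"
}
```

Calling probe_port with a cab's IP and a port that has no listener should print conn-refused on a healthy path; repeated no-response answers for the same kind of target are the symptom described above.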

Slide 10

Slide 10 text

Proving the problem exists
● Since I am receiving GPS updates over UDP from all the cabs, I can use them to identify the IP of a cab and its location at a point in time
● We know the expected behavior when attempting a connection to a closed port
● Let's run some tests and gather some data

Slide 11

Slide 11 text

comm_test.sh

#!/usr/bin/env bash

test_connection () {
  # fork a subshell to handle the tcp connect test
  (
    # the result is either no-response or conn-refused
    result=$(nmap -P0 -T1 -sT -p22 --reason -q $4 | awk '/^22/{print $4}')
    echo "$1 $2 $3 $4 $result $8 $9" >> results.txt
  ) &
}

# connect to the gps receiver host and monitor real-time UDP gps updates
ssh -t gps001.iad1.prod.rws 'tail -F gps_updates.csv' | while read line ; do
  # line format: Jun 16 15:14:45, 184.251.233.91, 0, 20, 2577, \
  #              33.9822566666667, -118.4593
  line=$(echo $line | tr -d ',')
  test_connection $line
done

Slide 12

Slide 12 text

Results

% comm_test.sh
Jun 16 15:28:00 102.122.93.194 conn-refused 33.8221321105957 -116.548851013184
Jun 16 15:27:57 176.135.73.0 conn-refused 32.8885866666667 -97.0376933333333
Jun 16 15:27:59 181.251.163.200 conn-refused 33.9004183333333 -118.387591666667
Jun 16 15:27:53 178.156.201.182 conn-refused 44.9484977722168 -93.2568588256836
Jun 16 15:27:28 180.229.138.141 no-response 39.766675 -104.940496666667
Jun 16 15:27:28 187.231.74.250 no-response 33.80945 -118.206921666667
Jun 16 15:28:00 181.255.84.59 conn-refused 34.0593466666667 -118.24536
Jun 16 15:27:55 78.6.67.236 conn-refused 34.0581833333333 -118.415878333333
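A quick tally makes the split between healthy and broken paths easier to see than raw rows. This is a small sketch rather than part of the original script: the function name summarize_results is hypothetical, and it assumes one result per line in the format comm_test.sh writes (month, day, time, IP, result, latitude, longitude).

```shell
#!/usr/bin/env bash

# Count probe outcomes. Field 5 is the nmap reason recorded by
# comm_test.sh (conn-refused or no-response).
summarize_results () {
  awk 'NF >= 5 { count[$5]++ }
       END { for (r in count) print r, count[r] }' | sort
}
```

Run as summarize_results < results.txt. For the eight rows shown on this slide that gives six conn-refused and two no-response.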

Slide 13

Slide 13 text

Visualize the problem
Awesome way to get non-techies on your side and impress some management :-)
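The slides do not say which tool produced the maps. One plausible route, sketched here as an assumption, is to turn the results file into a CSV with latitude and longitude columns, which most mapping tools can import. The function name results_to_csv and the column names are hypothetical.

```shell
#!/usr/bin/env bash

# Convert comm_test.sh result lines (month day time ip result lat lon)
# into CSV. Lines without exactly seven fields, such as probes where
# nmap returned no reason, are skipped.
results_to_csv () {
  awk 'BEGIN { OFS = ","; print "timestamp,ip,result,lat,lon" }
       NF == 7 { print $1 " " $2 " " $3, $4, $5, $6, $7 }'
}
```

Run as results_to_csv < results.txt > results.csv, then color the points by the result column so no-response clusters stand out.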

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Beating up your ISP (figuratively)
● After more than a dozen calls to the ISP and as many "escalations", we landed on a conference call with some lead network engineers
● After 6 hours on this conference call reiterating the problem and showing the data, one engineer asks us to "hold tight"
● Things get very quiet...
● Like magic, all of my tests start succeeding!

Slide 17

Slide 17 text

WTF!?!

Slide 18

Slide 18 text

The backstory
● On June 5th, the ISP migrated the SoCal region to a new data center in Anaheim. This was an epic failure and they rolled back
● On June 12th, the ISP migrated again to Anaheim "successfully"
● Cell traffic is pooled by connection, and one of the pools was routing asymmetrically
● Asymmetric routing + stateful firewalls = BAD
● Updating the routing tables fixed everything

Slide 19

Slide 19 text

Being a generalist
● A DevOps culture requires generalists
● Understanding the full stack means being able to troubleshoot problems at all layers
● Fluid communication between sysadmins, developers, hardware engineers, and network engineers requires generalists
● Fewer people in the war room results in faster problem solving
● This saves time and money and makes your team more valuable to the business

Slide 20

Slide 20 text

We're hiring!
https://taximagic.com/en_US/about/careers
@scott_sanders
ssanders@taximagic.com
Thank you!