Disaster Porn Lightning Talk

Scott Sanders
September 27, 2012

A bit of disaster porn, and why it's important to be a generalist.

Transcript

  1. Disaster Porn ... and the importance of being a generalist

  2. About Me
    • Scott Sanders
    • Senior Systems Administrator, RideCharge, Inc.
    • @scott_sanders
    • ssanders@taximagic.com
  3. Surge Conference 2011
    • Ben Fried's (Google CIO) keynote speech talks about the importance of being a generalist
    • I think specializing is fine (and normal as your career advances), but it's VITAL to keep a generalist perspective
    • Disaster porn!
  4. Background
    • Taxi Magic
      ◦ Mobile applications to book/track/pay for taxis
      ◦ Web booking integration for taxi fleets
      ◦ In-car payment hardware (PIM)
    • What's a PIM?
      ◦ Passenger Information Monitor
      ◦ 7" HD touchscreen
      ◦ Credit card swipe
      ◦ Wired into cab hardware and dispatch system
      ◦ Uses cellular communication to talk to TM
      ◦ Regular GPS events over UDP
      ◦ Payment transactions over HTTPS
  5. The problems begin... (June 5th)
    • A handful of cab drivers in Los Angeles begin reporting failures when swiping CCs
    • Embedded hardware team recalls a few cabs and investigates local log files
    • Reports problems during SSL handshake to RideCharge servers
    • Tech Ops team remaps httpd to the same libcrypto.so and libssl.so version as the PIM using libmap.conf(5)
    • Problem vanishes! HOORAY!!! Beer!
  6. Fast forward to June 12th...
    • SHTF
    • Widespread reports of failing CC swipes across the entire SoCal region
    • Hardware team pulls more vehicles and notices the same SSL handshake problem
    • Tech Ops team is unable to correlate this to a drop in traffic
    • Furthermore, Tech Ops is still seeing regular GPS updates from ALL active cabs!
  7. WTF?

  8. Diving in...
    • Our cellular ISP insists they aren't having any problems
    • (Sound familiar to anyone?)
    • I start running the standard toolkit looking for patterns
      ◦ tcpdump
      ◦ traceroute
      ◦ NMAP
    • NMAP is giving me some inconsistent results
  9. Understanding how TCP/IP works
    • How do you establish a TCP connection?
      ◦ SYN (Hey, you there?)
      ◦ SYN/ACK (Yeah, what's up?)
      ◦ ACK (Cool, let's talk!)
    • What happens if you connect to a port that doesn't have a service bound to it?
      ◦ SYN (Hey, you there?)
      ◦ RST (Leave me alone!)
    • So why am I only getting a RST every now and then? Why do I see timeouts instead?
    • This is starting to smell like a routing problem
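The three outcomes on this slide (handshake completes, RST comes back, nothing comes back) can be told apart from a shell. The following is a minimal sketch, not the tooling used in the talk: it swaps nmap for bash's `/dev/tcp` redirection and assumes GNU coreutils `timeout` is installed. The function names `classify_status` and `probe` and the 5-second default wait are made up for illustration.

```shell
#!/usr/bin/env bash

# Map the exit status of a timed connect attempt to an outcome label.
# (Hypothetical helper; the labels mirror nmap's --reason strings.)
classify_status () {
  case "$1" in
    0)   echo "open" ;;          # three-way handshake completed
    124) echo "no-response" ;;   # timeout(1) gave up: the SYN was never answered
    *)   echo "conn-refused" ;;  # connect failed quickly, typically an RST;
                                 # other fast failures (e.g. unreachable network) land here too
  esac
}

# Attempt a TCP connect to host:port, waiting at most $3 seconds (default 5).
probe () {
  local host=$1 port=$2 wait=${3:-5}
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
  classify_status "$?"
}
```

Calling `probe 184.251.233.91 22` against a cab that is reachable but has nothing listening on port 22 should print `conn-refused`. A `no-response` from the same kind of host suggests the SYN or its reply is being lost somewhere in between.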
  10. Proving the problem exists
    • Since I am receiving GPS updates over UDP from all the cabs, I can use this to identify the IP of a cab and its location at a point in time
    • We know the expected behavior when attempting a connection to a closed port
    • Let's run some tests and gather some data
  11. comm_test.sh

    #!/usr/bin/env bash

    test_connection () {
      # fork a subshell to handle the tcp connect test
      (
        # the result is either no-response or conn-refused
        result=$(nmap -P0 -T1 -sT -p22 --reason -q "$4" | awk '/^22/{print $4}')
        echo "$1 $2 $3 $4 $result $8 $9" >> results.txt
      ) &
    }

    # connect to the gps receiver host and monitor real-time UDP gps updates
    ssh -t gps001.iad1.prod.rws 'tail -F gps_updates.csv' | while read -r line ; do
      # line format: Jun 16 15:14:45, 184.251.233.91, 0, 20, 2577, \
      #              33.9822566666667, -118.4593
      line=$(echo "$line" | tr -d ',')
      # left unquoted on purpose: word splitting maps the fields to $1..$9
      test_connection $line
    done
  12. Results

    % comm_test.sh
    Jun 16 15:28:00 102.122.93.194 conn-refused 33.8221321105957 -116.548851013184
    Jun 16 15:27:57 176.135.73.0 conn-refused 32.8885866666667 -97.0376933333333
    Jun 16 15:27:59 181.251.163.200 conn-refused 33.9004183333333 -118.387591666667
    Jun 16 15:27:53 178.156.201.182 conn-refused 44.9484977722168 -93.2568588256836
    Jun 16 15:27:28 180.229.138.141 no-response 39.766675 -104.940496666667
    Jun 16 15:27:28 187.231.74.250 no-response 33.80945 -118.206921666667
    Jun 16 15:28:00 181.255.84.59 conn-refused 34.0593466666667 -118.24536
    Jun 16 15:27:55 78.6.67.236 conn-refused 34.0581833333333 -118.415878333333
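Once results.txt has a few hundred lines, a quick tally per outcome shows how widespread the timeouts are. This is a small sketch and not part of the original deck. It assumes the seven-field line layout shown on this slide, with the outcome in field 5. The function name `tally_outcomes` is hypothetical.

```shell
#!/usr/bin/env bash

# Count how many probes ended in each outcome.
# Reads result lines on stdin; field 5 is conn-refused or no-response.
tally_outcomes () {
  awk '{ count[$5]++ } END { for (r in count) print r, count[r] }' | sort
}
```

Run as `tally_outcomes < results.txt`. On the eight lines shown above it would report six `conn-refused` and two `no-response`.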
  13. Visualize the problem
    • Awesome way to get non-techies on your side and impress some management :-)
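The transcript does not say how the pictures on the following slides were produced. One plausible first step, assuming the goal is to plot each probe by latitude and longitude in some mapping or spreadsheet tool, is to turn results.txt into CSV. The sketch below is an illustration only; the function name `results_to_csv` and the column headers are hypothetical.

```shell
#!/usr/bin/env bash

# Convert result lines (stdin) into CSV with a header row.
# Input fields: month day time ip result lat lon
results_to_csv () {
  awk 'BEGIN { print "time,ip,result,lat,lon" }
       { printf "%s %s %s,%s,%s,%s,%s\n", $1, $2, $3, $4, $5, $6, $7 }'
}
```

Usage would be `results_to_csv < results.txt > results.csv`, after which the points can be colored by the `result` column.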
  14. None
  15. None
  16. Beating up your ISP (figuratively)
    • After more than a dozen calls to the ISP and as many "escalations", we landed on a conference call with some lead network engineers
    • After 6 hours on this conference call reiterating the problem and showing the data, one engineer asks us to "hold tight"
    • Things get very quiet...
    • Like magic, all of my tests start succeeding!
  17. WTF!?!

  18. The backstory
    • On June 5th, the ISP migrated the SoCal region to a new data center in Anaheim. This was an epic failure and they rolled back
    • On June 12th, the ISP migrated again to Anaheim "successfully"
    • Cell traffic is pooled by connection, and one of the pools was routing asymmetrically
    • Asymmetric routing + stateful firewalls = BAD
    • Updating the routing tables fixed everything
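Why asymmetric routing and stateful firewalls do not mix can be shown with a toy model. A stateful firewall lets a reply through only if it saw the packet that opened the connection, so when the reply returns via a different firewall, it is dropped. The sketch below is an illustration under that assumption, not a description of the ISP's actual network. The names `fw_outbound_syn`, `fw_inbound_reply`, `fwA` and `fwB` are all made up.

```shell
#!/usr/bin/env bash

# Toy state table: one entry per (firewall, client, server) that sent a SYN.
SEEN_SYN=""

# Record that <firewall> forwarded a SYN from <client> to <server>.
fw_outbound_syn () {
  SEEN_SYN="$SEEN_SYN $1|$2|$3"
}

# Decide what <firewall> does with a reply from <server> to <client>.
fw_inbound_reply () {
  case " $SEEN_SYN " in
    *" $1|$2|$3 "*) echo "pass" ;;   # this firewall saw the SYN: state exists
    *)              echo "drop" ;;   # no matching state: reply is discarded
  esac
}

# Symmetric path: SYN and reply both cross fwA.
#   fw_outbound_syn fwA cab server; fw_inbound_reply fwA cab server   -> pass
# Asymmetric path: SYN leaves via fwA, reply comes back via fwB.
#   fw_outbound_syn fwA cab server; fw_inbound_reply fwB cab server   -> drop
```

This would also be consistent with the GPS updates still arriving: a one-way UDP datagram needs no reply to cross the firewall, while a TCP or SSL handshake does.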
  19. Being a generalist
    • A DevOps culture requires generalists
    • Understanding the full stack means being able to troubleshoot problems at all layers
    • Fluid communication between sysadmins, developers, hardware engineers, and network engineers requires generalists
    • Fewer people in the war room results in faster problem solving
    • This saves time and money and makes your team more valuable to the business
  20. We're hiring! https://taximagic.com/en_US/about/careers @scott_sanders ssanders@taximagic.com Thank you!