Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops

September 21, 2018

Network failures continue to plague datacenter
operators as their symptoms may not have
direct correlation with where or why they occur. We
introduce 007, a lightweight, always-on diagnosis application
that can find problematic links and also
pinpoint problems for each TCP connection. 007 is
completely contained within the end host. During
its two month deployment in a tier-1 datacenter, it
detected every problem found by previously deployed
monitoring tools while also finding the sources of
other problems previously undetected.



  1. Papers We Love Sept. 2018 007: Democratically Finding The Cause

    of Packet Drops Michael Kehoe Staff Site Reliability Engineer NSDI - https://www.usenix.org/conference/nsdi18/presentation/arzani
  2. 007: Democratically Finding The Cause of Packet Drops Behnaz Arzani

    Selim Ciraci Luiz Chamon Yibo Zhu Hongqiang Liu Jitu Padhye Boon Thau Loo Geoff Outhred
  3. Today’s agenda 1 Introduction & Motivation 2 TCP Monitoring Agent

    3 Path Discovery Agent 4 Analysis Agent 5 Evaluations: Simulations 6 Evaluations: Production 7 Discussion
  4. Introduction & Motivation “Even a small network outage or a

    few lossy links can cause the VM to “panic” and reboot. In fact, 17% of our VM reboots are due to network issues and in over 70% of these none of our monitoring tools were able to find the links that caused the problem.”
  5. Introduction & Motivation • Roy et al [2] • Requires

    modifications to routers • Requires additional features on switches
  6. “In a network of ≥ 10^6 links it’s a reasonable

    assumption that there is a non-zero chance that a number (> 10) of these links are bad (due to device, port, or cable, etc.)…However, currently we do not have a direct way to correlate customer impact with bad links.” Introduction & Motivation
  7. “007 records the path of TCP connections (flows) suffering from

    one or more retransmissions and assigns proportional “blame” to each link on the path. It then provides a ranking of links that represents their relative drop rates.” Introduction & Motivation
  8. Introduction & Motivation 1. Does not require any changes to

    network infrastructure 2. Does not require any changes to client software 3. Detects in-band failures 4. Resilient to noise 5. Negligible overhead
  9. Assumptions DISCUSSION 1. L2 networks are not viable unless they:

    a. Support path discovery methods b. Support EverFlow 2. No use of Source NATs (SNATs) 3. Assumes ECMP (L3) Clos network design 4. Don’t try to reverse-engineer ECMP
  10. Design Overview • TCP monitoring agent: detects retransmissions at each

    end-host. • Path discovery agent: identifies the flow’s path to the Destination IP (DIP) • At the end-hosts, a voting scheme is used based on the paths of flows that had retransmissions. At regular intervals of 30 s, the votes are tallied by a centralized analysis agent to find the top-voted links.
  11. Design Overview • 6000 lines of C++ code • 600KB

    memory usage • 1-3% CPU usage • 200 KB/s bandwidth utilization
  12. TCP Monitoring Agent • TCP Monitoring agent notifies Path Discovery

    Agent immediately after any retransmit • Use of ‘Event Tracing for Windows’ (ETW) • Could use BPF in Linux
  13. Path Discovery Agent “The path discovery agent uses traceroute packets

    to find the path of flows that suffer retransmissions. These packets are used solely to identify the path of a flow. They do not need to be dropped for 007 to operate”
  14. Path Discovery Agent “Once the TCP monitoring agent notifies the

    path discovery agent that a flow has suffered a retransmission, the path discovery agent checks its cache of discovered path for that epoch…It then sends 15 appropriately crafted TCP packets with TTL values ranging from 1–15.”
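The probe step quoted above can be sketched roughly as follows. This is a minimal illustration, not the 007 implementation: the names `FiveTuple`, `Probe`, and `PathDiscovery` are hypothetical, the cache-hit behaviour (send nothing if the flow was already traced in this epoch) is an assumption about the elided part of the quote, and the sketch only builds probe descriptions rather than sending packets. Each probe reuses the flow's five-tuple so that, under ECMP, it would follow the same path as the flow itself.

```python
from dataclasses import dataclass
from typing import NamedTuple

MAX_TTL = 15  # the quoted text mentions TTL values ranging from 1 to 15


class FiveTuple(NamedTuple):
    """Hypothetical flow key; ECMP hashes on these fields."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


@dataclass(frozen=True)
class Probe:
    """Description of one traceroute-style packet (not actually sent here)."""
    flow: FiveTuple
    ttl: int


class PathDiscovery:
    """Plans probes for flows that retransmitted, at most once per epoch."""

    def __init__(self):
        self._epoch = None
        self._traced = set()

    def on_retransmission(self, flow, epoch):
        # Assumption: the cache of discovered paths is reset at each new epoch.
        if epoch != self._epoch:
            self._epoch = epoch
            self._traced.clear()
        # Assumption: a cache hit means no new probes are needed.
        if flow in self._traced:
            return []
        self._traced.add(flow)
        # One probe per TTL, all carrying the flow's own five-tuple.
        return [Probe(flow, ttl) for ttl in range(1, MAX_TTL + 1)]


agent = PathDiscovery()
flow = FiveTuple("10.0.0.1", 51000, "10.0.1.7", 443, "tcp")
first = agent.on_retransmission(flow, epoch=0)   # 15 probes planned
repeat = agent.on_retransmission(flow, epoch=0)  # cached, nothing planned
```

A real agent would also need raw-socket access to craft the packets and would have to match the ICMP replies back to each TTL; both are left out of this sketch.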
  15. Path Discovery Agent ENGINEERING CHALLENGES – ECMP • ECMP algorithms

    are unknown • All packets of a given flow, defined by the five-tuple, follow the same path

    • Traceroute itself may fail • A lossy link may cause one or more BGP sessions to fail, triggering rerouting

    a pre-mapped topology of: • Switch/Router names • Router/ Interface IP addresses
  18. Analysis Agent VOTING BASED SCHEME • Good votes are 0

    • Bad votes are 1/h, where h is the number of hops on the path • Each link on the path is given a vote
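The voting scheme on this slide can be sketched in a few lines. This is an illustrative sketch only: the function name `tally_votes` and the link identifiers are made up for the example, and a path is assumed to be a sequence of link names whose length is the hop count h. Flows without retransmissions contribute a vote of 0, so only flows that retransmitted are passed in.

```python
from collections import defaultdict


def tally_votes(flow_paths):
    """Rank links by the blame they collect in one epoch.

    flow_paths: one path per flow that saw a retransmission; each path is a
    sequence of link identifiers (hypothetical names in this example).
    """
    votes = defaultdict(float)
    for path in flow_paths:
        h = len(path)  # number of hops on the path
        if h == 0:
            continue  # no path discovered, nothing to blame
        for link in path:
            votes[link] += 1.0 / h  # each link on the path gets a 1/h vote
    # Highest-voted links first.
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)


paths = [
    ("h1-tor1", "tor1-t1a", "t1a-tor2", "tor2-h2"),
    ("h3-tor1", "tor1-t1a", "t1a-tor3", "tor3-h4"),
]
ranking = tally_votes(paths)
# "tor1-t1a" is on both paths and collects 1/4 + 1/4 = 0.5,
# so it ranks first; every other link has 0.25.
```

Splitting one unit of blame across the path means a link shared by many failing flows rises to the top, while links that merely sit next to a bad one accumulate far fewer votes.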
  19. Analysis Agent VOTING BASED SCHEME • Congestion & single drops

    are akin to noise • Single flow is unlikely to go through more than one failed link • Probability of errors in results diminishes exponentially with the number of flows
  20. Simulations PERFORMANCE • Accuracy: Proportion of correctly identified drop causes

    • Recall: How many of the failures are detected (false negatives) • Precision: How trusted are the results (false positives)
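As a rough illustration of the three metrics, the sketch below computes them from sets of links and per-flow labels. The function names and the sample data are hypothetical, and the definitions are the standard ones rather than anything taken from the paper's evaluation code: precision penalises false positives, recall penalises false negatives, and accuracy is computed per flow.

```python
def precision_recall(detected, actual):
    """Precision and recall of a set of links flagged as bad."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    # Convention assumed here: an empty set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


def accuracy(predicted_cause, true_cause):
    """Fraction of flows whose drop was attributed to the right link.

    Both arguments map a flow id to a link id.
    """
    if not true_cause:
        return 1.0
    correct = sum(
        1 for flow, link in true_cause.items()
        if predicted_cause.get(flow) == link
    )
    return correct / len(true_cause)


# Hypothetical example: three links flagged, four links actually bad.
precision, recall = precision_recall({"a", "b", "c"}, {"a", "b", "d", "e"})
# Two of three flagged links are really bad; two of four bad links were found.

acc = accuracy({"f1": "a", "f2": "b", "f3": "c"},
               {"f1": "a", "f2": "b", "f3": "d"})
# Two of three flows were attributed to the correct link.
```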
  21. Evaluation: Simulations PERFORMANCE: OPTIMAL CASE • 0.05–1% drop rate

    • Accuracy is > 96% • Recall/ Precision is almost always 100% https://github.com/behnazak/Vigil-007SourceCode
  22. Evaluation: Simulations PERFORMANCE: VARYING DROP RATES • Maintains accuracy for

    both single and multiple failures https://github.com/behnazak/Vigil-007SourceCode
  23. Evaluation: Simulations PERFORMANCE: IMPACT OF NOISE • Almost no impact

  24. Evaluation: Simulations PERFORMANCE: TRAFFIC SKEWS • Can tolerate 50% skew

    • When TOR traffic >50% & >10 failures, accuracy suffers https://github.com/behnazak/Vigil-007SourceCode
  25. Evaluation: Simulations PERFORMANCE: BAD LINKS • 007 can detect up

    to 7 failures with accuracy > 90% https://github.com/behnazak/Vigil-007SourceCode
  26. Evaluation: Simulations PERFORMANCE: NETWORK SIZE • Single failure: • Accuracy

    >98% for up to 6 pods • Multiple failures: • Accuracy >98.01% for 30 failed links https://github.com/behnazak/Vigil-007SourceCode
  27. Evaluation: Production • 007 located the bad link correctly in 281

    cases of VM reboot in Microsoft DCN • Identifies an average of 0.45 ± 0.12 links as bad per epoch • Of links dropping packets: • 48%: Server to TOR • 24%: T1 – TOR • 6%: T2 – T1
  28. Discussion • Congestion detection • Ranking with bias • Finding

    the cause of other problems • 007 can also be used for: • Detection of switch failures