Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops

Michael
September 21, 2018

Network failures continue to plague datacenter
operators as their symptoms may not have
direct correlation with where or why they occur. We
introduce 007, a lightweight, always-on diagnosis application
that can find problematic links and also
pinpoint problems for each TCP connection. 007 is
completely contained within the end host. During
its two month deployment in a tier-1 datacenter, it
detected every problem found by previously deployed
monitoring tools while also finding the sources of
other problems previously undetected.

Transcript

  1. Papers We Love Sept. 2018
    007: Democratically Finding The Cause
    of Packet Drops
    Michael Kehoe
    Staff Site Reliability Engineer
    NSDI - https://www.usenix.org/conference/nsdi18/presentation/arzani

  2. 007: Democratically Finding The Cause of Packet Drops
    Behnaz Arzani Selim Ciraci Luiz Chamon Yibo Zhu
    Hongqiang Liu Jitu Padhye Boon Thau Loo Geoff Outhred

  3. Today’s agenda
    1 Introduction & Motivation
    2 TCP Monitoring Agent
    3 Path Discovery Agent
    4 Analysis Agent
    5 Evaluations: Simulations
    6 Evaluations: Production
    7 Discussion

  4. Introduction & Motivation

  5. Introduction & Motivation
    “Even a small network outage or
    a few lossy links can cause the
    VM to “panic” and reboot. In fact,
    17% of our VM reboots are due
    to network issues and in over
    70% of these none of our
    monitoring tools were able to
    find the links that caused the
    problem.”

  6. Introduction & Motivation
    • Pingmesh [1]
    • Leaves gaps
    • Overhead
    • Out-of-band

  7. Introduction & Motivation
    • Roy et al. [2]
    • Requires modifications to
    routers
    • Requires additional
    features on switches

  8. Introduction & Motivation
    • Everflow [3]
    • Requires all traffic to be
    captured

  9. Introduction & Motivation
    “In a network of ≥ 10^6 links it’s a reasonable
    assumption that there is a non-zero chance that a
    number (> 10) of these links are bad (due to device,
    port, or cable, etc.)…However, currently we do not
    have a direct way to correlate customer impact with
    bad links.”

  10. Introduction & Motivation
    “007 records the path of TCP connections (flows)
    suffering from one or more retransmissions and
    assigns proportional “blame” to each link on the
    path. It then provides a ranking of links that
    represents their relative drop rates.”

  11. Introduction & Motivation
    1. Does not require any changes to network
    infrastructure
    2. Does not require any changes to client
    software
    3. Detects in-band failures
    4. Resilient to noise
    5. Negligible overhead

  12. Assumptions
    DISCUSSION
    1. L2 networks are not viable unless they:
        1. Support path discovery methods
        2. Support EverFlow
    2. No use of Source NATs (SNATs)
    3. Assumes ECMP (L3) Clos network design
    4. Don’t try to reverse-engineer ECMP

  13. Assumptions
    DISCUSSION

  14. Design Overview

  15. Design Overview
    • TCP monitoring agent: detects
    retransmissions at each end-host.
    • Path discovery agent: identifies
    the flow’s path to the
    Destination IP (DIP)
    • At the end-hosts, a voting scheme is
    used based on the paths of flows that
    had retransmissions. At regular
    intervals of 30s the votes are tallied
    by a centralized analysis agent to find
    the top-voted links.
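
The 30-second tally described above can be sketched as follows. This is a minimal illustration, not the deployed analysis agent: the `(timestamp, link, vote)` report format, the link names, and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict

EPOCH_SECONDS = 30  # tally interval stated on the slide


def tally_epochs(reports):
    """Sum per-link votes within each 30-second epoch.

    `reports` is an iterable of (timestamp_seconds, link, vote) tuples,
    a hypothetical report format chosen only for this sketch.
    """
    epochs = defaultdict(Counter)
    for timestamp, link, vote in reports:
        epochs[int(timestamp // EPOCH_SECONDS)][link] += vote
    return epochs


def top_voted(epochs, epoch, k=1):
    """Return the k links with the highest vote total in one epoch."""
    return epochs[epoch].most_common(k)


# Hypothetical votes reported by end hosts.
reports = [
    (3.0, ("tor1", "t1a"), 0.5),
    (12.0, ("tor1", "t1a"), 0.5),
    (20.0, ("tor2", "t1a"), 0.5),
    (41.0, ("tor2", "t1b"), 0.25),
]
epochs = tally_epochs(reports)
print(top_voted(epochs, 0))
```

Keeping one counter per epoch means a late report only affects the epoch it belongs to, which is one plausible way to keep tallies independent.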

  16. Design Overview
    • 6000 lines of C++ code
    • 600KB memory usage
    • 1-3% CPU Usage
    • 200 KBps bandwidth utilization

  17. TCP Monitoring Agent

  18. TCP Monitoring Agent
    • TCP Monitoring agent notifies
    Path Discovery Agent
    immediately after any
    retransmit
    • Use of ‘Event Tracing for
    Windows’ (ETW)
    • Could use BPF in Linux

  19. Path Discovery Agent

  20. Path Discovery Agent
    “The path discovery agent uses
    traceroute packets to find the
    path of flows that suffer
    retransmissions. These packets
    are used solely to identify the
    path of a flow. They do not need
    to be dropped for 007 to
    operate”

  21. Path Discovery Agent
    “Once the TCP monitoring agent
    notifies the path discovery agent
    that a flow has suffered a
    retransmission, the path
    discovery agent checks its cache
    of discovered path for that
    epoch…It then sends 15
    appropriately crafted TCP
    packets with TTL values ranging
    from 1–15.”
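
A rough sketch of the cache check and TTL sweep quoted above. The quote elides what happens on a cache hit, so skipping the probes in that case is an assumption of this example. `FiveTuple`, `PathCache`, and `probe_ttls` are hypothetical names, and no packets are actually sent.

```python
from typing import NamedTuple

MAX_TTL = 15  # the quoted text sends probes with TTL values 1 through 15


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


class PathCache:
    """Remembers, per epoch, which flows already have a discovered path."""

    def __init__(self):
        self._paths = {}

    def lookup(self, epoch, flow):
        return self._paths.get((epoch, flow))

    def store(self, epoch, flow, path):
        self._paths[(epoch, flow)] = path


def probe_ttls(cache, epoch, flow):
    """Return the TTLs to probe, or an empty list on a cache hit (assumed)."""
    if cache.lookup(epoch, flow) is not None:
        return []
    return list(range(1, MAX_TTL + 1))


flow = FiveTuple("10.0.0.1", 40000, "10.0.1.9", 443, "tcp")
cache = PathCache()
ttls = probe_ttls(cache, 7, flow)            # first retransmission in epoch 7
cache.store(7, flow, ["tor1", "t1a", "tor9"])
repeat = probe_ttls(cache, 7, flow)          # same flow, same epoch
```

Keying the cache on both epoch and flow means a path found in one epoch is not reused in the next, which matches the "for that epoch" wording in the quote.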

  22. Path Discovery Agent
    ENGINEERING CHALLENGES – ECMP
    • ECMP algorithms are
    unknown
    • All packets of a given flow,
    defined by the five-tuple,
    follow the same path
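
To illustrate the second bullet, here is a toy hash-based next-hop selection. Real ECMP hash functions are vendor-specific and, as the slide notes, unknown; the SHA-256 hash and the names below are stand-ins chosen only to show that an identical five-tuple always maps to the same next hop.

```python
import hashlib


def ecmp_next_hop(five_tuple, next_hops):
    """Pick a next hop by hashing the five-tuple (illustrative hash only)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical uplinks and flows.
uplinks = ["t1a", "t1b", "t1c", "t1d"]
flow = ("10.0.0.1", 40000, "10.0.1.9", 443, "tcp")
other_flow = ("10.0.0.1", 40001, "10.0.1.9", 443, "tcp")

first = ecmp_next_hop(flow, uplinks)
again = ecmp_next_hop(flow, uplinks)         # same five-tuple, same choice
other = ecmp_next_hop(other_flow, uplinks)   # may or may not differ
```

Because the choice depends only on the five-tuple, a probe that reuses the flow's five-tuple would be expected to follow the flow's path without anyone needing to know the actual hash function.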

  23. Path Discovery Agent
    ENGINEERING CHALLENGES – RE-ROUTING & PACKET DROPS
    • Traceroute itself may fail
    • A lossy link may cause one or
    more BGP sessions to fail,
    triggering rerouting

  24. Path Discovery Agent
    ENGINEERING CHALLENGES – ROUTER ALIASING
    • Have a pre-mapped topology
    of:
    • Switch/Router names
    • Router/interface IP
    addresses
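
One way a pre-mapped topology could resolve aliasing is sketched below. The IP addresses, device names, and the `hops_to_links` helper are invented for illustration; the real mapping would come from the datacenter's own inventory.

```python
# Hypothetical pre-mapped topology: interface IP -> switch/router name.
INTERFACE_TO_DEVICE = {
    "10.1.0.1": "tor1",
    "10.2.0.1": "t1a",
    "10.2.0.5": "t1a",  # second interface on the same device (an alias)
    "10.3.0.1": "t2x",
}


def hops_to_links(source_host, hop_ips, topology=INTERFACE_TO_DEVICE):
    """Convert traceroute hop IPs into a list of (device, device) links.

    Hops whose IP is missing from the topology map are kept as the raw IP,
    so an unmapped hop stays visible instead of being dropped silently.
    """
    devices = [source_host] + [topology.get(ip, ip) for ip in hop_ips]
    return list(zip(devices, devices[1:]))


via_first = hops_to_links("host1", ["10.1.0.1", "10.2.0.1", "10.3.0.1"])
via_alias = hops_to_links("host1", ["10.1.0.1", "10.2.0.5", "10.3.0.1"])
```

Both traces resolve to the same device-level links, so votes for the two interface addresses land on the same link rather than being split.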

  25. Analysis Agent

  26. Analysis Agent
    VOTING BASED SCHEME
    • Good votes are 0
    • Bad votes are 1/h, where h is the
    number of hops on the path
    • Each link on the path is given a
    vote
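
The voting rule can be sketched in a few lines. This is a minimal example under stated assumptions: paths are given as lists of link labels, h is taken to be the number of links in that list, and the labels and function names are made up. Flows without retransmissions cast votes of 0, so they are simply left out.

```python
from collections import defaultdict
from fractions import Fraction


def cast_votes(bad_flow_paths):
    """Give each link on a retransmitting flow's path a vote of 1/h."""
    votes = defaultdict(Fraction)
    for path in bad_flow_paths:
        h = len(path)  # h = number of hops (links) on this path
        for link in path:
            votes[link] += Fraction(1, h)
    return votes


def rank_links(votes):
    """Order links from most to least voted."""
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)


# Hypothetical paths of three flows that each saw a retransmission.
paths = [
    ["A-B", "B-C"],
    ["D-B", "B-C"],
    ["A-B", "B-E", "E-F"],
]
votes = cast_votes(paths)
ranking = rank_links(votes)
```

Here link `B-C` collects 1/2 + 1/2 = 1 and ranks first, while `A-B` collects 1/2 + 1/3 = 5/6. Using `Fraction` keeps the tallies exact, which avoids floating-point ties when comparing links.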

  27. Analysis Agent
    [Figure: worked voting example; links accumulate votes in
    increments of 1/2, and unaffected links stay at 0]

  28. Analysis Agent
    VOTING BASED SCHEME
    • Congestion & single drops are
    akin to noise
    • Single flow is unlikely to go
    through more than one failed
    link
    • Probability of errors in results
    diminishes exponentially with
    the number of flows

  29. Simulations

  30. Simulations
    PERFORMANCE
    • Accuracy: Proportion of
    correctly identified drop
    causes
    • Recall: How many of the
    failures are detected (false
    negatives)
    • Precision: How trusted are the
    results (false positives)
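
The three metrics above could be computed as follows. This is a sketch under assumptions: recall and precision are taken over sets of link identifiers, accuracy over per-flow attributions, and an empty input scores 1.0 by convention. The function names and sample data are hypothetical.

```python
def precision_recall(detected, actual):
    """Compute precision and recall over sets of link identifiers."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    # Convention assumed here: an empty set scores 1.0.
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


def accuracy(attributed, ground_truth):
    """Fraction of flows whose drop cause was attributed to the right link."""
    correct = sum(
        1 for flow, link in attributed.items()
        if ground_truth.get(flow) == link
    )
    return correct / len(attributed) if attributed else 1.0


# Hypothetical results: three links flagged, four links actually bad.
precision, recall = precision_recall({"L1", "L2", "L3"}, {"L1", "L2", "L4", "L5"})

# Hypothetical per-flow attributions versus the true cause of each drop.
flow_accuracy = accuracy(
    {"f1": "L1", "f2": "L2", "f3": "L3", "f4": "L1"},
    {"f1": "L1", "f2": "L2", "f3": "L4", "f4": "L1"},
)
```

In this example one flagged link is a false positive, which lowers precision, and two bad links are missed, which lowers recall.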

  31. Evaluation: Simulations
    PERFORMANCE: OPTIMAL CASE
    • 0.05-1% drop rate
    • Accuracy is > 96%
    • Recall/precision is almost
    always 100%
    https://github.com/behnazak/Vigil-007SourceCode

  32. Evaluation: Simulations
    PERFORMANCE: VARYING DROP RATES
    • Maintains accuracy for
    both single and multiple
    failures
    https://github.com/behnazak/Vigil-007SourceCode

  33. Evaluation: Simulations
    PERFORMANCE: IMPACT OF NOISE
    • Almost no impact
    https://github.com/behnazak/Vigil-007SourceCode

  34. Evaluation: Simulations
    PERFORMANCE: NUMBER OF CONNECTIONS
    • Almost no impact
    https://github.com/behnazak/Vigil-007SourceCode

  35. Evaluation: Simulations
    PERFORMANCE: TRAFFIC SKEWS
    • Can tolerate 50% skew
    • When TOR traffic >50% &
    >10 failures, accuracy
    suffers
    https://github.com/behnazak/Vigil-007SourceCode

  36. Evaluation: Simulations
    PERFORMANCE: BAD LINKS
    • 007 can detect up to 7
    failures with accuracy >
    90%
    https://github.com/behnazak/Vigil-007SourceCode

  37. Evaluation: Simulations
    PERFORMANCE: NETWORK SIZE
    • Single failure:
    • Accuracy >98% for up to 6
    pods
    • Multiple failures:
    • Accuracy >98.01% for 30
    failed links
    https://github.com/behnazak/Vigil-007SourceCode

  38. Evaluations: Production

  39. Evaluation: Production
    • 007 located the bad link
    correctly in 281 cases of VM
    reboot in the Microsoft DCN
    • Identifies an average of 0.45 ±
    0.12 links as bad per epoch
    • Of links dropping packets:
    • 48%: Server to TOR
    • 24%: T1–TOR
    • 6%: T2–T1

  40. Discussion

  41. Discussion
    • Congestion detection
    • Ranking with bias
    • Finding the cause of other
    problems
    • 007 can also be used for:
    • Detection of switch failures

  42. Questions?
