Microscope: Queue-based Performance Diagnosis for Network Functions

Group meeting presentation of CANLAB in NTHU

JackKuo

April 22, 2021

Transcript

  1. Communications and Networking Lab, NTHU

    Microscope: Queue-based Performance Diagnosis for Network Functions
    Junzhi Gong, Yuliang Li, Bilal Anwer, Aman Shaikh, and Minlan Yu
    SIGCOMM 2020
    Speaker: Chun-Fu Kuo
    Date: 2021.04.22
  2. Outline

    ▪ Introduction
    ▪ Survey on Performance Diagnosis
    ▪ Problem Formulation
    ▪ System Model
    ▪ Proposed Method
    ▪ Implementation
    ▪ Evaluation
    ▪ Conclusion
    ▪ Pros and Cons
  3. Introduction: Heavy Hitters (HHs)

    ▪ Flows that account for a large share of the traffic
    ▪ A heavy hitter could correspond to an individual flow or connection
    ▪ It could also be an aggregation of multiple flows/connections that share some common property, but which themselves may not be heavy hitters
    ▪ Analyzing HHs takes a lot of time and memory
  4. Introduction: Hierarchical Heavy Hitters (HHHs)

    ▪ Hierarchically aggregate HHs by some of their properties
      ▪ E.g., IP prefix
    ▪ Aggregations can be defined on one or more dimensions
      ▪ E.g., src/dst IP address prefix, src/dst port, protocol
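
The difference between HHs and HHHs can be sketched in a few lines of Python. This is only an illustration: the byte counts, the threshold, and the helper names `heavy_hitters` and `aggregate_by_prefix` are made up, and the sketch uses a single dimension (source IP prefix). It also skips the usual HHH refinement of discounting traffic already attributed to heavy descendants.

```python
from collections import Counter
import ipaddress

def heavy_hitters(byte_counts, threshold):
    """Return the keys whose byte count is at least `threshold`."""
    return {k: v for k, v in byte_counts.items() if v >= threshold}

def aggregate_by_prefix(byte_counts, prefix_len):
    """Sum per-address byte counts into source-IP prefixes of `prefix_len` bits."""
    totals = Counter()
    for addr, nbytes in byte_counts.items():
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        totals[str(net)] += nbytes
    return dict(totals)

# Hypothetical per-source byte counts.
flows = {"10.0.0.1": 40, "10.0.0.2": 35, "10.0.1.9": 30, "192.168.1.5": 120}

hh = heavy_hitters(flows, threshold=100)                            # single flows
hhh = heavy_hitters(aggregate_by_prefix(flows, 16), threshold=100)  # /16 aggregates
print(hh)
print(hhh)
```

None of the 10.0.x.x sources is a heavy hitter on its own, but their /16 aggregate crosses the threshold, which is the case the HHH slide describes.
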
  5. Introduction: CAIDA Traffic

    ▪ CAIDA stands for Center for Applied Internet Data Analysis
    ▪ Hosted at UC San Diego
    ▪ CAIDA Traffic is a data set of 10G traces collected from high-speed monitors on commercial backbone links
    ▪ From 2008 to 2019, provided in PCAP format
    ▪ Often used in academia for fair evaluation
    ▪ Anyone can apply for it, but an NDA is required
  6. Survey on Performance Diagnosis: Survey with ISPs, DCs, Enterprises

    ▪ Conducted in 2020.01, which includes
      ▪ 4 small networks (< 1K hosts)
      ▪ 6 medium networks (1 ~ 10K hosts)
      ▪ 4 large networks (10 ~ 100K hosts)
      ▪ 5 extra-large networks (> 100K hosts)
    ▪ Common problems:
      ▪ Low throughput
      ▪ Intermittent events
      ▪ Problems that affect only a single user
  7. Survey on Performance Diagnosis: Survey with ISPs, DCs, Enterprises (Cont.)

    ▪ Root causes of the problems:
      ▪ NF bugs (15 operators)
      ▪ Traffic bursts (12 operators)
      ▪ Resource contention (7 operators)
      ▪ Interrupt (5 operators)
    ▪ Requirements for diagnosis tools:
      ▪ Ranked list of root causes (12 operators)
      ▪ Low overhead (12 operators)
      ▪ High accuracy (9 operators)
      ▪ Aggregated flows of each cause (7 operators)
    ▪ Multiple NFs affect each other
      ▪ Upstream NFs’ traffic (6 operators)
    ▪ Misconfiguration (8 operators)
  8. Problem Formulation: Blame Game

  9. Problem Formulation: Lasting impacts of microsecond-level behaviors

    ▪ Send CAIDA traffic to a Firewall
    ▪ Inject a burst flow at 570 µs, lasting for 340 µs
    ▪ Result: the traffic experiences 3 ms of long latency
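
A toy FIFO queue model shows why a short burst can hurt for much longer than it lasts. This is a rough sketch with made-up rates, not the authors' experiment: one tick stands for 10 µs, the NF serves 10 packets per tick, the background load is 9 packets per tick, and the function name `simulate_queue` is hypothetical.

```python
def simulate_queue(base, burst, service, burst_start, burst_len, horizon):
    """Return the backlog (in packets) at the end of every tick of a FIFO queue."""
    backlog, trace = 0, []
    for tick in range(horizon):
        in_burst = burst_start <= tick < burst_start + burst_len
        arrivals = base + (burst if in_burst else 0)
        backlog = max(0, backlog + arrivals - service)  # a queue cannot go negative
        trace.append(backlog)
    return trace

# Hypothetical numbers: burst from tick 57 (570 us) for 34 ticks (340 us).
trace = simulate_queue(base=9, burst=10, service=10,
                       burst_start=57, burst_len=34, horizon=500)
busy = [tick for tick, q in enumerate(trace) if q > 0]
print("peak backlog (packets):", max(trace))
print("queue non-empty for (us):", (busy[-1] - busy[0] + 1) * 10)
```

With these numbers the queue drains only one packet per tick once the burst ends, so the backlog built during 340 µs takes several milliseconds to clear, and every packet arriving in that window is delayed.
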
  10. Problem Formulation: Lasting impact propagates across NFs

    ▪ VPN receives 2 flows: Flow A and the flow coming from NAT
    ▪ NAT’s interrupt causes a burst, which builds up the queue in VPN
  11. Problem Formulation: Lasting impact propagates across NFs (Cont.)

    ▪ Queued packets make the throughput of Flow A drop
    ▪ Packets that arrive after 1.5 ms do not overlap with the interrupt, yet they are affected as well
  12. Problem Formulation: Different impacts from similar behaviors
  13. Problem Formulation: Different impacts from similar behaviors (Cont.)

    ▪ When NAT & Monitor are interrupted at the same time, all of the flows experience different levels of packet loss
    ▪ Who is the main culprit? NAT or Monitor?
    ▪ It’s hard to identify unless we refer to the input rate of VPN
    ▪ The authors want to quantify the impact of these behaviors
  14. Problem Formulation: Roadmap

    ▪ Challenge 1: impact propagation over time
      ▪ Local diagnosis based on queuing period
    ▪ Challenge 2: impact propagation across NFs
      ▪ Propagation diagnosis based on timespan analysis
    ▪ Challenge 3: too many root causes for too many victim packets
      ▪ Pattern aggregation: use AutoFocus to aggregate diagnosis results
  15. System Model

  16. System Model

  17. System Model: Symbols

    ▪ $f$: current NF
    ▪ $T$: length of the queuing period of NF $f$
    ▪ $n_i(T)$: number of packets arriving during time $T$
    ▪ $n_p(T)$: number of packets processed during time $T$
    ▪ $r_i$: peak processing rate of an NF (with the same hardware/software settings)
    ▪ $S^f_i$: number of extra input packets, compared to the number of packets that can be processed at peak rate during $T$
    ▪ $S^f_p$: number of fewer packets processed, compared to the number expected during $T$ (processing score)
    ▪ Slide annotation: counting process 😆
  18. System Model: Definitions

    ▪ Queuing period: the time period from the time when a queue starts building (from 0 packets) to the current time
    ▪ Abnormality: the NF’s performance is beyond 1 standard deviation computed over recent history
    ▪ PreSet($p$): when packet $p$ arrives at NF $f$, the set of packets that have arrived during the queuing period $T$
    ▪ Timespan of PreSet($p$): the time between the first & last packets in PreSet($p$) leaving the NF
    ▪ $T_{source}$: timespan of PreSet($p$) in source
    ▪ $T_A$: timespan of PreSet($p$) in NF $A$
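
The queuing period, PreSet(p), and timespan definitions can be sketched from a per-packet trace. The record layout `(pkt_id, arrival, departure)`, the helper names `presets` and `timespan`, and the choice to include p itself in PreSet(p) are assumptions made for this example, not details taken from the slides.

```python
def presets(packets):
    """packets: (pkt_id, arrival, departure) tuples sorted by arrival at one FIFO NF.

    Returns {pkt_id: ids that arrived in the same queuing period, up to and
    including that packet}.
    """
    result, current, last_departure = {}, [], float("-inf")
    for pkt_id, arrival, departure in packets:
        if arrival >= last_departure:  # queue was empty: a new queuing period starts
            current = []
        current.append(pkt_id)
        result[pkt_id] = list(current)
        last_departure = max(last_departure, departure)
    return result

def timespan(preset_ids, departures):
    """Time between the first and last packet of the set leaving an NF."""
    times = [departures[i] for i in preset_ids]
    return max(times) - min(times)

# Hypothetical trace: a, b, c share one queuing period; d finds the queue empty.
trace = [("a", 0, 2), ("b", 1, 4), ("c", 3, 6), ("d", 10, 12)]
sets = presets(trace)
departures = {pkt_id: dep for pkt_id, _, dep in trace}
print(sets["c"], timespan(sets["c"], departures))
print(sets["d"])
```

The same `timespan` helper can be applied to the departure times recorded at the source or at any upstream NF, which gives values such as T_source and T_A.
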
  19. Proposed Method: Collections

    ▪ Microscope doesn’t access NFs’ internal code
    ▪ It only accesses the queue of each NF
    ▪ The information it collects is as follows:
  20. Proposed Method: Local Diagnosis

    ▪ Queue length: $S^f_i + S^f_p = n_i - n_p$
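
The two local scores follow from the symbol definitions: extra input relative to what the peak rate could absorb, and processing shortfall relative to the peak rate. A small sketch, assuming the expected number of processed packets is peak rate times queuing period (the function name `local_scores` and the numbers are made up):

```python
def local_scores(n_in, n_processed, peak_rate, period):
    """Return (S_i, S_p) for one queuing period of an NF."""
    expected = peak_rate * period   # packets the NF could process at peak rate
    s_i = n_in - expected           # extra input beyond what peak rate absorbs
    s_p = expected - n_processed    # packets the NF failed to process
    return s_i, s_p

# Hypothetical queuing period: 100 us at a peak rate of 1 packet/us.
s_i, s_p = local_scores(n_in=150, n_processed=90, peak_rate=1, period=100)
queue_length = 150 - 90
print(s_i, s_p, queue_length)
```

The expected-packets term cancels in the sum, so S_i + S_p always equals the queue length n_i - n_p, as stated on the slide.
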
  21. Proposed Method: Local Diagnosis

    ▪ Focus on the whole queuing period
      → detect the cause even if it no longer exists at the current time
  22. Proposed Method: Propagation Diagnosis

    ▪ $S^f_i > 0$ means the input rate is higher than the peak processing rate
      → queue must build up
    ▪ Reasons:
      ▪ Upstream NFs
      ▪ Input source
    ▪ We’re going to discuss the causal relations among NFs
  23. Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs

    ▪ Identify which upstream NF is the culprit
      ▪ Which NF makes the traffic bursty
    ▪ Timespan becomes shorter after NF B
      ▪ NF B is the culprit
    ▪ Based on how much shorter the timespan becomes, we can quantify the impact of NF B
  24. Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs

    ▪ What if the timespan becomes larger after an NF?
    ▪ B makes the timespan larger
      • B is not a culprit
      • B mitigates impacts from A
  25. Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs

    Figure labels: increase, reduce, reduce
  26. Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs

    ▪ The expected timespan of $f$ is $T_{exp} = n_i(T) / r_i$
    ▪ For C
      ▪ $S_{f \leftarrow C} = \frac{T_B - T_C}{T_{exp} - T_C} \cdot S^f_i$
    ▪ For source
      ▪ $S_{f \leftarrow source} = \frac{T_{exp} - T_{source}}{T_{exp} - T_C} \cdot S^f_i$
    ▪ For A
      ▪ $S_{f \leftarrow A} = \frac{T_{source} - T_B}{T_{exp} - T_C} \cdot S^f_i$
    ▪ Slide annotations: reference value; split score $S^f_i$ proportionally based on their relative timespan reduction from previous hops
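
The three formulas on this slide can be checked numerically. The sketch below hard-codes the chain that the formulas appear to describe (source → A → B → C → f, with B enlarging the timespan and therefore receiving no score); the function name `split_input_score` and all numbers are made up.

```python
def split_input_score(s_i, t_exp, t_source, t_b, t_c):
    """Split the input score S_i of NF f among source, A, and C.

    Each share is that hop's timespan reduction relative to the total
    reduction from the expected timespan T_exp down to T_C.
    """
    total = t_exp - t_c
    return {
        "source": (t_exp - t_source) * s_i / total,
        "A": (t_source - t_b) * s_i / total,  # A's reduction net of B's mitigation
        "C": (t_b - t_c) * s_i / total,
    }

# Hypothetical timespans in microseconds.
scores = split_input_score(s_i=60, t_exp=100, t_source=90, t_b=80, t_c=40)
print(scores)
print(sum(scores.values()))
```

The three numerators add up to T_exp - T_C, so the shares always sum to S_i, which is what splitting the score proportionally requires.
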
  27. Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs

    ▪ NF C reduces the timespan because of a queue built up by other packets; the reason could be:
      ▪ Local processing problem
      ▪ Input traffic
    ▪ To address this problem, we need recursive diagnosis
    ▪ Slide annotation: split score $S^f_i$ proportionally based on their relative timespan reduction from previous hops
  28. Proposed Method: Propagation Diagnosis - Traverse a DAG of NFs

    Figure labels: decomposition, superposition
  29. Proposed Method: Propagation Diagnosis - Traverse a DAG of NFs

    ▪ When PreSet($p$) goes through a DAG, the set of paths is called PreSetPath($p$)
    ▪ Packets on each path ≤ PreSet($p$)
    ▪ How to define the expected timespan of each path?
      ▪ If packets fully interleave
        ▪ Timespan of $f$ = timespan of B & C
      ▪ If packets don’t fully interleave
        ▪ Timespan of $f$ ≥ timespan of B & C
        ▪ Proportionally scale down all the scores to match $S^f_i$
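
The last bullet (proportionally scaling the scores down to match the input score) can be read as a simple normalization. The sketch below is one possible interpretation, not the paper's exact procedure; the function name `scale_to_match` and the per-path scores are hypothetical.

```python
def scale_to_match(path_scores, s_i):
    """Scale all per-path scores by one factor so that they sum to s_i."""
    total = sum(path_scores.values())
    if total == 0:
        return dict(path_scores)  # nothing to scale
    return {name: score * s_i / total for name, score in path_scores.items()}

# Hypothetical: paths through B and C were scored independently and overshoot S_i = 40.
scaled = scale_to_match({"B": 30, "C": 50}, s_i=40)
print(scaled)
```

Using a single common factor keeps the ratio between the paths unchanged, so the relative blame of B and C is preserved.
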
  30. Proposed Method: Recursive Diagnosis of PreSet Packets

    ▪ Stop conditions:
      ▪ Reach source
      ▪ No NF with positive $S_i$ remains
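
The two stop conditions suggest a recursion that pushes blame upstream until it reaches the source or an NF whose input score is not positive. The sketch below is a heavily simplified model under assumptions that are not on the slides: each NF has precomputed local scores `(S_i, S_p)`, a score is divided between local processing and input in proportion to those two values, and the upstream split fractions are given directly instead of being derived from timespans. All names and numbers are hypothetical.

```python
def diagnose(nf, score, local, upstream, blame=None):
    """Recursively attribute `score` observed at `nf` to NFs and the source.

    local[nf]    = (s_i, s_p) for nf's queuing period (assumed precomputed)
    upstream[nf] = {upstream_name: fraction of nf's input score}, summing to 1
    """
    blame = {} if blame is None else blame
    if nf == "source":                 # stop condition 1: reached the source
        blame["source"] = blame.get("source", 0) + score
        return blame
    s_i, s_p = (max(x, 0) for x in local[nf])
    if s_i == 0:                       # stop condition 2: no positive S_i here
        blame[nf] = blame.get(nf, 0) + score
        return blame
    input_part = score * s_i / (s_i + s_p)  # assumed proportional split
    blame[nf] = blame.get(nf, 0) + score - input_part
    for up, fraction in upstream[nf].items():
        diagnose(up, input_part * fraction, local, upstream, blame)
    return blame

# Hypothetical two-NF example: VPN is the victim, NAT sits upstream of it.
local = {"VPN": (40, 10), "NAT": (0, 20)}
upstream = {"VPN": {"NAT": 0.75, "source": 0.25}}
blame = diagnose("VPN", 50, local, upstream)
print(blame)
```

In this example most of the queue at VPN is traced back to NAT's own processing, because NAT has no positive input score to pass further upstream.
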
  31. Proposed Method: Recursive Diagnosis of PreSet Packets
  32. Proposed Method: Pattern Aggregation

    ▪ Given many packet-level causal relations, the next step is to aggregate them into causal relation patterns
  33. Proposed Method: Pattern Aggregation

    ▪ Given many packet-level causal relations, the next step is to aggregate them into causal relation patterns
    ▪ Packet-level relations:
      <culprit packets, culprit NF> → <victim packet, victim NF> : score
      <culprit packets, culprit NF> → <victim packet, victim NF> : score
      <culprit packets, culprit NF> → <victim packet, victim NF> : score
    ▪ Aggregated by AutoFocus into:
      <culprit flow aggregates, culprit NF set> → <victim flow aggregates, victim NF set> : score
    • flow aggregates
      ◦ source IP prefix
      ◦ source port range
      ◦ destination IP prefix
      ◦ destination port range
      ◦ protocol set
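
To make the aggregation step concrete, here is a deliberately simple group-by on fixed-length source prefixes. It is not AutoFocus, which searches over multi-dimensional hierarchies; the record layout, the `/24` prefix length, and the function names are assumptions for illustration only.

```python
from collections import defaultdict
import ipaddress

def prefix(ip, bits=24):
    """Map an IPv4 address to its enclosing prefix of `bits` bits."""
    return str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))

def aggregate(relations, bits=24):
    """relations: (culprit_src_ip, culprit_nf, victim_src_ip, victim_nf, score).

    Returns patterns sorted by total score, highest first.
    """
    patterns = defaultdict(float)
    for c_ip, c_nf, v_ip, v_nf, score in relations:
        patterns[(prefix(c_ip, bits), c_nf, prefix(v_ip, bits), v_nf)] += score
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical packet-level causal relations.
relations = [
    ("10.0.0.1", "NAT", "10.1.0.7", "VPN", 3.0),
    ("10.0.0.9", "NAT", "10.1.0.8", "VPN", 2.0),
    ("10.2.0.1", "Monitor", "10.1.0.7", "VPN", 1.0),
]
patterns = aggregate(relations)
for pattern, score in patterns:
    print(pattern, score)
```

The sorted output doubles as a ranked list of culprit patterns, which is the form the surveyed operators asked for.
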
  34. Implementation

    ▪ Microscope consists of
      ▪ Data collector (runtime)
        ▪ Instruments the DPDK library’s I/O functions to collect the required information
        ▪ About 200 LOC
      ▪ Diagnosis module (offline)
        ▪ Finds the causal relations of victim packets
        ▪ About 6000 LOC
  35. Evaluation Environment

    ▪ NF
      ▪ Use Click-DPDK
      ▪ Each instance (VM) runs on a single CPU core
      ▪ Use SR-IOV to share NIC resources
    ▪ Use MoonGen (traffic generator) to send CAIDA 16 packets
    ▪ Use 64-byte packets
      ▪ Since the performance of an NF is related to the number of packets
    ▪ Hardware
      ▪ Dell R730 (MoonGen): 10 cores, 32 GB RAM, 2-port 40 Gbps Mellanox ConnectX-3 Pro
      ▪ Dell T640 (16 NFs): 2 * 10 cores, 128 GB RAM, 2-port 40 Gbps Mellanox ConnectX-3 Pro
  36. Evaluation Environment

    ▪ Topology
      ▪ 4 NF types, 16 instances in total
      ▪ Load is balanced by hashing the packets
      ▪ If a flow matches a rule in Firewall, it is forwarded to the Monitor
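
Hash-based load balancing across NF instances can be sketched as below. The slide does not say which fields are hashed or which hash function is used, so the 5-tuple key, CRC32, and the function name `pick_instance` are assumptions.

```python
import zlib

def pick_instance(five_tuple, n_instances):
    """Map a flow's 5-tuple to one of `n_instances` NF instances."""
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % n_instances

# Hypothetical flow: (src IP, dst IP, src port, dst port, protocol).
flow = ("10.0.0.1", "10.1.0.7", 12345, 443, "tcp")
first = pick_instance(flow, 4)
second = pick_instance(flow, 4)
print(first, second)
```

Hashing on the flow identifier keeps all packets of one flow on the same instance, which matters for stateful NFs such as NAT.
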
  37. Evaluation: Effect of Different Time Window Size

    ▪ The authors use a 10 ms time window for NetMedic when comparing it with Microscope
    ▪ NetMedic: SIGCOMM 2009, by Microsoft
  38. Evaluation: Overall Accuracy

    ▪ Rank is the culprit’s position in the ranked list
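
The rank metric is easy to state in code. A minimal sketch, assuming the diagnosis output is a mapping from candidate culprit to score and that rank 1 is the best possible result (the function name `culprit_rank` and the scores are made up):

```python
def culprit_rank(scores, true_culprit):
    """Return the 1-based position of the true culprit in the ranked list."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(true_culprit) + 1

# Hypothetical diagnosis output for one victim.
scores = {"NAT": 30, "VPN": 10, "source": 5}
rank_nat = culprit_rank(scores, "NAT")
rank_vpn = culprit_rank(scores, "VPN")
print(rank_nat, rank_vpn)
```
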
  39. Evaluation: Accuracy of Each Problem
  40. Conclusion

    ▪ Goal
      ▪ Capture the root cause of performance problems among NFs
    ▪ Method
      ▪ Survey many companies
      ▪ Propose the Microscope tool, which analyzes queues without any access to the NFs’ code
    ▪ Result
      ▪ Diagnoses the problems much more accurately than NetMedic
  41. Pros & Cons

    ▪ Pros
      ▪ Novel idea of leveraging queues to diagnose performance problems
      ▪ Surveyed many companies to learn their needs
    ▪ Cons
      ▪ Lacks descriptions & illustrations of "traverse a DAG of NFs"
      ▪ Doesn’t indicate which part of the CAIDA traffic was used (5 seconds)