Heavy Hitters (HHs)
▪ Flows that occupy a large amount of traffic
▪ A heavy hitter could correspond to an individual flow or connection
▪ It could also be an aggregation of multiple flows/connections that share some common property, but which themselves may not be heavy hitters
▪ Analyzing HHs takes a lot of time & memory
Hierarchical Heavy Hitters (HHHs)
▪ Aggregations of flows that share some properties of HHs
  ▪ E.g., IP prefix
▪ Aggregations can be defined on one or more dimensions
  ▪ E.g., src/dst IP address prefix, src/dst port, protocol
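To make the HH/HHH distinction concrete, here is a small sketch (not from the slides) that thresholds per-flow byte counts for HHs and per-/24-prefix byte counts for HHHs; the packet records, the threshold, and the fixed /24 aggregation are all illustrative assumptions.

```python
from collections import defaultdict
from ipaddress import ip_network

# Hypothetical packet records: (src_ip, dst_ip, bytes); values are made up.
packets = [("10.0.1.5", "192.168.0.9", 1500),
           ("10.0.1.7", "192.168.0.9", 1500),
           ("10.0.2.1", "172.16.0.3", 64)]
THRESHOLD = 2000  # bytes; arbitrary cutoff for this toy example

# Heavy hitters: individual flows whose volume exceeds the threshold.
flow_bytes = defaultdict(int)
for src, dst, size in packets:
    flow_bytes[(src, dst)] += size
heavy_hitters = {flow for flow, b in flow_bytes.items() if b >= THRESHOLD}

# Hierarchical heavy hitters: aggregate flows by source /24 prefix, so flows
# that are not HHs on their own can still form an HHH together.
prefix_bytes = defaultdict(int)
for (src, _dst), size in flow_bytes.items():
    prefix_bytes[ip_network(f"{src}/24", strict=False)] += size
hhh = {prefix for prefix, b in prefix_bytes.items() if b >= THRESHOLD}

print(heavy_hitters)  # empty: no single flow reaches 2000 bytes
print(hhh)            # {IPv4Network('10.0.1.0/24')}: 3000 bytes in aggregate
```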
CAIDA Traffic
▪ CAIDA is the acronym of the Center for Applied Internet Data Analysis
  ▪ Hosted at UC San Diego
▪ CAIDA Traffic is a data set of 10G traces collected from high-speed monitors on commercial backbone links
  ▪ Collected from 2008 to 2019, provided in PCAP format
▪ Often used in academia for fair evaluation
▪ Anyone can apply for it, but an NDA is required
Survey with ISPs, DCs, Enterprises
▪ Conducted in January 2020, covering
  ▪ 4 small networks (< 1K hosts)
  ▪ 6 medium networks (1 ~ 10K hosts)
  ▪ 4 large networks (10 ~ 100K hosts)
  ▪ 5 extra-large networks (> 100K hosts)
▪ Common problems:
  ▪ Low throughput
  ▪ Intermittent events
  ▪ Problems that affect only a single user
Lasting impacts of microsecond-level behaviors
▪ Send traffic to a Firewall
▪ Inject a burst flow at 570 µs, lasting for 340 µs
▪ Result: the traffic experiences high latency for about 3 ms
Impact propagates across NFs (Cont.)
▪ Queued packets make the throughput of Flow A drop
▪ Although the packets arriving after 1.5 ms have no overlap with the interrupt, they are affected as well
Different impacts from similar behaviors (Cont.)
▪ When interrupts occur at the NAT & the Monitor at the same time, all of the flows experience different levels of packet loss
▪ Who is the main culprit, the NAT or the Monitor?
  ▪ It is hard to tell unless we refer to the input rate of the VPN
▪ The authors want to quantify the impact of these behaviors
Roadmap
▪ Challenge 1: impact propagation over time
  ▪ Local diagnosis based on queuing periods
▪ Challenge 2: impact propagation across NFs
  ▪ Propagation diagnosis based on timespan analysis
▪ Challenge 3: too many root causes for too many victim packets
  ▪ Pattern aggregation: use AutoFocus to aggregate diagnosis results
Symbols
▪ f: the NF under diagnosis
▪ T: length of the queuing period of NF f
▪ n_i(T): number of packets arriving during time T
▪ n_p(T): number of packets processed during time T
▪ r_f: peak processing rate of an NF (with the same hardware/software settings)
▪ S_i^f (input score): # of extra input packets, compared to the # of packets that can be processed at peak rate during T
▪ S_p^f (processing score): # of fewer packets processed, compared to the # expected during T
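The slide only names the symbols; written out, the two scores implied by these definitions are (my reconstruction, using the notation above):

```latex
% Reconstructed from the definitions above, not copied from the slides
\begin{align*}
  S_i^f &= n_i(T) - r_f\,T   && \text{(input score: extra arrivals beyond peak-rate capacity during } T\text{)} \\
  S_p^f &= r_f\,T - n_p(T)   && \text{(processing score: shortfall vs.\ expected processing during } T\text{)}
\end{align*}
```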
Definitions
▪ Queuing period: the time period from when a queue starts building (from 0 packets) to the current time
▪ Abnormality: the NF's performance is beyond 1 standard deviation computed over recent history
▪ PreSet(p): when packet p arrives at NF f, the set of packets that have arrived during the queuing period
▪ Timespan of PreSet(p): the time between when the first & last packets of PreSet(p) leave the NF
  ▪ T_source: timespan of PreSet(p) at the source
  ▪ T_A: timespan of PreSet(p) at NF A
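A minimal Python sketch of these definitions, assuming per-packet arrival/departure timestamps are available for one queuing period; the record format and function names are mine, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class PacketRecord:
    pkt_id: int
    arrival: float    # time the packet arrived at this observation point (µs)
    departure: float  # time the packet left this observation point (µs)

def preset(records, period_start, p_arrival):
    """PreSet(p): packets that arrived during the queuing period, up to p's arrival."""
    return [r for r in records if period_start <= r.arrival <= p_arrival]

def timespan(pre_set):
    """Timespan of PreSet(p) at this point: time between the first and last
    packet of the set leaving it (e.g., T_source at the source, T_A at NF A)."""
    departures = [r.departure for r in pre_set]
    return max(departures) - min(departures)
```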
Propagation Diagnosis
▪ S_i^f > 0: the input rate is higher than the peak processing rate → the queue must build up
▪ Possible reasons:
  ▪ Upstream NFs
  ▪ The input source
▪ We are going to discuss the causal relations among NFs
Propagation Diagnosis - Traverse a Chain of NFs
▪ Identify which upstream NF is the culprit
  ▪ Which NF makes the traffic bursty?
▪ The timespan becomes shorter after NF B
  ▪ NF B is the culprit
▪ Based on how much shorter the timespan is, we can quantify the impact of NF B
Propagation Diagnosis - Traverse a Chain of NFs
▪ What if the timespan becomes larger after an NF?
▪ B makes the timespan larger
  • B is not a culprit
  • B mitigates impacts from A
Propagation Diagnosis - Traverse a Chain of NFs
▪ The expected timespan of f is T_exp = n_i(T) / r_f (the reference value)
▪ For C: S^{f←C} = (T_B − T_C) / (T_exp − T_C) · S_i^f
▪ For source: S^{f←source} = (T_exp − T_source) / (T_exp − T_C) · S_i^f
▪ For A: S^{f←A} = (T_source − T_B) / (T_exp − T_C) · S_i^f
▪ Split the score S_i^f proportionally, based on each hop's relative timespan reduction from the previous hop
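A sketch of the score split for the chain in this example (source → A → B → C → f), directly transcribing the formulas above; the function and variable names are mine.

```python
def split_input_score(s_i_f, t_exp, t_source, t_b, t_c):
    """Split S_i^f among upstream hops in proportion to their timespan reduction.

    t_exp = n_i(T) / r_f is the reference value; t_source, t_b, t_c are the
    timespans of PreSet(p) at the source, after NF B, and after NF C.
    NF B, which enlarged the timespan (previous slide), receives no blame;
    A is only credited with the compression that survives B's mitigation.
    """
    denom = t_exp - t_c
    scores = {
        "source": (t_exp - t_source) / denom * s_i_f,
        "A":      (t_source - t_b)   / denom * s_i_f,
        "C":      (t_b - t_c)        / denom * s_i_f,
    }
    # The three fractions sum to 1, so all of S_i^f is accounted for.
    return scores
```

For example (numbers are mine), with T_exp = 12, T_source = 10, T_B = 8, T_C = 4 and S_i^f = 8, the split is source: 2, A: 2, C: 4.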
Propagation Diagnosis - Traverse a Chain of NFs
▪ NF C may reduce the timespan because of a queue built up by other packets; the reason could be:
  ▪ A local processing problem
  ▪ Its input traffic
▪ To address this, we need recursive diagnosis
Propagation Diagnosis - Traverse a DAG of NFs
▪ When PreSet(p) goes through a DAG, the set of paths is called PreSetPath(p)
  ▪ The packets on each path form a subset of PreSet(p)
▪ How to define the expected timespan of each path?
  ▪ If packets fully interleave: the timespan at f = the timespan at B & C
  ▪ If packets don't fully interleave: the timespan at f ≥ the timespan at B & C
▪ Proportionally scale down all the scores to match S_i^f
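The slides do not spell out the scaling step; one plausible reading of "proportionally scale down all the scores to match S_i^f" is sketched below (my interpretation, not the paper's code).

```python
def scale_path_scores(path_scores, s_i_f):
    """Scale per-path scores so their total does not exceed S_i^f.

    path_scores maps each path in PreSetPath(p) to the score computed as if
    that path were an independent chain. When packets do not fully interleave,
    those per-path scores can add up to more than S_i^f, so every path's score
    is scaled down by the same factor.
    """
    total = sum(path_scores.values())
    if total <= s_i_f or total == 0:
        return dict(path_scores)
    factor = s_i_f / total
    return {path: score * factor for path, score in path_scores.items()}
```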
AutoFocus
▪ Given many packet-level causal relations, the next step is to aggregate them into causal relation patterns
  ▪ Many relations of the form <culprit packets, culprit NF> → <victim packet, victim NF> : score
  ▪ are aggregated into <culprit flow aggregates, culprit NF set> → <victim flow aggregates, victim NF set> : score
▪ Flow aggregates are defined over:
  ▪ source IP prefix
  ▪ source port range
  ▪ destination IP prefix
  ▪ destination port range
  ▪ protocol set
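As an illustration of this aggregation step, the sketch below collapses packet-level causal relations into coarser patterns. It only groups by fixed /24 prefixes and sums the scores, so it is a simplified stand-in for AutoFocus (which searches over hierarchies of IP prefixes, port ranges, and protocols); all names and values are made up.

```python
from collections import defaultdict
from ipaddress import ip_network

# Hypothetical packet-level causal relations:
#   (culprit src IP, culprit NF, victim src IP, victim NF) -> score
relations = {
    ("10.0.1.5", "NAT", "10.0.9.2", "Firewall"): 3.0,
    ("10.0.1.7", "NAT", "10.0.9.4", "Firewall"): 2.0,
}

# Aggregate culprit/victim packets into /24 flow aggregates and sum the scores.
patterns = defaultdict(float)
for (c_ip, c_nf, v_ip, v_nf), score in relations.items():
    c_agg = ip_network(f"{c_ip}/24", strict=False)
    v_agg = ip_network(f"{v_ip}/24", strict=False)
    patterns[(c_agg, c_nf, v_agg, v_nf)] += score

for pattern, score in patterns.items():
    print(pattern, score)  # e.g. (10.0.1.0/24, NAT, 10.0.9.0/24, Firewall) 5.0
```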
Implementation
▪ Data collector (runtime)
  ▪ Instruments the DPDK library's I/O functions to collect the required information
  ▪ About 200 LOC
▪ Diagnosis module (offline)
  ▪ Finds the causal relations of victim packets
  ▪ About 6,000 LOC
Evaluation Environment
▪ Hardware
  ▪ Dell R730 (MoonGen): 10 cores, 32 GB RAM, 2-port 40 Gbps Mellanox ConnectX-3 Pro
  ▪ Dell T640 (16 NFs): 2 × 10 cores, 128 GB RAM, 2-port 40 Gbps Mellanox ConnectX-3 Pro
▪ Each NF instance (VM) runs on a single CPU core
▪ SR-IOV is used to share the NIC
▪ MoonGen (a traffic generator) is used to send CAIDA '16 trace packets
▪ 64-byte packets are used, since NF performance is related to the number of packets
Evaluation Environment
▪ Multiple NF types, 16 instances in total
▪ Load balancing is done by hashing the packets
▪ If a flow matches a rule in the Firewall, it is forwarded to the Monitor
Conclusion
▪ Goal: find the root cause of performance problems among NFs
▪ Method
  ▪ Surveyed many companies
  ▪ Proposed the Microscope tool, which analyzes queues without any access to the NFs' code
▪ Result
  ▪ Diagnoses the problems much more accurately than NetMedic
Pros & Cons
▪ Pros
  ▪ The idea of leveraging queues to diagnose performance problems
  ▪ Surveys of many companies to establish the actual needs
▪ Cons
  ▪ Lacks descriptions & illustrations for "traverse a DAG of NFs"
  ▪ Does not indicate which part of the CAIDA traffic is used (5 seconds)