Slide 1

Slide 1 text

Communications and Networking Lab, NTHU
Microscope: Queue-based Performance Diagnosis for Network Functions
Junzhi Gong, Yuliang Li, Bilal Anwer, Aman Shaikh, and Minlan Yu
SIGCOMM 2020
Speaker: Chun-Fu Kuo
Date: 2021.04.22

Slide 2

Slide 2 text

Outline
■ Introduction
■ Survey on Performance Diagnosis
■ Problem Formulation
■ System Model
■ Proposed Method
■ Implementation
■ Evaluation
■ Conclusion
■ Pros and Cons

Slide 3

Slide 3 text

Introduction: Heavy Hitters (HHs)
■ Flows that account for a large share of the traffic
■ A heavy hitter could correspond to an individual flow or connection
■ It could also be an aggregation of multiple flows/connections that share some common property, but which themselves may not be heavy hitters
■ Analyzing HHs takes a lot of time and memory

Slide 4

Slide 4 text

Introduction: Hierarchical Heavy Hitters (HHHs)
■ Hierarchically aggregate some properties of HHs
  ■ E.g., IP prefix
■ Aggregations can be defined on one or more dimensions
  ■ E.g., src/dst IP address prefix, src/dst port, protocol
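The difference between a heavy hitter and a hierarchical heavy hitter can be sketched in a few lines of Python. This is a simplified illustration, not a full HHH algorithm: real HHH definitions usually discount traffic already attributed to heavier descendants, which is skipped here. The byte counts, the threshold, and the function names are all hypothetical.

```python
from collections import Counter
import ipaddress


def heavy_hitters(byte_counts, threshold):
    """Return the keys whose byte count is at least `threshold`."""
    return {k: v for k, v in byte_counts.items() if v >= threshold}


def aggregate_by_prefix(byte_counts, prefix_len):
    """Sum per-address byte counts into IPv4 prefixes of length `prefix_len`."""
    totals = Counter()
    for addr, nbytes in byte_counts.items():
        # strict=False lets us pass a host address and get its enclosing network.
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        totals[str(net)] += nbytes
    return dict(totals)


# Hypothetical per-source byte counts.
per_source = {
    "10.0.0.1": 400,
    "10.0.0.2": 350,
    "10.0.0.3": 300,
    "192.168.1.5": 1200,
}

# Individual sources: only 192.168.1.5 crosses the threshold.
hh = heavy_hitters(per_source, threshold=1000)

# Aggregated by /24: the three 10.0.0.x sources together also cross it.
hhh = heavy_hitters(aggregate_by_prefix(per_source, 24), threshold=1000)
```

None of the `10.0.0.x` sources is a heavy hitter on its own, but their shared /24 prefix is, which is the case the HHH notion is meant to capture.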

Slide 5

Slide 5 text

Introduction: CAIDA Traffic
■ CAIDA stands for Center for Applied Internet Data Analysis
  ■ Based at UC San Diego
■ CAIDA Traffic is a data set of 10G traces collected from high-speed monitors on commercial backbone links
  ■ From 2008 to 2019, provided in PCAP format
■ Often used in academia for fair evaluation
■ Anyone can apply for it, but an NDA is required

Slide 6

Slide 6 text

Survey on Performance Diagnosis: Survey with ISPs, DCs, Enterprises
■ Conducted in 2020.01, covering:
  ■ 4 small networks (< 1K hosts)
  ■ 6 medium networks (1 ~ 10K hosts)
  ■ 4 large networks (10 ~ 100K hosts)
  ■ 5 extra-large networks (> 100K hosts)
■ Common problems:
  ■ Low throughput
  ■ Intermittent events
  ■ Problems that affect only a single user

Slide 7

Slide 7 text

Survey on Performance Diagnosis: Survey with ISPs, DCs, Enterprises (Cont.)
■ Root causes of the problems:
  ■ NF bugs (15 operators)
  ■ Traffic bursts (12 operators)
  ■ Resource contention (7 operators)
  ■ Interrupts (5 operators)
■ Requirements for diagnosis tools:
  ■ Ranked list of root causes (12 operators)
  ■ Low overhead (12 operators)
  ■ High accuracy (9 operators)
  ■ Aggregated flows of each cause (7 operators)
■ Multiple NFs affect one another
■ Upstream NFs' traffic (6 operators)
■ Misconfiguration (8 operators)

Slide 8

Slide 8 text

Problem Formulation: Blame Game

Slide 9

Slide 9 text

Problem Formulation: Lasting impacts of microsecond-level behaviors
■ CAIDA traffic to a Firewall
■ Inject a burst flow at 570 µs, lasting for 340 µs
■ Result: the traffic experiences 3 ms of long latency
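Why a microsecond-level burst can hurt for milliseconds can be seen with a toy queue simulation. The sketch below is not the paper's experiment: the per-step rates are made-up integers, and only the burst timing (start at 570 µs, duration 340 µs) is taken from the slide. One simulation step stands for 10 µs.

```python
def simulate_queue(base, burst, service, burst_start, burst_end, steps):
    """Return the queue length after each step of a simple fluid-style queue.

    base, burst, service are packets per step; the burst is active for
    steps in [burst_start, burst_end).
    """
    queue = 0
    lengths = []
    for step in range(steps):
        arrivals = base + (burst if burst_start <= step < burst_end else 0)
        # The queue cannot go negative: idle capacity is simply lost.
        queue = max(0, queue + arrivals - service)
        lengths.append(queue)
    return lengths


# Hypothetical rates: the NF runs at 90% load, and the burst doubles the input.
lengths = simulate_queue(base=9, burst=10, service=10,
                         burst_start=57, burst_end=91, steps=500)

peak = max(lengths)                                   # backlog when the burst ends
drain_steps = sum(1 for q in lengths[91:] if q > 0)   # steps with a non-empty queue afterwards
```

With these assumed rates the burst leaves a backlog of 306 packets, and because only one packet of spare capacity is available per step, the queue stays non-empty for 305 more steps (about 3 ms). Packets that arrive long after the burst has ended still wait behind that backlog.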

Slide 10

Slide 10 text

Problem Formulation: Lasting impact propagates across NFs
■ The VPN receives two flows: Flow A and the flow from the NAT
■ An interrupt at the NAT produces a burst, causing a queue to build up at the VPN

Slide 11

Slide 11 text

Problem Formulation: Lasting impact propagates across NFs (Cont.)
■ The queued packets reduce the throughput of Flow A
■ Packets that arrive after 1.5 ms do not overlap with the interrupt, yet they are affected as well

Slide 12

Slide 12 text

Problem Formulation: Different impacts from similar behaviors

Slide 13

Slide 13 text

Problem Formulation: Different impacts from similar behaviors (Cont.)
■ When the NAT and the Monitor are interrupted at the same time, all of the flows experience different levels of packet loss
■ Who is the main culprit? NAT or Monitor?
  ■ It's hard to identify unless we refer to the input rate of the VPN
■ The authors want to quantify the impact of these behaviors

Slide 14

Slide 14 text

Problem Formulation: Roadmap
■ Challenge 1: impact propagation over time
  ■ Local diagnosis based on queuing period
■ Challenge 2: impact propagation across NFs
  ■ Propagation diagnosis based on timespan analysis
■ Challenge 3: too many root causes for too many victim packets
  ■ Pattern aggregation: use AutoFocus to aggregate diagnosis results

Slide 15

Slide 15 text

System Model

Slide 16

Slide 16 text

System Model

Slide 17

Slide 17 text

System Model: Symbols
■ $f$: current NF
■ $T$: length of the queuing period of NF $f$
■ $n_i(T)$: number of packets arriving during time $T$
■ $n_p(T)$: number of packets processed during time $T$
■ $r_i$: peak processing rate of an NF (with the same hardware/software settings)
■ $S^f_i$: number of extra input packets, compared to the number of packets that can be processed at the peak rate during $T$
■ $S^f_p$: number of fewer packets processed, compared to the number expected during $T$ (processing score)
■ Note: $n_i(T)$ and $n_p(T)$ are counting processes

Slide 18

Slide 18 text

System Model: Definitions
■ Queuing period $T$: the time period from the time when a queue starts building (from 0 packets) to the current time
■ Abnormality: the NF's performance is beyond 1 standard deviation computed over recent history
■ $PreSet(p)$: when packet $p$ arrives at NF $f$, the set of packets that have arrived during the queuing period $T$
■ Timespan of $PreSet(p)$: the time between the departures of the first and last packets of $PreSet(p)$ from the NF
■ $T_{source}$: timespan of $PreSet(p)$ at the source
■ $T_A$: timespan of $PreSet(p)$ at NF A
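These definitions can be made concrete with a small Python sketch. Several details are assumptions rather than something the slide states: the abnormality check is treated as two-sided and uses the sample standard deviation, $PreSet(p)$ is taken to include $p$ itself, and the timestamps (in µs) and packet names are invented.

```python
from statistics import mean, stdev


def is_abnormal(history, value):
    """Flag `value` if it lies more than one standard deviation from the mean of `history`."""
    return abs(value - mean(history)) > stdev(history)


def preset(arrivals, queue_start, p_arrival):
    """Packets that arrived between the start of the queuing period and p's arrival."""
    return [pkt for pkt, t in arrivals if queue_start <= t <= p_arrival]


def timespan(departures, packets):
    """Time between the first and the last departure of `packets` at one NF."""
    times = [departures[pkt] for pkt in packets]
    return max(times) - min(times)


# Hypothetical recent latency samples at an NF.
latency_history = [10, 12, 11, 9, 10, 11, 12, 9]

# Hypothetical arrival times at NF f and departure times at an upstream NF A.
arrivals_at_f = [("p0", 20), ("p1", 100), ("p2", 140), ("p3", 180), ("p4", 400)]
departures_at_a = {"p0": 5, "p1": 50, "p2": 60, "p3": 95, "p4": 300}

# The queue at f starts building at t=100 and packet p3 arrives at t=180.
pre = preset(arrivals_at_f, queue_start=100, p_arrival=180)
t_a = timespan(departures_at_a, pre)  # T_A for PreSet(p3)
```

Here `pre` contains `p1`, `p2`, and `p3`, and their departures from NF A span 45 µs. Comparing such timespans hop by hop is what the later propagation diagnosis relies on.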

Slide 19

Slide 19 text

Proposed Method: Collections
■ Microscope doesn't access the NFs' internal code
  ■ It only accesses the queue of each NF
■ It collects the following information:

Slide 20

Slide 20 text

Proposed Method: Local Diagnosis
■ Queue length: $S^f_i + S^f_p = n_i - n_p$
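Reading the symbol definitions literally gives $S^f_i = n_i(T) - r_i T$ (extra input beyond what the peak rate can absorb) and $S^f_p = r_i T - n_p(T)$ (processing shortfall relative to the peak rate), so the $r_i T$ terms cancel in the sum and leave the queue length. A minimal sketch of that arithmetic, with hypothetical numbers and function names:

```python
def local_scores(n_in, n_proc, peak_rate, period):
    """Return (input score, processing score) for one queuing period.

    n_in      : packets that arrived during the period, n_i(T)
    n_proc    : packets that were processed during the period, n_p(T)
    peak_rate : peak processing rate r_i, in packets per time unit
    period    : length T of the queuing period
    """
    expected = peak_rate * period   # packets the NF could process at peak rate
    s_input = n_in - expected       # S_i: extra input packets
    s_proc = expected - n_proc      # S_p: packets it failed to process
    return s_input, s_proc


# Hypothetical period: 1000 µs at a peak rate of 1 packet/µs.
s_i, s_p = local_scores(n_in=1200, n_proc=900, peak_rate=1.0, period=1000.0)
queue_len = s_i + s_p
```

In this example 200 of the 300 queued packets are blamed on excess input and 100 on slow local processing, so the two scores split the queue between an upstream cause and a local one.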

Slide 21

Slide 21 text

Proposed Method: Local Diagnosis
■ Focus on the whole queuing period
  → detect the cause even if it is no longer present at the current time

Slide 22

Slide 22 text

Proposed Method: Propagation Diagnosis
■ $S^f_i > 0$ means the input rate is higher than the peak processing rate
  → the queue must build up
■ Reasons:
  ■ Upstream NFs
  ■ Input source
■ Next, we discuss the causal relations among NFs

Slide 23

Slide 23 text

Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs
■ Identify which upstream NF is the culprit
  ■ Which NF makes the traffic bursty?
■ The timespan becomes shorter after NF B
  ■ NF B is the culprit
■ Based on how much shorter the timespan is, we can quantify the impact of NF B

Slide 24

Slide 24 text

Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs
■ What if the timespan becomes larger after an NF?
■ B makes the timespan larger
  • B is not a culprit
  • B mitigates impacts from A

Slide 25

Slide 25 text

Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs
(Figure: timespan changes along the chain, labeled increase, reduce, reduce)

Slide 26

Slide 26 text

Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs
■ The expected timespan of $f$ is $T_{exp} = n_i(T) / r_i$ (reference value)
■ Split the score $S^f_i$ proportionally, based on each hop's relative timespan reduction from the previous hops
■ For C
  ■ $S_{f \leftarrow C} = \frac{T_B - T_C}{T_{exp} - T_C} \cdot S^f_i$
■ For source
  ■ $S_{f \leftarrow source} = \frac{T_{exp} - T_{source}}{T_{exp} - T_C} \cdot S^f_i$
■ For A
  ■ $S_{f \leftarrow A} = \frac{T_{source} - T_B}{T_{exp} - T_C} \cdot S^f_i$
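The three numerators add up to $T_{exp} - T_C$, so the three shares always sum to $S^f_i$. The sketch below applies the slide's formulas to made-up timespans (in µs); the function name and the numbers are hypothetical.

```python
def split_input_score(s_input, t_exp, t_source, t_b, t_c):
    """Split the input score S_i of NF f among source, A, and C.

    Each party is charged in proportion to the timespan reduction
    attributed to it, out of the total reduction T_exp - T_C.
    """
    total = t_exp - t_c
    if total <= 0:
        raise ValueError("no timespan reduction to attribute")
    return {
        "source": (t_exp - t_source) / total * s_input,
        "A": (t_source - t_b) / total * s_input,
        "C": (t_b - t_c) / total * s_input,
    }


# Hypothetical timespans of PreSet(p): expected 500, at source 450,
# after B 300, after C 100; S_i of the victim NF f is 100 packets.
shares = split_input_score(s_input=100.0, t_exp=500.0,
                           t_source=450.0, t_b=300.0, t_c=100.0)
```

With these numbers the source is charged 12.5, A is charged 37.5, and C is charged 50.0, which together recover the full score of 100.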

Slide 27

Slide 27 text

Proposed Method: Propagation Diagnosis - Traverse a Chain of NFs
■ NF C reduces the timespan because of a queue built up by other packets; the reason could be:
  ■ Local processing problem
  ■ Input traffic
■ To address this problem, we need recursive diagnosis

Slide 28

Slide 28 text

Proposed Method: Propagation Diagnosis - Traverse a DAG of NFs
(Figure: decomposition and superposition)

Slide 29

Slide 29 text

Proposed Method: Propagation Diagnosis - Traverse a DAG of NFs
■ When $PreSet(p)$ goes through a DAG, the set of paths is called $PreSetPath(p)$
  ■ Packets on each path ≤ $PreSet(p)$
■ How to define the expected timespan of each path?
  ■ If packets fully interleave
    ■ Timespan of $f$ = timespan of B & C
  ■ If packets don't fully interleave
    ■ Timespan of $f$ ≥ timespan of B & C
    ■ Proportionally scale down all the scores to match $S^f_i$
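The last step, scaling the per-path scores so they match $S^f_i$, can be sketched as a simple normalization. This is only one plausible reading of the slide: it assumes the scores are rescaled only when their sum exceeds $S^f_i$, and the names and numbers are hypothetical.

```python
def scale_to_match(scores, s_input):
    """Scale per-NF scores down proportionally so that they sum to `s_input`.

    Scores whose sum does not exceed `s_input` are returned unchanged
    (an assumption; the slide only mentions scaling down).
    """
    total = sum(scores.values())
    if total <= s_input:
        return dict(scores)
    factor = s_input / total
    return {nf: score * factor for nf, score in scores.items()}


# Hypothetical scores computed independently on the paths through B and C.
path_scores = {"B": 150.0, "C": 50.0}
scaled = scale_to_match(path_scores, s_input=100.0)
```

The relative blame between B and C is preserved (3 to 1), while the total no longer exceeds the input score of the victim NF.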

Slide 30

Slide 30 text

Proposed Method: Recursive Diagnosis of PreSet Packets
■ Stop conditions:
  ■ Reach the source
  ■ No NF with positive $S_i$ remains

Slide 31

Slide 31 text

Proposed Method: Recursive Diagnosis of PreSet Packets

Slide 32

Slide 32 text

Proposed Method: Pattern Aggregation
■ Given many packet-level causal relations, the next step is to aggregate them into causal relation patterns

Slide 33

Slide 33 text

Proposed Method: Pattern Aggregation (Cont.)
(Figure: four example causal relations, each of the form "→ : score"; the operands were not captured)
■ AutoFocus
  • flow aggregates
    ◦ source IP prefix
    ◦ source port range
    ◦ destination IP prefix
    ◦ destination port range
    ◦ protocol set
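The idea of turning packet-level relations into patterns can be illustrated with a much simpler stand-in for AutoFocus. The sketch below aggregates on a single dimension (the culprit's source IP prefix) and keeps patterns above a share threshold; AutoFocus itself clusters over several dimensions hierarchically, which this does not attempt. The tuple layout, threshold, and data are hypothetical.

```python
from collections import defaultdict
import ipaddress


def aggregate_relations(relations, prefix_len, min_share):
    """Group (culprit_ip, culprit_nf, victim_nf, score) tuples into patterns.

    Relations are keyed by (culprit source prefix, culprit NF, victim NF);
    patterns holding less than `min_share` of the total score are dropped.
    The result is sorted by score, highest first.
    """
    totals = defaultdict(float)
    for culprit_ip, culprit_nf, victim_nf, score in relations:
        prefix = str(ipaddress.ip_network(f"{culprit_ip}/{prefix_len}", strict=False))
        totals[(prefix, culprit_nf, victim_nf)] += score
    grand_total = sum(totals.values())
    kept = {k: v for k, v in totals.items() if v >= min_share * grand_total}
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical packet-level causal relations with their scores.
relations = [
    ("10.1.2.3", "NAT", "VPN", 40.0),
    ("10.1.2.9", "NAT", "VPN", 35.0),
    ("172.16.0.4", "Firewall", "Monitor", 20.0),
    ("192.0.2.7", "Monitor", "VPN", 5.0),
]
patterns = aggregate_relations(relations, prefix_len=24, min_share=0.1)
```

The two `10.1.2.x` relations collapse into one pattern with score 75, the Firewall pattern survives with score 20, and the small Monitor relation falls below the 10% cutoff.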

Slide 34

Slide 34 text

Implementation
■ Microscope consists of:
  ■ Data collector (runtime)
    ■ Instruments the DPDK library's I/O functions to collect the required information
    ■ About 200 LOC
  ■ Diagnosis module (offline)
    ■ Finds the causal relations of victim packets
    ■ About 6000 LOC

Slide 35

Slide 35 text

Evaluation: Environment
■ NF
  ■ Use Click-DPDK
  ■ Each instance (VM) runs on a single CPU core
  ■ Use SR-IOV to share NIC resources
■ Use MoonGen (traffic generator) to send CAIDA 16 packets
  ■ Use 64-byte packets, since the performance of an NF is related to the number of packets
■ Hardware
  ■ Dell R730 (MoonGen): 10 cores, 32 GB RAM, 2-port 40 Gbps Mellanox ConnectX-3 Pro
  ■ Dell T640 (16 NFs): 2 × 10 cores, 128 GB RAM, 2-port 40 Gbps Mellanox ConnectX-3 Pro

Slide 36

Slide 36 text

Evaluation: Environment
■ Topology
  ■ 4 NF types, 16 instances in total
  ■ Load is balanced by hashing the packets
  ■ If a flow matches a rule in the Firewall, it is forwarded to the Monitor
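Hash-based load balancing of this kind can be sketched as follows. The slide does not say which fields or hash function the testbed uses, so the 5-tuple key, the CRC32 hash, and the instance count of 4 are all assumptions for illustration.

```python
import zlib


def pick_instance(five_tuple, n_instances):
    """Map a flow's 5-tuple to one of `n_instances` NF instances.

    All packets of the same flow produce the same key, so they are
    always sent to the same instance.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % n_instances


# Hypothetical flow: (src IP, dst IP, src port, dst port, protocol).
flow = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
idx = pick_instance(flow, n_instances=4)
```

Because the mapping is deterministic per flow, packet order within a flow is preserved, but a few heavy flows can still land on the same instance and overload it.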

Slide 37

Slide 37 text

Evaluation: Effect of Different Time Window Size
■ The authors use a 10 ms time window for NetMedic when comparing it with Microscope
■ NetMedic: SIGCOMM 2009, by Microsoft

Slide 38

Slide 38 text

Evaluation: Overall Accuracy
■ Rank is the culprit's position in the ranked list
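The rank metric can be computed from a diagnosis tool's scores as shown below. This is a minimal sketch: the scores and NF names are invented, and ties are broken by dictionary order, which the slide does not specify.

```python
def culprit_rank(scores, true_culprit):
    """Return the 1-based position of `true_culprit` when causes are sorted by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(true_culprit) + 1


# Hypothetical diagnosis output: higher score means more blame.
scores = {"NAT": 50.0, "Firewall": 30.0, "VPN": 20.0}
rank = culprit_rank(scores, "Firewall")
```

A rank of 1 means the tool put the true culprit at the top of its list; here the Firewall comes second, behind the NAT.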

Slide 39

Slide 39 text

Evaluation: Accuracy of Each Problem

Slide 40

Slide 40 text

Conclusion
■ Goal
  ■ Capture the root causes of performance problems among NFs
■ Method
  ■ Survey many companies
  ■ Propose the Microscope tool, which analyzes queues without any access to the NFs' code
■ Result
  ■ Diagnoses problems much more accurately than NetMedic

Slide 41

Slide 41 text

Pros & Cons
■ Pros
  ■ Novel idea of leveraging queues to diagnose performance problems
  ■ Surveys many companies to learn their needs
■ Cons
  ■ Lacks descriptions and illustrations of "traverse a DAG of NFs"
  ■ Does not indicate which part of the CAIDA traffic is used (5 seconds)