Slide 1

Slide 1 text

Contention-Aware Performance Prediction For Virtualized Network Functions
Antonis Manousis, Rahul Anand Sharma, Vyas Sekar, and Justine Sherry
SIGCOMM 2020
Speaker: Chun-Fu Kuo (Communications and Networking Lab, NTHU)
Date: 2020.12.03

Slide 2

Slide 2 text

Outline
■ Introduction
■ Problem Formulation
■ Environment
■ System Model
■ Proposed Method
■ Evaluation
■ Use Case
■ Conclusion
■ Pros and Cons

Slide 3

Slide 3 text

Introduction
■ Non-uniform memory access (NUMA)
■ Symmetric multiprocessing (SMP)

Slide 4

Slide 4 text

Introduction - Cache Writing Policy
■ Write-back:
  ■ The CPU only writes data to the cache and marks it dirty
  ■ When the cache line is evicted, the data is written back to memory
■ Write-through:
  ■ The CPU writes data to both the cache and memory
■ Post-write:
  ■ The CPU writes data to both the cache and a write buffer, which writes back to memory at an appropriate time

Slide 5

Slide 5 text

Introduction - Intel Performance Counter Monitor (PCM)
■ A monitor for CPU, memory, and PCIe resources:
  ■ CPU frequency
  ■ Instructions per cycle
  ■ Cache status
  ■ Memory bandwidth
  ■ PCIe bandwidth
  ■ Power consumption
■ Originally released by Intel; now renamed Processor Counter Monitor and maintained by the community

Slide 6

Slide 6 text

Introduction - Intel Cache Allocation Technology (CAT)
■ A performance isolation technique for the LLC
■ Why:
  ■ If a noisy neighbor process occupies a lot of cache, co-located processes can suffer from cache misses
  ■ This matters for cloud computing tenants

Slide 7

Slide 7 text

Introduction - Intel Data Direct I/O Technology (DDIO)
■ Allows the NIC to place packet data directly into the cache (LLC, L2)
■ Bypasses slow main memory

Slide 8

Slide 8 text

Problem Formulation - Contention-Induced Throughput Drop
■ Co-resident NFs can interfere with each other
  ■ Because they share hardware resources
  ■ Primarily in the memory subsystem

Slide 9

Slide 9 text

Problem Formulation - Prediction Problem
■ Input NFs: S = {NF_i}
■ Target NF: NF_target ∈ S
  ■ An NF whose performance drop we would like to estimate
■ Competing workload: Comp_j = S \ {NF_target}
  ■ The set of NFs the target may be co-located with
■ Hardware configuration: Arch_k
■ NFs exhibit extreme interactions with the cache hierarchy
  ■ High data-structure reuse (e.g., rules, routing tables)
  ■ Low packet data reuse

Slide 10

Slide 10 text

Problem Formulation - Existing Approaches: Performance Prediction
■ Dobrescu's method
  ■ Memory contention is the key source of throughput drop
  ■ Models memory as a monolithic source
  ■ Single metric: cache access rate (CAR)
■ BubbleUp
  ■ Models memory as a monolithic source
  ■ Single metric: working set size of competing workloads (cache occupancy)

Slide 11

Slide 11 text

Problem Formulation - Existing Approaches: Performance Isolation
■ ResQ
  ■ Argues in favor of isolating shared resources
    ■ To prevent contention-induced performance degradation
  ■ Leverages the CAT technique and DPDK packet buffer sizing
  ■ Provides dedicated, non-overlapping LLC partitions

Slide 12

Slide 12 text

Problem Formulation - Performance Isolation Problem
■ An incomplete solution for contention-induced slowdown
  ■ Partitioning tools fail to isolate all sources of contention
  ■ Also, isolation leads to inefficient resource utilization

Slide 13

Slide 13 text

Problem Formulation - Sources of Contention
■ Although previous work indicated that the source of contention is the memory subsystem
■ The authors found that contention is multifaceted

Slide 14

Slide 14 text

Problem Formulation - Sources of Contention
■ Contention in the LLC
  ■ Contention that compromises fast access to auxiliary data structures containing the data necessary for packet processing
■ Contention for DDIO
  ■ Contention that slows down packets on the direct path between the NICs and the LLC
■ Contention for main memory bandwidth
  ■ Contention that increases the latency to service an LLC miss from main memory

Slide 15

Slide 15 text

Environment
■ Intel Xeon E5-2620 v4 (Broadwell)
  ■ Intel XL710 40 Gbps NIC x 2
■ Intel Xeon Silver 4110 (Skylake)
  ■ Mellanox MT27700 100 Gbps NIC x 2
■ SR-IOV is used to share NIC resources with the NFs

Slide 16

Slide 16 text

System Model

Slide 17

Slide 17 text

Problem Formulation - LLC Contention Depends on (1) Cache Occupancy and (2) Cache Access Rate
■ No DDIO in this test; competing NFs use separate memory channels
■ The red line marks exhaustion of the available LLC space
  ■ Before the red line: occupancy is the best predictor of performance
  ■ After the red line: cache access rate (CAR) is the best predictor of performance

Slide 18

Slide 18 text

Problem Formulation - DDIO Contention Depends on (1) Competitors' Space Utilization and (2) Access Rate
■ DDIO partitions the LLC into a primary cache and an I/O cache
■ Contention can occur when the total number of packets exceeds the space available in the I/O cache (even though the LLC remains underloaded)

Slide 19

Slide 19 text

Problem Formulation - Main Memory Latency Depends on Total Memory Bandwidth Consumption
■ The CAT technique is used to isolate the LLC
■ The cache miss rate is stable, but throughput still goes down

Slide 20

Slide 20 text

Proposed Method - SLOMO
■ Two logical parts:
  ■ Offline: characterize the contentiousness and model the sensitivity of an NF instance
  ■ Online: predict the performance of an NF instance against a mix of real competitors

Slide 21

Slide 21 text

Proposed Method - SLOMO Introduction: Offline Profiling
■ S = {NF_i, ...}: set of NFs
■ Arch_k: server architecture
■ (NF_i, Arch_k): contentiousness tuple
■ V^x: synthetic contentiousness vector
■ P_i^x: performance of NF_i in response to the synthetic contentiousness vector V^x
■ M_i: V → P: sensitivity model, trained on {(V^x, P_i^x), ...}
■ The operator runs each NF_i with multiple configurations (a tunable synthetic workload)
  ■ To profile for sensitivity, measure P_i^x on each architecture
  ■ To profile for contentiousness, collect a set of vectors {V_i^x} (see the profiling sketch below)
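Below is a minimal, hypothetical sketch (in Python) of this offline profiling loop. The helpers run_synthetic_competitor, read_pcm_vector, and measure_throughput are placeholders standing in for the real measurement machinery; they are not part of SLOMO's tooling.

```python
import random
from typing import Dict, List, Tuple

def run_synthetic_competitor(cfg: Dict) -> None:
    pass  # placeholder: launch the synthetic workload with the given intensity knobs

def read_pcm_vector() -> List[float]:
    return [random.random() for _ in range(15)]  # placeholder: ~15 selected PCM metrics (V^x)

def measure_throughput(nf: str) -> float:
    return random.uniform(1.0, 10.0)  # placeholder: target NF throughput P_i^x (Mpps)

def profile_offline(nf_i: str, synthetic_configs: List[Dict]) -> List[Tuple[List[float], float]]:
    """Collect the training set {(V^x, P_i^x)} for one NF on one architecture."""
    samples = []
    for cfg in synthetic_configs:              # tunable synthetic configurations
        run_synthetic_competitor(cfg)
        samples.append((read_pcm_vector(), measure_throughput(nf_i)))
    return samples
```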

Slide 22

Slide 22 text

Proposed Method - SLOMO Introduction: Offline Profiling
■ These profiling datasets are specific to:
  ■ The particular NF type
  ■ Its configuration
  ■ Its traffic workload
  ■ The server architecture
■ In practice, a cluster may use only one or a small number of server architectures (which do not change frequently)
■ However, after deployment an NF's ruleset or its traffic workload might change

Slide 23

Slide 23 text

Proposed Method - SLOMO Introduction: Online Predictions
■ At runtime, the operator uses the pre-computed V_i's and M_i's for prediction
■ 2 NFs (NF_A, NF_B):
  ■ To predict NF_A's throughput, feed the contentiousness vector V_B into the sensitivity model M_A to produce P_A^B
■ 3 NFs (NF_A, NF_B, NF_C):
  ■ Use the composition function CF: (V_B, V_C) → V_B,C to compute V_B,C offline
  ■ Then apply the 2-NF procedure above (see the sketch below)
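A minimal sketch of this online prediction step, assuming a sensitivity model with an sklearn-style predict() and a placeholder composition function that sums the competitors' vectors (the sum/average rule is discussed on the composition slide):

```python
from typing import List
import numpy as np

def compose(vectors: List[np.ndarray]) -> np.ndarray:
    # Placeholder CF: element-wise sum of the competitors' contentiousness vectors
    return np.sum(vectors, axis=0)

def predict_throughput(sensitivity_model, competitor_vectors: List[np.ndarray]) -> float:
    """Predict the target NF's throughput from its competitors' contentiousness."""
    if len(competitor_vectors) == 1:
        v = competitor_vectors[0]            # 2-NF case: use V_B directly
    else:
        v = compose(competitor_vectors)      # 3+ NF case: V_{B,C} computed via CF
    return float(sensitivity_model.predict(v.reshape(1, -1))[0])   # P_A^{B,...}
```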

Slide 24

Slide 24 text

Proposed Method - SLOMO in Depth
■ Three key components: contentiousness characterization, sensitivity modeling, contentiousness composition
■ SLOMO takes a data-driven approach to designing these components:
  ■ Modeling sensitivity is a model-fitting process
  ■ Choosing contentiousness metrics is a feature-selection process
  ■ Composition is a simple regression-modeling problem

Slide 25

Slide 25 text

Proposed Method - SLOMO in Depth: Candidate Contentiousness Metrics
■ Candidate contentiousness metrics are chosen from the Intel PCM framework
■ A natural limitation of SLOMO:
  ■ It is limited to the pool of metrics exposed by PCM
  ■ PCM does not provide visibility into the internals of a NIC
  ■ Any congestion inside the NIC (e.g., queue occupancy) is not taken into consideration

Slide 26

Slide 26 text

Proposed Method - SLOMO in Depth: Synthetic Competition (only used during offline profiling)
■ Exercise the effects of contention on each NF with a synthetic workload of tunable intensity
■ Sample the space of possible contentiousness values an NF could generate
■ The result is a dataset of contentiousness vectors
■ Experiment: a Click-based NF applies incremental pressure to
  1. The I/O datapath: through the number of allocated packet buffers
  2. The packet-processing datapath: by performing a configurable number of memory operations

Slide 27

Slide 27 text

Proposed Method - SLOMO in Depth: Synthetic Competition (Cont.) (only used during offline profiling)
■ Exercise these configurations for various traffic patterns
  ■ Rate, packet sizes, and flow counts
■ Exercise these configurations for various numbers of co-running instances
■ SLOMO profiles each NF with more than 1000 different configurations to obtain:
  ■ PCM values when the synthetic workload and the NF under test each run solo
  ■ PCM values when the synthetic workload co-runs with the NF under test
  ■ Performance of the target NF when running with the synthetic competitor

Slide 28

Slide 28 text

Proposed Method - SLOMO in Depth: Contentiousness Metrics Selection
■ To prevent unrelated PCM metrics from hurting the model accuracy
■ Use the Pearson correlation coefficient to analyze the statistical dependency
  ■ PCM metrics vs. the observed performance of each NF (see the sketch below)
  ■ r = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1..n} (x_i − x̄)² · Σ_{i=1..n} (y_i − ȳ)² )
■ Use a model-free (reinforcement learning) technique to train the sensitivity model
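A minimal sketch of this correlation-based metric selection, assuming a profiling matrix `pcm` (one column per PCM metric) and a performance vector `perf`; the 0.3 threshold is illustrative, not the paper's:

```python
from typing import List, Tuple
import numpy as np
from scipy.stats import pearsonr

def select_metrics(pcm: np.ndarray, perf: np.ndarray,
                   metric_names: List[str], thresh: float = 0.3) -> List[Tuple[str, float]]:
    """Keep PCM metrics whose |Pearson r| with observed performance exceeds a threshold."""
    selected = []
    for j, name in enumerate(metric_names):
        r, _p = pearsonr(pcm[:, j], perf)    # correlation of metric j with throughput
        if abs(r) >= thresh:
            selected.append((name, r))
    return sorted(selected, key=lambda t: -abs(t[1]))
```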

Slide 29

Slide 29 text

Proposed Method - SLOMO in Depth: Contentiousness Metrics Selection (Cont.)
■ PCM metrics at CPU-socket and system-level granularities adequately capture aggregate contentiousness
■ Core-level metrics are not needed, since each NF is isolated on a dedicated core

Slide 30

Slide 30 text

Proposed Method - SLOMO in Depth: Contentiousness Metrics Selection (Cont.)
■ Different sources of contention are best captured by different metrics, and NFs can depend on multiple contention sources
  ■ DDIO contention: best captured by memory bandwidth utilization metrics (packet buffer evictions by the DMA engine are not captured by LLC metrics)
  ■ LLC contention: best captured by LLC-related metrics
  ■ Memory bandwidth contention: memory bandwidth utilization metrics
■ About 15 important metrics can be used

Slide 31

Slide 31 text

Proposed Method - SLOMO in Depth: Modeling Sensitivity
■ Sensitivity modeling can be viewed as a regression problem because:
  ■ The input (contentiousness of the competition) and the output (target NF performance) are both continuous variables
■ Each NF needs its own model since:
  ■ Different NFs respond differently to the various sources of contention
■ Training: use the synthetic, NF-specific contentiousness data described earlier
■ Run time: replace the synthetic inputs with the aggregate contentiousness of the real competitors
■ Testing: for each NF and architecture, generate a dataset of real experiments where each target NF is co-run with various combinations of NFs

Slide 32

Slide 32 text

Proposed Method - SLOMO in Depth: Sensitivity Can Be a Complex Function That Cannot Be Captured by Simple Regression Models
■ Sensitivity is a non-linear and discontinuous function of a multivariate input
■ It cannot be accurately modeled with:
  ■ Regression (linear, polynomial)
  ■ Decision trees
  ■ Simple neural networks
■ Nonetheless, a common pattern detected across sensitivity functions is phase transitions

Slide 33

Slide 33 text

Proposed Method - SLOMO in Depth: Sensitivity Can Be a Complex Function That Cannot Be Captured by Simple Regression Models (Cont.)
■ Phase transitions
  ■ When LLC occupancy exceeds the LLC size, the cache miss rate changes sharply

Slide 34

Slide 34 text

Proposed Method - SLOMO in Depth: Sensitivity Can Be Modeled as a Piecewise Function of Its Input
■ Model the different sub-spaces of sensitivity separately
■ Then combine the resulting models into a larger, comprehensive one
  ■ A machine-learning technique known as ensemble modeling
■ This paper uses Gradient Boosting Regression (see the sketch below)
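A minimal sketch of fitting such a sensitivity model with scikit-learn's GradientBoostingRegressor, using the (V^x, P_i^x) pairs from the offline profiling sketch; the hyperparameters here are illustrative, not the paper's:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_sensitivity_model(samples):
    """samples: list of (contentiousness_vector, throughput) pairs from offline profiling."""
    X = np.array([v for v, _ in samples])    # selected PCM metrics per run
    y = np.array([p for _, p in samples])    # measured throughput per run
    model = GradientBoostingRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(X, y)
    return model                             # M_i, usable with predict_throughput() above
```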

Slide 35

Slide 35 text

Proposed Method - SLOMO in Depth: Measuring Contentiousness
■ Measuring NF_i's contentiousness:
  ■ Measure the PCM metrics for NF_i while it runs alone on the server → inaccurate!
  ■ Instead, measure contentiousness for NF_i while it runs against the various synthetic competitors
  ■ Each time NF_i is subjected to a unique V^x, so we obtain a set of vectors V_i^x
  ■ Group the V_i^x by the number of co-runners (utilized cores)
  ■ Then take the average of each group, which yields the vector used in the 3-NF condition

Slide 36

Slide 36 text

Proposed Method - SLOMO in Depth: Measuring Contentiousness (Cont.)
■ Composition:
  ■ The aggregate contentiousness metrics we wish to estimate are, by definition, the sum or average of the constituent per-core metrics
  ■ E.g., the CAR of a CPU socket is the sum of each core's CAR in that socket (see the sketch below)
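A minimal sketch of this composition rule; which metric is summed and which is averaged below is an assumption for illustration only:

```python
from typing import Dict, List
import numpy as np

SUM_METRICS = {"CAR", "MEM_BW", "LLC_OCCUPANCY"}   # assumed additive metrics
AVG_METRICS = {"CPU_FREQ"}                         # assumed averaged metrics

def compose_contentiousness(per_core_vectors: List[List[float]],
                            metric_names: List[str]) -> Dict[str, float]:
    """Aggregate per-core contentiousness vectors into one socket-level vector."""
    stacked = np.array(per_core_vectors)           # shape: (n_cores, n_metrics)
    composed = {}
    for j, name in enumerate(metric_names):
        col = stacked[:, j]
        composed[name] = float(col.sum() if name in SUM_METRICS else col.mean())
    return composed
```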

Slide 37

Slide 37 text

Proposed Method - Extrapolating Sensitivity
■ NF_i might experience modifications during its lifecycle (e.g., migration across servers, changes in configuration or traffic)
■ Such a change turns NF_i into NF'_i
■ SLOMO can extrapolate a quick yet accurate performance prediction for NF'_i
  ■ Without triggering a slow offline profiling operation
  ■ By leveraging the existing profiles of NF_i

Slide 38

Slide 38 text

Proposed Method - Extrapolating Sensitivity: A Change in an NF's Traffic or Configuration Changes Its Reliance on Shared Memory (Sensitivity)
■ Example: an NF is heavily sensitive to the number of unique traffic flows it receives (memory contention)
■ When the number of flows decreases, sensitivity also decreases

Slide 39

Slide 39 text

Proposed Method - Extrapolating Sensitivity: Scope of Extrapolation
■ The extrapolation heuristic is based on the assumption that the change to NF_i is small
  ■ Thus, there is overlap between the sensitivity profiles of NF_i and NF'_i
■ If the configuration or traffic profiles differ significantly (e.g., a firewall with 1 vs. 10k rules)
  ■ There is little to no overlap between the respective sensitivity profiles

Slide 40

Slide 40 text

Evaluation - Comparison
■ SLOMO is accurate, with a mean prediction error of 5.4%
  ■ Reducing Dobrescu's 12.72% error by 58%
  ■ Reducing BubbleUp's 15.2% average error by 64%
■ SLOMO's predictions are robust across operating conditions
■ The design decisions behind each of SLOMO's components contribute to improved accuracy
■ SLOMO is efficient and enables smart scheduling decisions in an NFV cluster
■ SLOMO is extensible, allowing accurate extrapolation of the sensitivity function of new NF instances to account for changes in an NF's traffic profile or configuration

Slide 41

Slide 41 text

Evaluation - Accuracy
■ SLOMO's average prediction error and error variance are the lowest among the compared methods
■ Some cases are very close (VPN, stateless firewall, Maglev)
  ■ Because these NFs are not I/O bound

Slide 42

Slide 42 text

Evaluation - Robustness
■ The absolute error increases with the number of competing NFs
  ■ Due to the additive, composition-related error factor

Slide 43

Slide 43 text

Evaluation - Robustness
■ Prediction error doesn't change with packet size

Slide 44

Slide 44 text

Evaluation - Factor Analysis
■ Top 3 metrics for a collection of NF instances
■ 2 architectures: Broadwell, Skylake
■ Skylake shows more "MEM WRITE"
  ■ Smaller LLC (11 MB vs. Broadwell's 20 MB)
  ■ No write-back cache policy

Slide 45

Slide 45 text

Evaluation - Factor Analysis
■ Comparison with other techniques

Slide 46

Slide 46 text

Use Case - Scheduling for a Cluster
■ The operator's goal is to maximize resource utilization while maintaining SLAs
■ If there is no feasible schedule, the operator provisions an additional server
■ The authors exhaustively run all possible combinations
■ Resource overhead: how many additional machines are needed with respect to the optimal schedule (see the sketch below)
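A minimal, hypothetical sketch of SLOMO-guided scheduling: greedy first-fit placement that keeps every co-resident NF above its SLA, using predict_throughput from the earlier sketch; the SLA inputs and the greedy policy are illustrative, not the paper's exhaustive search:

```python
from typing import Dict, List

def schedule(nfs: List[str], sla: Dict[str, float], vectors: Dict[str, "np.ndarray"],
             models: Dict[str, object]) -> List[List[str]]:
    """Place each NF on the first server where every co-resident NF still meets its SLA."""
    servers: List[List[str]] = []
    for nf in nfs:
        placed = False
        for srv in servers:
            candidate = srv + [nf]
            # Re-check the predicted throughput of every NF on the server under the new mix
            ok = all(
                predict_throughput(models[t], [vectors[c] for c in candidate if c != t]) >= sla[t]
                for t in candidate
            )
            if ok:
                srv.append(nf)
                placed = True
                break
        if not placed:
            servers.append([nf])    # no feasible placement: provision an additional server
    return servers
```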

Slide 47

Slide 47 text

Conclusion
■ Goal
  ■ Predict the contention between co-located NFs
  ■ Help provisioning and placement decisions in NFV orchestration frameworks
■ Method
  ■ A data-driven design for SLOMO
  ■ Takes multiple variables into consideration
  ■ Uses machine learning to build the models
■ Result
  ■ Prediction error is much lower than in previous works

Slide 48

Slide 48 text

Pros & Cons
■ Pros
  ■ Comprehensive analysis of contention in the memory subsystem
  ■ A partial implementation is open source
■ Cons
  ■ Few details on the training and prediction process
  ■ Some statements lack supporting detail
  ■ NIC-internal contention is not considered