Contention-Aware Performance Prediction For Virtualized Network Functions

JackKuo
December 03, 2020
Group meeting presentation of CANLAB at NTHU


Transcript

  1. Communications and Networking Lab, NTHU. Contention-Aware Performance Prediction For Virtualized Network Functions. Antonis Manousis, Rahul Anand Sharma, Vyas Sekar, and Justine Sherry. SIGCOMM 2020. Speaker: Chun-Fu Kuo. Date: 2020.12.03
  2. Communications and Networking Lab, NTHU ▪ Introduction ▪ Problem Formulation ▪ Environment ▪ System Model ▪ Proposed Method ▪ Evaluation ▪ Use Case ▪ Conclusion ▪ Pros and Cons (Slide: Outline)
  3. Communications and Networking Lab, NTHU ▪ Write-back: the CPU writes data only to the cache and marks the line as dirty; when the line is evicted, the data is written back to memory ▪ Write-through: the CPU writes data to both the cache and memory ▪ Post-write: the CPU writes data to both the cache and a write buffer, and the buffer is written back to memory at an appropriate time (Slide: Introduction - Cache Writing Policy)
  4. Communications and Networking Lab, NTHU ▪ A CPU, memory, and PCIe resource monitor ▪ CPU frequency ▪ Instructions per cycle ▪ Cache status ▪ Memory bandwidth ▪ PCIe bandwidth ▪ Power consumption ▪ Launched by Intel; now renamed to Processor Counter Monitor and maintained by the community (Slide: Introduction - Intel Performance Counter Monitor (PCM))
  5. Communications and Networking Lab, NTHU ▪ A performance-isolation technique for the LLC ▪ Why: if a noisy neighbor process occupies a large share of the cache, co-located processes can suffer from cache misses ▪ This matters for cloud computing tenants (Slide: Introduction - Intel Cache Allocation Technology (CAT))
  6. Communications and Networking Lab, NTHU ▪ Allows the NIC to place packet data directly into the cache (LLC, L2) ▪ Bypasses slow main memory (Slide: Introduction - Intel Data Direct I/O Technology)
  7. Communications and Networking Lab, NTHU ▪ Co-resident NFs can interfere with each other ▪ Because they share hardware resources, primarily in the memory subsystem (Slide: Problem Formulation - Contention-Induced Throughput Drop)
  8. Communications and Networking Lab, NTHU ▪ Input NFs: S = {NF_i} ▪ Target NF: NF_target ∈ S, the NF whose performance drop we would like to estimate ▪ Competing workload: Comp_j = S \ {NF_target}, the set of NFs the target may be co-located with ▪ Hardware configuration: Arch_k ▪ NFs exhibit extreme interactions with the cache hierarchy ▪ High reuse of data structures (e.g., rules, routing tables) ▪ Low reuse of packet data ▪ The prediction goal is formalized in the sketch below (Slide: Problem Formulation - Prediction)
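A minimal LaTeX formalization of the prediction goal implied by this slide; the predictor f is a placeholder for SLOMO's learned models, not notation taken from the paper:

    % Estimate the target NF's performance under a competing workload on a given architecture
    P_{\mathrm{target}} \approx f\bigl(\mathrm{NF}_{\mathrm{target}},\, \mathrm{Comp}_j,\, \mathrm{Arch}_k\bigr),
    \qquad \mathrm{Comp}_j = S \setminus \{\mathrm{NF}_{\mathrm{target}}\},\quad \mathrm{NF}_{\mathrm{target}} \in S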
  9. Communications and Networking Lab, NTHU ▪ Dobrescu's method ▪ Memory contention is the key source of throughput drop ▪ Models memory as a monolithic resource ▪ Single metric: cache access rate (CAR) ▪ BubbleUp ▪ Models memory as a monolithic resource ▪ Single metric: working-set size of the competing workloads (cache occupancy) (Slide: Problem Formulation - Existing Approaches, Performance Prediction)
  10. Communications and Networking Lab, NTHU ▪ ResQ ▪ Argues in favor of isolating shared resources ▪ To prevent contention-induced performance degradation ▪ Leverages the CAT technique and DPDK packet-buffer sizing ▪ Provides dedicated, non-overlapping LLC partitions (Slide: Problem Formulation - Existing Approaches, Performance Isolation)
  11. Communications and Networking Lab, NTHU ▪ An incomplete solution for contention-induced slowdown ▪ Partitioning tools fail to isolate all sources of contention ▪ Also, isolation leads to inefficient resource utilization (Slide: Problem Formulation - Performance Isolation Problem)
  12. Communications and Networking Lab, NTHU ▪ Although previous work identified the memory subsystem as the source of contention ▪ The authors found that contention is multifaceted (Slide: Problem Formulation - Sources of Contention)
  13. Communications and Networking Lab, NTHU ▪ Contention in the LLC ▪ Contention that compromises fast access to the auxiliary data structures holding the data needed for packet processing ▪ Contention for DDIO ▪ Contention that slows down packets on the direct path between the NICs and the LLC ▪ Contention for main memory bandwidth ▪ Contention that increases the latency to service an LLC miss from main memory (Slide: Problem Formulation - Sources of Contention)
  14. Communications and Networking Lab, NTHU ▪ Intel Xeon E5-2620 v4 (Broadwell) ▪ 2 × Intel XL710 40 Gbps NICs ▪ Intel Xeon Silver 4110 (Skylake) ▪ 2 × Mellanox MT27700 100 Gbps NICs ▪ SR-IOV is used to share the NICs with the NFs (Slide: Environment)
  15. Communications and Networking Lab, NTHU ▪ No DDIO in this test; competing NFs use separate memory channels ▪ The red line marks the exhaustion of the available LLC space ▪ Before the red line: cache occupancy is the best predictor of performance ▪ After the red line: cache access rate (CAR) is the best predictor of performance (Slide: Problem Formulation - LLC Contention Depends on 1. Cache Occupancy 2. Cache Access Rate)
  16. Communications and Networking Lab, NTHU ▪ DDIO partitions the LLC into a primary cache and an I/O cache ▪ Contention can occur when the total number of packets exceeds the space available in the I/O cache (even though the LLC remains underloaded) (Slide: Problem Formulation - DDIO Contention Depends on 1. Competitors' Space Utilization 2. Access Rate)
  17. Communications and Networking Lab, NTHU ▪ The CAT technique is used to isolate the LLC ▪ The cache miss rate stays stable, yet throughput still drops (Slide: Problem Formulation - Main Memory Latency Depends on Total Memory Bandwidth Consumption)
  18. Communications and Networking Lab, NTHU ▪ Two logical parts ▪ Offline: characterize the contentiousness and model the sensitivity of an NF instance ▪ Online: make performance predictions for an NF instance against its mix of real competitors (Slide: Proposed Method - SLOMO)
  19. Communications and Networking Lab, NTHU ▪ S = {NF_i, ...}: set of NFs ▪ Arch_k: server architecture ▪ (NF_i, Arch_k): contentiousness tuple, i.e., the NF/architecture pair being profiled ▪ V^x: synthetic contentiousness vector ▪ P_i^x: performance of NF_i in response to synthetic contentiousness vector V^x ▪ M_i : V → P: sensitivity model, trained on {(V^x, P_i^x), ...} ▪ The operator runs each NF_i against multiple configurations of a tunable synthetic workload (sketched below) ▪ To profile sensitivity, the authors measure P_i^x on each architecture ▪ To profile contentiousness, the operator collects a set of vectors {V_i^x} (Slide: Proposed Method - SLOMO Introduction, Offline Profiling)
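A minimal Python sketch of this offline profiling loop, assuming injected helper callables (run_competitor, pcm_vector, throughput) that stand in for the authors' actual tooling:

    def profile_offline(synthetic_configs, run_competitor, pcm_vector, throughput):
        """Collect {(V^x, P_i^x)} and {V_i^x} samples for one NF on one architecture."""
        sensitivity_samples = []      # (V^x, P_i^x) pairs used to train the sensitivity model M_i
        contentiousness_samples = []  # V_i^x vectors describing the NF's own pressure
        for cfg in synthetic_configs:
            competitor = run_competitor(cfg)       # tunable synthetic workload (e.g., packet buffers, memory ops)
            v_x = pcm_vector(scope="competitor")   # competitor's contentiousness vector
            p_x = throughput()                     # measured performance of the NF under test
            v_i = pcm_vector(scope="target")       # the NF's own contentiousness under this competition
            sensitivity_samples.append((v_x, p_x))
            contentiousness_samples.append(v_i)
            competitor.stop()
        return sensitivity_samples, contentiousness_samples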
  20. Communications and Networking Lab, NTHU ▪ These profiling datasets are specific to a particular NF type, configuration, traffic workload, and server architecture ▪ In practice, a cluster may use only one or a small number of server architectures (and they do not change frequently) ▪ But it is possible that after deployment, an NF's ruleset or its traffic workload will change (Slide: Proposed Method - SLOMO Introduction, Offline Profiling)
  21. Communications and Networking Lab, NTHU ▪ At runtime, the operator uses the pre-computed V_i's and M_i's for prediction ▪ 2 NFs (NF_A, NF_B) ▪ To predict NF_A's throughput: feed contentiousness vector V_B into sensitivity model M_A to produce P_A^B ▪ 3 NFs (NF_A, NF_B, NF_C) ▪ Use a composition function CF : V_B, V_C → V_{B,C} to compute V_{B,C} offline ▪ Then apply the aforementioned "2 NFs" procedure (sketched below) (Slide: Proposed Method - SLOMO Introduction, Online Predictions)
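A minimal sketch of this online step, assuming model_a is a pre-trained, scikit-learn-style sensitivity model M_A; the composition is shown as an elementwise sum purely for illustration (the real composition is metric-specific, as a later slide explains):

    import numpy as np

    def compose(v_b: np.ndarray, v_c: np.ndarray) -> np.ndarray:
        """CF: combine two competitors' contentiousness vectors (illustrative elementwise sum)."""
        return v_b + v_c

    def predict_throughput(model_a, competitor_vectors):
        """Predict NF_A's throughput against one or more competitors."""
        v = competitor_vectors[0]
        for other in competitor_vectors[1:]:
            v = compose(v, other)                    # 3+ NFs: compose the competitors first
        return model_a.predict(v.reshape(1, -1))[0]  # 2-NF case: feed V_B into M_A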
  22. Communications and Networking Lab, NTHU ▪ Three key components: contentiousness characterization, sensitivity modeling, contentiousness composition ▪ A data-driven approach is taken to design these components ▪ Modeling sensitivity is a model-fitting process ▪ Choosing contentiousness metrics is a feature-selection process ▪ Composition is a simple regression-modeling problem (Slide: Proposed Method - SLOMO in Depth)
  23. Communications and Networking Lab, NTHU ▪ Candidate contentiousness metrics are chosen from the Intel PCM framework ▪ A natural limitation of SLOMO: it is restricted to the pool of metrics exposed by PCM ▪ PCM does not provide visibility into the internals of a NIC ▪ Any congestion at the NIC (e.g., queue occupancy) is therefore not taken into consideration (Slide: Proposed Method - SLOMO in Depth, Candidate Contentiousness Metrics)
  24. Communications and Networking Lab, NTHU ▪ Exercise the effects of contention on each NF with a synthetic workload of tunable intensity ▪ Sample the space of possible contentiousness values an NF could generate ▪ This yields the contentiousness vector dataset ▪ Experiment: a Click-based NF applies incremental pressure to 1. the I/O datapath, through the number of allocated packet buffers 2. the packet-processing datapath, by performing a configurable number of memory operations ▪ Only used during offline profiling (Slide: Proposed Method - SLOMO in Depth, Synthetic Competition)
  25. Communications and Networking Lab, NTHU ▪ Exercise these configurations for various traffic patterns ▪ Rate, packet sizes, and flow counts ▪ Exercise these configurations for various numbers of co-running instances ▪ SLOMO profiles each NF with more than 1000 different configurations to obtain: ▪ PCM values when the synthetic workload and the NF under test each run solo ▪ PCM values when the synthetic workload co-runs with the NF under test ▪ Performance of the target NF when running with the synthetic competitor ▪ Only used during offline profiling (Slide: Proposed Method - SLOMO in Depth, Synthetic Competition (Cont.))
  26. Communications and Networking Lab, NTHU ▪ To prevent unrelated PCM metrics from hurting model accuracy ▪ The Pearson correlation coefficient is used to analyze the statistical dependency between the PCM metrics and the observed performance of each NF (sketched below) ▪ A model-free (reinforcement-learning style) technique is used to train the sensitivity model ▪ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} (Slide: Proposed Method - SLOMO in Depth, Contentiousness Metric Selection)
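A small Python sketch of this correlation-based metric selection; the metric names and the top-k cutoff are illustrative, not the paper's exact choices:

    import numpy as np

    def select_metrics(pcm_samples: np.ndarray, perf: np.ndarray,
                       names: list[str], top_k: int = 15) -> list[str]:
        """Rank candidate PCM metrics by |Pearson r| against observed NF performance.

        pcm_samples: (n_runs, n_metrics) matrix; perf: (n_runs,) throughput values."""
        scores = []
        for j, name in enumerate(names):
            r = np.corrcoef(pcm_samples[:, j], perf)[0, 1]
            if np.isnan(r):   # constant metric carries no information
                r = 0.0
            scores.append((abs(r), name))
        scores.sort(reverse=True)
        return [name for _, name in scores[:top_k]]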
  27. Communications and Networking Lab, NTHU ▪ PCM metrics at CPU-socket and system-level granularities adequately capture aggregate contentiousness ▪ Core-level metrics are not used, since each NF is isolated on a dedicated core (Slide: Proposed Method - SLOMO in Depth, Contentiousness Metric Selection (Cont.))
  28. Communications and Networking Lab, NTHU ▪ Different sources of contention are best captured by different metrics ▪ NFs can depend on multiple contention sources ▪ DDIO contention: best captured through memory-bandwidth utilization metrics, since packet-buffer evictions by the DMA engine are not captured by LLC metrics ▪ LLC contention: best captured through LLC-related metrics ▪ Memory-bandwidth contention: memory-bandwidth utilization metrics ▪ About 15 important metrics end up being used (Slide: Proposed Method - SLOMO in Depth, Contentiousness Metric Selection (Cont.))
  29. Communications and Networking Lab, NTHU ▪ Sensitivity modeling can be viewed as a regression problem, because its input (the contentiousness of the competition) and its output (target NF performance) are both continuous variables ▪ Each NF needs its own model, since different NFs respond differently to the various sources of contention ▪ Training: use the synthetic, NF-specific contentiousness described earlier ▪ Run time: replace the synthetic inputs with the aggregate contentiousness of the real competitors ▪ Testing: generate, for each NF and architecture, a dataset of real experiments in which each target NF is co-run with various combinations of NFs (Slide: Proposed Method - SLOMO in Depth, Modeling Sensitivity)
  30. Communications and Networking Lab, NTHU ▪ Sensitivity is a non-linear and discontinuous function of a multivariate input ▪ It cannot be accurately modeled with: ▪ Regression (linear, polynomial) ▪ Decision trees ▪ Simple neural networks ▪ Nonetheless, a common pattern detected across sensitivity functions is phase transitions (Slide: Proposed Method - SLOMO in Depth, Sensitivity Can Be a Complex Function That Simple Regression Models Cannot Capture)
  31. Communications and Networking Lab, NTHU ▪ Phase transitions ▪ When LLC occupancy exceeds the LLC size, the cache miss rate rises sharply (Slide: Proposed Method - SLOMO in Depth, Sensitivity Can Be a Complex Function That Simple Regression Models Cannot Capture (Cont.))
  32. Communications and Networking Lab, NTHU ▪ Model the different sub-spaces of sensitivity separately ▪ Then combine the resulting models into a larger, comprehensive one ▪ This is a machine-learning technique called ensemble modeling ▪ This paper uses Gradient Boosting Regression (sketched below) (Slide: Proposed Method - SLOMO in Depth, Sensitivity Can Be Modeled as a Piecewise Function of Its Input)
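A minimal sketch of fitting a sensitivity model M_i with gradient boosting, which the slide names as the chosen technique; the hyperparameters are illustrative defaults, not the authors' settings:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def fit_sensitivity_model(X: np.ndarray, y: np.ndarray) -> GradientBoostingRegressor:
        """X: competitor contentiousness vectors (V^x); y: measured throughput (P_i^x)."""
        model = GradientBoostingRegressor(
            n_estimators=200,   # many shallow trees combined into an ensemble
            max_depth=3,        # shallow trees capture local, piecewise behavior
            learning_rate=0.05,
        )
        model.fit(X, y)
        return model

    # Usage: m_i = fit_sensitivity_model(V_samples, P_samples)
    #        predicted = m_i.predict(v_runtime.reshape(1, -1))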
  33. Communications and Networking Lab, NTHU ▪ Measuring NF_i's contentiousness ▪ Measure the PCM metrics for NF_i while it runs alone on the server (slide annotation: inaccurate!) ▪ Measure NF_i's contentiousness while it runs against the various synthetic competitors NF_x ▪ Each time, NF_i is subjected to a unique V^x, so we obtain a set of vectors {V_i^x} ▪ Group the V_i^x by the number of co-runners (utilized cores) ▪ Then take the average of each group, which gives, e.g., V_B^C in the 3-NF condition (Slide: Proposed Method - SLOMO in Depth, Measuring Contentiousness)
  34. Communications and Networking Lab, NTHU ▪ Composition ▪ The aggregate contentiousness metrics we wish to estimate are, by definition, the sum or the average of the constituent per-core metrics ▪ E.g., the CAR of a CPU socket is the sum of the per-core CARs on that socket (sketched below) (Slide: Proposed Method - SLOMO in Depth, Measuring Contentiousness (Cont.))
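A minimal sketch of metric-wise composition: each aggregate metric is either summed or averaged across the constituent per-competitor vectors, as the slide states; which rule applies to which metric, and the metric names themselves, are illustrative assumptions here:

    import numpy as np

    RULES = {"CAR": "sum", "MEM_BW": "sum", "LLC_OCCUPANCY": "sum", "IPC": "mean"}

    def compose_vectors(vectors: list[dict], rules: dict = RULES) -> dict:
        """Combine per-competitor contentiousness dicts into one aggregate dict."""
        combined = {}
        for metric, rule in rules.items():
            values = [v[metric] for v in vectors if metric in v]
            if values:
                combined[metric] = float(np.sum(values) if rule == "sum" else np.mean(values))
        return combined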
  35. Communications and Networking Lab, NTHU ▪ NF_i might experience modifications during its lifecycle (e.g., migration across servers, changes in configuration or traffic) ▪ Such changes turn the NF into NF'_i ▪ SLOMO can extrapolate a quick yet accurate performance prediction for NF'_i ▪ Without triggering a slow offline profiling operation ▪ By leveraging the existing profiles of NF_i (Slide: Proposed Method - Extrapolating Sensitivity)
  36. Communications and Networking Lab, NTHU ▪ Example: an NF is heavily sensitive to the number of unique traffic flows it receives (memory contention) ▪ When the number of flows shrinks, the sensitivity shrinks as well (Slide: Proposed Method - Extrapolating Sensitivity, A Change in the NF's Traffic Configuration Changes Its Reliance on Shared Memory (Sensitivity))
  37. Communications and Networking Lab, NTHU ▪ The extrapolation heuristic assumes that the change to NF_i is small ▪ Thus, there is overlap between the sensitivity profiles of NF_i and NF'_i ▪ If the configurations or traffic profiles differ significantly (e.g., a firewall with 1 vs. 10k rules) ▪ There is little to no overlap between the respective sensitivity profiles ▪ One possible way to reuse an existing profile is sketched below (Slide: Proposed Method - Extrapolating Sensitivity, Scope of Extrapolation)
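The slides do not spell out the extrapolation procedure; as a purely hypothetical illustration of how an existing profile could be reused, one could refit the gradient-boosting model on the old NF_i samples plus a handful of fresh NF'_i measurements, up-weighting the fresh ones:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def extrapolate_model(old_X, old_y, new_X, new_y, new_weight: float = 5.0):
        """Refit on the existing profile of NF_i plus a few fresh NF'_i samples."""
        X = np.vstack([old_X, new_X])
        y = np.concatenate([old_y, new_y])
        w = np.concatenate([np.ones(len(old_y)), np.full(len(new_y), new_weight)])
        model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
        model.fit(X, y, sample_weight=w)   # emphasize the fresh measurements
        return model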
  38. Communications and Networking Lab, NTHU ▪ SLOMO is accurate, with a mean prediction error of 5.4% ▪ Reducing Dobrescu's 12.72% error by 58% ▪ Reducing BubbleUp's 15.2% average error by 64% ▪ SLOMO's predictions are robust across operating conditions ▪ The design decisions behind each of SLOMO's components contribute to improved accuracy ▪ SLOMO is efficient and enables smart scheduling decisions in an NFV cluster ▪ SLOMO is extensible, allowing accurate extrapolation of the sensitivity function of new NF instances to account for changes in an NF's traffic profile or configuration (Slide: Evaluation - Comparison)
  39. Communications and Networking Lab, NTHU ▪ SLOMO's average prediction error and error variance are the best ▪ Some cases are very close (VPN, stateless firewall, Maglev) ▪ Because those NFs are not I/O bound (Slide: Evaluation - Accuracy)
  40. Communications and Networking Lab, NTHU ▪ The absolute error follows an increasing trend as a function of the number of competing NFs ▪ Due to the additive, composition-related error factor (Slide: Evaluation - Robustness)
  41. Communications and Networking Lab, NTHU ▪ Top-3 metrics for a collection of NF instances ▪ Two architectures: Broadwell and Skylake ▪ Skylake relies more on "MEM WRITE" metrics ▪ Smaller LLC (11 MB vs. 20 MB on Broadwell) ▪ No write-back cache policy (Slide: Evaluation - Factor Analysis)
  42. Communications and Networking Lab, NTHU ▪ The operator's goal is to maximize resource utilization while maintaining SLAs ▪ If there is no feasible schedule, the operator provisions an additional server ▪ The authors exhaustively run all possible combinations ▪ Resource overhead: how many additional machines are needed with respect to the optimal schedule ▪ A simple contention-aware placement loop is sketched below (Slide: Use Case - Scheduling for a Cluster)
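A hypothetical sketch of a contention-aware greedy placement loop built on SLOMO-style predictions; predict_throughput, the solo_throughput attribute, and the SLA fraction are placeholders, and the paper's actual scheduling policy may differ:

    def schedule(nfs, predict_throughput, sla_fraction=0.95):
        """Place each NF on the first server where all co-residents still meet the SLA;
        otherwise provision an additional server."""
        servers = []  # each server is the list of NFs already placed on it
        for nf in nfs:
            placed = False
            for srv in servers:
                candidate = srv + [nf]
                meets_sla = all(
                    predict_throughput(target, [o for o in candidate if o is not target])
                    >= sla_fraction * target.solo_throughput
                    for target in candidate
                )
                if meets_sla:
                    srv.append(nf)
                    placed = True
                    break
            if not placed:
                servers.append([nf])  # no feasible placement, so provision a new server
        return servers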
  43. Communications and Networking Lab, NTHU ▪ Goal ▪ Predict the contention between co-located NFs ▪ Help provisioning and placement decisions in an NFV orchestration framework ▪ Method ▪ A data-driven design of SLOMO ▪ Takes multiple variables into consideration ▪ Uses machine learning to build the models ▪ Result ▪ The prediction error is much lower than in previous work (Slide: Conclusion)
  44. Communications and Networking Lab, NTHU ▪ Pros ▪ Comprehensive analysis of contention in the memory subsystem ▪ A partial implementation is open source ▪ Cons ▪ Few details about the training and prediction process ▪ Some statements lack supporting detail ▪ No consideration of the NIC (Slide: Pros & Cons)