writes data to the cache and marks it dirty ▪ When the cache line is evicted, the data is written back to memory ▪ Write-through: ▪ CPU writes data to both cache and memory ▪ Post-write: ▪ CPU writes data to both the cache and a write buffer, and the buffer is written back to memory at the appropriate time 4 Introduction Cache Writing Policy
resource monitor ▪ CPU frequency ▪ Instructions per cycle ▪ Cache status ▪ Memory bandwidth ▪ PCIe bandwidth ▪ Power consumption ▪ Launched by Intel; now renamed to Processor Counter Monitor and maintained by the community 5 Introduction Intel Performance Counter Monitor (PCM)
isolation technique for the LLC ▪ Why ▪ If there is a noisy-neighbor process (one that occupies lots of cache), co-located processes can suffer from cache misses ▪ This matters for cloud-computing tenants 6 Introduction Intel Cache Allocation Technology (CAT)
with each other ▪ Since they share hardware resources ▪ Primarily in the memory subsystem 8 Problem Formulation Problem Contention-induced throughput drop
NF: ▪ An NF, NF_target ∈ S = {NF_i}, whose performance drop we would like to estimate ▪ Competing workload: ▪ The set Comp_j = S \ {NF_target} of NFs the target may be co-located with ▪ Hardware configuration Arch_k: ▪ NFs exhibit extreme interactions with the cache hierarchy ▪ High reuse of data structures (e.g., rules, routing tables) ▪ Low reuse of packet data 9 Problem Formulation Problem - Prediction
contention is the key source of throughput drop ▪ Models memory as a monolithic resource ▪ Single metric: cache access rate (CAR) ▪ BubbleUp ▪ Models memory as a monolithic resource ▪ Single metric: working set size of competing workloads (cache occupancy) 10 Problem Formulation Existing Approaches - Performance Prediction
slowdown ▪ Since partitioning tools fail to isolate all sources of contention ▪ Also, isolation leads to inefficient resource utilization 12 Problem Formulation Performance Isolation Problem
Contention that compromises fast access to auxiliary data structures containing the data necessary for packet processing ▪ Contention for DDIO ▪ Contention that slows down packets on the direct path between the NICs and the LLC ▪ Contention for main memory bandwidth ▪ Contention that increases the latency of servicing an LLC miss from main memory 14 Problem Formulation Sources of Contention
test, competing NFs use separate memory channels ▪ Red line marks exhaustion of the available LLC space ▪ Before red line: occupancy is the best predictor of performance ▪ After red line: cache access rate (CAR) is the best predictor of performance 17 Problem Formulation LLC Contention Depends on 1. Cache Occupancy 2. Cache Access Rate
into a primary cache and an I/O cache ▪ Contention can occur when the total # of packets exceeds the amount of space in the I/O cache (even though the LLC remains underloaded) 18 Problem Formulation DDIO Contention Depends on 1. Competitors’ Space Utilization 2. Access Rate
isolate LLC ▪ Cache miss rate is stable, but throughput goes down 19 Problem Formulation Main Memory Latency depends on Total Memory Bandwidth Consumption
Offline: characterize the contentiousness & model sensitivity of NF instance ▪ Online: make performance prediction of NF instance & mix of real competitors 20 Proposed Method SLOMO
server architecture: Arch_k ▪ contentiousness tuple: (NF_i, Arch_k) ▪ synthetic contentiousness vector: V_x ▪ performance of NF_i in response to synthetic contentiousness vector V_x: P_x^i ▪ sensitivity model trained on {(V_x, P_x^i), ...}: M_i : V → P ▪ The operator runs each NF_i with multiple configurations (a tunable synthetic workload) ▪ To profile for sensitivity, the authors measure P_x^i on each architecture ▪ To profile for contentiousness, the operator collects a set of vectors {V_x^i} 21 Proposed Method - SLOMO Introduction Offline Profiling
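The offline profiling loop above can be sketched as follows. This is a toy sketch: `measure_pcm_vector`, `run_and_measure`, the `CAR`/`LLC_occ` metric names, and the linear throughput model are all invented for illustration; the real SLOMO profiler reads hardware counters via PCM.

```python
def measure_pcm_vector(cfg):
    # Stub standing in for PCM counters of a synthetic competitor config
    # (hypothetical formula relating knobs to metrics).
    return {"CAR": cfg["mem_ops"] * 0.5, "LLC_occ": cfg["buffers"] * 0.1}

def run_and_measure(solo_throughput, v_x):
    # Toy sensitivity: throughput falls linearly with the competitor's CAR.
    return max(0.0, solo_throughput - 0.2 * v_x["CAR"])

def offline_profile(solo_throughput, synthetic_configs):
    """Collect (V_x, P_x) pairs: the competitor's contentiousness vector
    and the target NF's measured performance against it."""
    dataset = []
    for cfg in synthetic_configs:
        v_x = measure_pcm_vector(cfg)
        p_x = run_and_measure(solo_throughput, v_x)
        dataset.append((v_x, p_x))
    return dataset  # training data for the sensitivity model M_i : V -> P

# Sweep the synthetic workload's tunable knobs (memory ops, packet buffers).
configs = [{"mem_ops": m, "buffers": b} for m in (1, 4, 8) for b in (16, 64)]
data = offline_profile(10.0, configs)
```

The dataset of (vector, performance) pairs is exactly what the model-fitting step below consumes.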
specific to a ▪ Particular NF type ▪ Configuration ▪ Traffic workload ▪ Server architecture ▪ In practice, a cluster may use only one or a small # of server architectures (which do not change frequently) ▪ But it is possible that after deployment, an NF’s ruleset or its traffic workload might change 22 Proposed Method - SLOMO Introduction Offline Profiling
uses the pre-computed V_i’s and M_i’s for prediction ▪ 2 NFs (NF_A, NF_B) ▪ To predict NF_A’s throughput: ▪ Feed contentiousness vector V_B into sensitivity model M_A to produce P_B^A ▪ 3 NFs (NF_A, NF_B, NF_C) ▪ Use the composition function CF : V_B, V_C → V_{B,C} (built offline) to compute V_{B,C} ▪ Then apply the aforementioned "2 NFs" procedure 23 Proposed Method - SLOMO Introduction Online Predictions
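As a rough illustration of the online step: the metric names, the linear model `m_a`, and the simple add/saturate composition rule below are all assumptions made for the sketch; SLOMO actually learns its composition function via regression.

```python
LLC_SIZE_MB = 32.0  # assumed cache size for this sketch

def compose(v_b, v_c):
    """Toy composition function CF: bandwidth-like metrics add up,
    occupancy-like metrics add but saturate at the LLC size."""
    return {
        "CAR": v_b["CAR"] + v_c["CAR"],
        "LLC_occ": min(LLC_SIZE_MB, v_b["LLC_occ"] + v_c["LLC_occ"]),
    }

def predict(model, competitor_vectors):
    """Compose all competitors' vectors, then query the sensitivity model."""
    v = competitor_vectors[0]
    for other in competitor_vectors[1:]:
        v = compose(v, other)
    return model(v)

# Toy sensitivity model M_A (would be learned offline in SLOMO).
m_a = lambda v: 10.0 - 0.1 * v["CAR"] - 0.05 * v["LLC_occ"]
v_b = {"CAR": 20.0, "LLC_occ": 8.0}
v_c = {"CAR": 10.0, "LLC_occ": 30.0}
p = predict(m_a, [v_b, v_c])  # predicted throughput of NF_A next to NF_B, NF_C
```

Note how the saturating `min` lets combined occupancy model the "cache is full" phase transition that a plain sum would miss.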
contentiousness characterization, sensitivity modeling, and contentiousness composition ▪ Takes a data-driven approach to design these components ▪ Modeling sensitivity is a model-fitting process ▪ Choosing contentiousness metrics is a feature-selection process ▪ Composition is a simple regression-modeling problem 24 Proposed Method SLOMO in Depth
from the Intel PCM framework ▪ Therefore, a natural limitation of SLOMO is: ▪ It is limited to the pool of metrics exposed by PCM ▪ PCM does not provide visibility into the internals of a NIC ▪ Any congestion in the NIC (e.g., queue occupancy) will not be taken into consideration 25 Proposed Method - SLOMO in Depth Candidate Contentiousness Metrics
contention on each NF with a synthetic workload of tunable intensity ▪ Sample the space of possible contentiousness values an NF could generate ▪ Finally, we obtain a contentiousness-vector dataset ▪ Experiment: ▪ Click-based NF with incremental pressure on 1. the I/O datapath: through the # of allocated packet buffers 2. the packet-processing datapath: by performing a configurable # of memory operations 26 Proposed Method - SLOMO in Depth Synthetic Competition (only used during offline profiling)
various traffic patterns ▪ Rate, packet sizes, and flow counts ▪ Exercise these configurations for various # of co-running instances ▪ SLOMO profiles each NF with more than 1,000 different configurations to obtain: ▪ PCM values when the synthetic workload & the NF under test run solo ▪ PCM values when the synthetic workload co-runs with the NF under test ▪ Performance of the target NF when running with the synthetic competitor 27 Synthetic Competition (Cont.) (only used during offline profiling) Proposed Method - SLOMO in Depth
metrics hurt the model accuracy ▪ Use the Pearson correlation coefficient to analyze the statistical dependency ▪ between PCM metrics and the observed performance of each NF ▪ Use a model-free (reinforcement-learning) technique to train the sensitivity model 28 Contentiousness Metrics Selection r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)² ) Proposed Method - SLOMO in Depth
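For reference, the Pearson coefficient can be computed directly. The CAR/throughput numbers below are made up to show a strongly negative correlation, i.e. a contentiousness metric that rises as performance falls:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

car = [1.0, 2.0, 3.0, 4.0]           # competitor cache access rate (toy data)
throughput = [10.0, 8.0, 6.0, 4.0]   # observed NF performance (toy data)
r = pearson_r(car, throughput)       # -1.0: perfectly anti-correlated
```

Metrics whose |r| against observed performance is high are the ones worth keeping as features.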
and system-level granularities adequately capture aggregate contentiousness ▪ Rather than core-level (since each NF is isolated on a dedicated core) 29 Contentiousness Metrics Selection (Cont.) Proposed Method - SLOMO in Depth
are best captured by different metrics ▪ As NFs can depend on multiple contention sources ▪ DDIO contention: best captured through memory-bandwidth utilization metrics ▪ Since packet-buffer evictions by the DMA engine are not captured by LLC metrics ▪ LLC contention: best captured through LLC-related metrics ▪ Memory bandwidth: memory-bandwidth utilization metrics ▪ About 15 important metrics are used 30 Contentiousness Metrics Selection (Cont.) Proposed Method - SLOMO in Depth
a regression problem because: ▪ Its input (the contentiousness of the competition) ▪ and output (target NF performance) are both continuous variables ▪ We need to model each NF because: ▪ Different NFs respond differently to the various sources of contention ▪ Training: use the synthetic, NF-specific contentiousness described above ▪ Run time: replace the synthetic inputs with the aggregate contentiousness of the real competitors ▪ Testing: generate, for each NF and architecture, a dataset of real experiments where each target NF is co-run with various combinations of NFs 31 Modeling Sensitivity Proposed Method - SLOMO in Depth
and non-continuous function (with multivariate input) ▪ Cannot be accurately modeled with: ▪ Regression (linear, polynomial) ▪ Decision trees ▪ Simple neural networks ▪ Nonetheless, a common pattern we detect across sensitivity functions is phase transitions 32 Sensitivity Can Be A Complex Function Cannot Be Captured By Simple Regression Models Proposed Method - SLOMO in Depth
When LLC occupancy > LLC size: the cache miss rate rises sharply 33 (Cont.) Sensitivity Can Be A Complex Function Cannot Be Captured By Simple Regression Models Proposed Method - SLOMO in Depth
of sensitivity separately ▪ Then combine the resulting models into a larger, comprehensive one ▪ A technique known in machine learning as ensemble modeling ▪ This paper uses Gradient Boosting Regression 34 Sensitivity Can Be Modeled As A Piecewise Function of Its Input Proposed Method - SLOMO in Depth
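A minimal from-scratch sketch of the idea: boosted one-dimensional decision stumps learning a step-shaped sensitivity curve. The data is synthetic, and the paper itself uses full Gradient Boosting Regression over multivariate contentiousness vectors, so this only illustrates why boosted trees handle phase transitions that linear regression cannot.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, minimizing squared error."""
    best = None
    order = sorted(set(xs))
    for a, b in zip(order, order[1:]):
        thr = (a + b) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x, thr=thr, lm=lm, rm=rm: lm if x <= thr else rm

def boost(xs, ys, rounds=60, lr=0.3):
    """Gradient boosting with stump learners under squared loss."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, preds)]  # negative gradient
        s = fit_stump(xs, resid)
        stumps.append(s)
        preds = [p + lr * s(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# A step-shaped sensitivity curve: throughput collapses past an occupancy knee.
occ = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
tput = [10.0, 10.0, 10.0, 10.0, 4.0, 4.0, 4.0, 4.0]
model = boost(occ, tput)
```

Each stump captures one phase transition; summing many of them yields the piecewise function described above.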
Measure the PCM vector V_i for NF_i while it runs alone on the server ▪ Measure contentiousness for NF_i while it runs against various synthetic competitors NF_x ▪ Each time NF_i is subjected to a unique V_x, so we obtain a set of vectors V_x^i ▪ Group the V_x^i by the number of co-runners (utilized cores) ▪ Then take the average of each group, which gives V_B^C in the 3-NF condition (inaccurate!) 35 Measuring Contentiousness Proposed Method - SLOMO in Depth
aggregate contentiousness metrics we wish to estimate are, by definition, the sum or average of the constituent per-core metrics ▪ E.g., ▪ The CAR of a CPU socket is the sum of each core’s CAR in that socket 36 Measuring Contentiousness (Cont.) Proposed Method - SLOMO in Depth
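This composition is straightforward to compute. The per-core numbers below are invented; which metric sums versus averages follows the rule stated above (rate-like metrics sum across cores, ratio-like metrics average):

```python
# Hypothetical per-core PCM samples for three co-running NFs on one socket.
per_core = {
    "CAR": [12.5, 8.0, 15.5],          # rate-style metric: sum across cores
    "LLC_hit_rate": [0.9, 0.7, 0.8],   # ratio-style metric: average across cores
}

aggregate = {
    "CAR": sum(per_core["CAR"]),
    "LLC_hit_rate": sum(per_core["LLC_hit_rate"]) / len(per_core["LLC_hit_rate"]),
}
# aggregate["CAR"] is 36.0; aggregate["LLC_hit_rate"] is about 0.8
```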
its lifecycle (e.g., migration across servers, changes in configuration/traffic) ▪ This changes the NF from NF_i to NF_i′ ▪ SLOMO can extrapolate a quick-yet-accurate performance prediction for NF_i′ ▪ Without triggering a slow offline profiling operation ▪ By leveraging the existing profiles of NF_i 37 Proposed Method - Extrapolating Sensitivity Measuring Contentiousness
sensitive to the number of unique traffic flows it receives (memory contention) ▪ When the number of flows decreases, sensitivity also decreases 38 Proposed Method - Extrapolating Sensitivity Change in NF’s Traffic Configuration Change the Reliance On Shared Memory (Sensitivity)
on the assumption that the change from NF_i to NF_i′ is small ▪ Thus, there is overlap between the sensitivity profiles of NF_i & NF_i′ ▪ If configurations or traffic profiles differ significantly (e.g., a firewall with 1 vs. 10k rules) ▪ There is little to no overlap between the respective sensitivity profiles 39 Proposed Method - Extrapolating Sensitivity Scope of Extrapolation
a mean prediction error of 5.4% ▪ Reducing Dobrescu’s 12.72% error by 58% ▪ Reducing BubbleUp’s 15.2% average error by 64% ▪ SLOMO’s predictions are robust across operating conditions ▪ The design decisions behind each of SLOMO’s components contribute to improved accuracy ▪ SLOMO is efficient and enables smart scheduling decisions in an NFV cluster ▪ SLOMO is extensible, allowing the accurate extrapolation of the sensitivity function of new NF instances to account for changes in an NF’s traffic profile or configuration 40 Evaluation Comparison
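Mean prediction error can be read as the average absolute error relative to the measured throughput. The exact metric definition is not given in these slides, and the numbers below are made up for illustration:

```python
def mean_prediction_error(actual, predicted):
    """Average of |actual - predicted| / actual, as a percentage."""
    errs = [abs(a - p) / a * 100.0 for a, p in zip(actual, predicted)]
    return sum(errs) / len(errs)

actual = [10.0, 8.0, 5.0, 4.0]      # measured throughputs (hypothetical)
predicted = [9.5, 8.4, 4.8, 4.2]    # model predictions (hypothetical)
err = mean_prediction_error(actual, predicted)  # about 4.75 (% error)
```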
to maximize resource utilization while maintaining SLAs ▪ If there is no feasible schedule ▪ The operator provisions an additional server ▪ The authors exhaustively run all possible combinations ▪ Resource overhead: how many additional machines are needed relative to the optimal 46 Use Case Scheduling for a Cluster
contention between co-located NFs ▪ Helps provisioning and placement decisions in NFV orchestration frameworks ▪ Method ▪ A data-driven approach to designing SLOMO ▪ Takes multiple variables into consideration ▪ Uses machine learning to build the model ▪ Result ▪ The prediction error is much lower than in previous works 47 Conclusion
with contention in the memory subsystem ▪ A partial implementation is open source ▪ Cons ▪ No details on the training and prediction processes ▪ Some statements lack supporting detail ▪ No consideration of the NIC 48 Pros & Cons