JackKuo
December 03, 2020

Contention-Aware Performance Prediction For Virtualized Network Functions

Group meeting presentation of CANLAB in NTHU


Transcript

  1. Communications and Networking Lab, NTHU

    Contention-Aware Performance Prediction For Virtualized Network Functions
    Antonis Manousis, Rahul Anand Sharma, Vyas Sekar, and Justine Sherry
    SIGCOMM 2020
    Speaker: Chun-Fu Kuo
    Date: 2020.12.03
  2. Outline

    ▪ Introduction
    ▪ Problem Formulation
    ▪ Environment
    ▪ System Model
    ▪ Proposed Method
    ▪ Evaluation
    ▪ Use case
    ▪ Conclusion
    ▪ Pros and Cons
  3. Introduction: Cache Writing Policy

    ▪ Write-back:
      ▪ The CPU writes data only to the cache and labels the line (as modified)
      ▪ When the cache evicts the line, the data is written back to memory
    ▪ Write-through:
      ▪ The CPU writes data to both the cache and memory
    ▪ Post-write:
      ▪ The CPU writes data to both the cache and a buffer, and the buffer is written back to memory at a suitable time
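
As a rough illustration of how the policies differ in memory traffic, the toy model below counts how many stores reach main memory under write-back and write-through. It is only a sketch: the fully associative LRU cache, the function name `count_memory_writes`, and the address trace are assumptions made for this example and do not come from the slides.

```python
from collections import OrderedDict

def count_memory_writes(addresses, capacity, policy):
    """Count writes that reach main memory for a stream of store addresses.

    Toy model (assumption): a fully associative LRU cache with `capacity`
    lines; every access is a store. `policy` is "write-back" or
    "write-through". Dirty lines still cached at the end are not flushed.
    """
    cache = OrderedDict()  # address -> dirty flag, least recently used first
    memory_writes = 0
    for addr in addresses:
        if addr in cache:
            cache.move_to_end(addr)               # refresh LRU position
        elif len(cache) >= capacity:
            _, dirty = cache.popitem(last=False)  # evict least recently used
            if dirty:
                memory_writes += 1                # write-back on eviction
        if policy == "write-through":
            cache[addr] = False                   # memory updated now, line stays clean
            memory_writes += 1
        else:
            cache[addr] = True                    # write-back: only mark the line dirty
    return memory_writes

stores = [0, 1, 0, 1, 0, 1, 2]
wb = count_memory_writes(stores, capacity=2, policy="write-back")
wt = count_memory_writes(stores, capacity=2, policy="write-through")
print(wb, wt)  # 1 7
```

With repeated stores to the same two lines, write-back touches memory only when a dirty line is evicted, while write-through pays for every store.
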
  4. Introduction: Intel Performance Counter Monitor (PCM)

    ▪ A resource monitor for CPU, memory, and PCIe that reports:
      ▪ CPU frequency
      ▪ Instructions per cycle
      ▪ Cache status
      ▪ Memory bandwidth
      ▪ PCIe bandwidth
      ▪ Power consumption
    ▪ Launched by Intel; now renamed Processor Counter Monitor and maintained by the community
  5. Introduction: Intel Cache Allocation Technology (CAT)

    ▪ A performance isolation technique for the LLC
    ▪ Why:
      ▪ If there is a noisy neighbor process (one that occupies a lot of cache), co-located processes can suffer from cache misses
      ▪ This matters for tenants in cloud computing
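
To make the idea of LLC partitioning concrete, here is a minimal sketch that splits cache ways among tenants. It assumes the usual capacity-bitmask style of interface, where each class of service receives a contiguous run of bits and one bit stands for one portion of the LLC. The helper `partition_ways` and the 11-way cache are hypothetical and chosen only for illustration.

```python
def partition_ways(total_ways, ways_per_tenant):
    """Return one contiguous, non-overlapping way bitmask per tenant.

    Assumption: each bit of a mask represents one cache way, and a tenant's
    mask must be a contiguous run of 1 bits.
    """
    if sum(ways_per_tenant) > total_ways:
        raise ValueError("requested more ways than the cache has")
    masks, offset = [], 0
    for ways in ways_per_tenant:
        if ways < 1:
            raise ValueError("each tenant needs at least one way")
        masks.append(((1 << ways) - 1) << offset)  # `ways` ones, shifted past earlier tenants
        offset += ways
    return masks

# Hypothetical 11-way LLC shared by three tenants.
masks = partition_ways(total_ways=11, ways_per_tenant=[6, 3, 2])
print([hex(m) for m in masks])  # ['0x3f', '0x1c0', '0x600']
```

Because the masks never share a bit, a noisy tenant can only evict lines from its own portion of the cache.
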
  6. Introduction: Intel Data Direct I/O Technology

    ▪ Allows the NIC to transfer data directly through the cache (LLC, L2)
    ▪ Bypasses the slower main memory
  7. Problem Formulation: Problem

    ▪ Contention-induced throughput drop
      ▪ Co-resident NFs can interfere with each other
      ▪ This is because they share hardware resources
      ▪ The interference occurs primarily in the memory subsystem
  8. Problem Formulation: Problem - Prediction

    ▪ Input NFs: $S = \{NF_i\}$
    ▪ Target NF: $NF_{target} \in S$
      ▪ The NF whose performance drop we would like to estimate
    ▪ Competing workload: $Comp_j = \{S \setminus NF_{target}\}$
      ▪ The set of NFs the target may be co-located with
    ▪ Hardware configuration: $Arch_k$
    ▪ NFs exhibit extreme interactions with the cache hierarchy
      ▪ High reuse of data structures (e.g., rules, routing tables)
      ▪ Low reuse of packet data
  9. Problem Formulation: Existing Approaches - Performance Prediction

    ▪ Dobrescu’s method
      ▪ Memory contention is the key source of throughput drop
      ▪ Models memory as a monolithic resource
      ▪ Single metric: cache access rate (CAR)
    ▪ BubbleUp
      ▪ Models memory as a monolithic resource
      ▪ Single metric: working set size of competing workloads (cache occupancy)
  10. Problem Formulation: Existing Approaches - Performance Isolation

    ▪ ResQ
      ▪ Argues in favor of isolating shared resources
      ▪ Aims to prevent contention-specific performance degradation
      ▪ Leverages CAT and DPDK packet buffer sizing
      ▪ Provides each NF with a dedicated, non-overlapping LLC partition
  11. Problem Formulation: Performance Isolation Problem

    ▪ Isolation is an incomplete solution for contention-induced slowdown
      ▪ Partitioning tools fail to isolate all sources of contention
      ▪ Isolation also leads to inefficient resource utilization
  12. Problem Formulation: Sources of Contention

    ▪ Previous work indicated that the source of contention is the memory subsystem
    ▪ The authors found that contention is multifaceted
  13. Problem Formulation: Sources of Contention

    ▪ Contention in the LLC
      ▪ Contention that compromises fast access to auxiliary data structures containing data needed for packet processing
    ▪ Contention for DDIO
      ▪ Contention that slows down packets on the direct path between the NICs and the LLC
    ▪ Contention for main memory bandwidth
      ▪ Contention that increases the latency to service an LLC miss from main memory
  14. Environment

    ▪ Intel Xeon E5-2620 v4 (Broadwell)
      ▪ 2 × Intel XL710 (40 Gbps)
    ▪ Intel Xeon Silver 4110 (Skylake)
      ▪ 2 × Mellanox MT27700 (100 Gbps)
    ▪ SR-IOV is used to share the NICs among NFs
  15. Problem Formulation: LLC Contention Depends on 1. Cache Occupancy 2. Cache Access Rate

    ▪ No DDIO in this test; competing NFs use separate memory channels
    ▪ The red line marks exhaustion of the available LLC space
    ▪ Before the red line: occupancy is the best predictor of performance
    ▪ After the red line: cache access rate (CAR) is the best predictor of performance
  16. Problem Formulation: DDIO Contention Depends on 1. Competitors’ Space Utilization 2. Access Rate

    ▪ DDIO partitions the LLC into a primary cache and an I/O cache
    ▪ Contention can occur when the total number of packets exceeds the space available in the I/O cache (even though the LLC as a whole remains underloaded)
  17. Problem Formulation: Main Memory Latency Depends on Total Memory Bandwidth Consumption

    ▪ CAT is used to isolate the LLC
    ▪ The cache miss rate stays stable, but throughput still goes down
  18. Proposed Method: SLOMO

    ▪ 2 logical parts:
      ▪ Offline: characterize the contentiousness and model the sensitivity of each NF instance
      ▪ Online: predict the performance of an NF instance against a mix of real competitors
  19. Proposed Method - SLOMO Introduction: Offline Profiling

    ▪ $S = \{NF_i, \ldots\}$, set of NFs
    ▪ $Arch_k$, server architecture
    ▪ $(NF_i, Arch_k)$, contentiousness tuple
    ▪ $V^x$, synthetic contentiousness vector
    ▪ $P_i^x$, performance of $NF_i$ in response to synthetic contentiousness vector $V^x$
    ▪ $M_i : V \rightarrow P$, sensitivity model, which is trained on $\{(V^x, P_i^x), \ldots\}$
    ▪ The operator runs each $NF_i$ with multiple configurations (tunable synthetic workload)
    ▪ To profile for sensitivity, the authors measure $P_i^x$ on each architecture
    ▪ To profile for contentiousness, the operator collects a set of vectors $\{V_i^x\}$
  20. Proposed Method - SLOMO Introduction: Offline Profiling

    ▪ These profiling datasets are specific to:
      ▪ A particular NF type
      ▪ Configuration
      ▪ Traffic workload
      ▪ Server architecture
    ▪ In practice, a cluster may use only one or a small number of server architectures (which do not change frequently)
    ▪ However, an NF’s ruleset or traffic workload might change after deployment
  21. Proposed Method - SLOMO Introduction: Online Predictions

    ▪ At runtime, the operator uses the pre-computed $V_i$’s and $M_i$’s for prediction
    ▪ 2 NFs ($NF_A$, $NF_B$)
      ▪ To predict $NF_A$’s throughput:
      ▪ Put contentiousness vector $V_B$ into sensitivity model $M_A$ to produce $P_A^B$
    ▪ 3 NFs ($NF_A$, $NF_B$, $NF_C$)
      ▪ Use composition function $CF : V_B, V_C \rightarrow V_{B,C}$ to compute $V_{B,C}$ offline
      ▪ Then apply the "2 NFs" procedure above
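
The online step can be sketched in a few lines of Python. This is an illustrative sketch only: the toy sensitivity model `m_a`, the two-component vectors, and the choice of an element-wise sum for the composition function are assumptions for the example, not the paper's actual models or metrics.

```python
from typing import Callable, Dict, List

Vector = List[float]                          # contentiousness vector V
SensitivityModel = Callable[[Vector], float]  # M_i : V -> P

def compose(vectors: List[Vector]) -> Vector:
    """Composition function CF (assumed here to be an element-wise sum)."""
    return [sum(component) for component in zip(*vectors)]

def predict(target: str, competitors: List[str],
            models: Dict[str, SensitivityModel],
            contentiousness: Dict[str, Vector]) -> float:
    """Feed the competitors' aggregate contentiousness into the target's model."""
    aggregate = compose([contentiousness[name] for name in competitors])
    return models[target](aggregate)

# Hypothetical pre-computed profiles.
def m_a(v: Vector) -> float:
    return max(0.0, 10.0 - 2.0 * v[0] - 0.5 * v[1])  # toy M_A

models = {"A": m_a}
contentiousness = {"B": [1.0, 2.0], "C": [0.5, 1.0]}

p_two = predict("A", ["B"], models, contentiousness)         # 2 NFs: M_A(V_B)
p_three = predict("A", ["B", "C"], models, contentiousness)  # 3 NFs: M_A(V_{B,C})
print(p_two, p_three)  # 7.0 5.5
```

The three-NF case reduces to the two-NF case once the competitors' vectors are composed into a single vector.
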
  22. Proposed Method: SLOMO in Depth

    ▪ SLOMO has 3 key components: contentiousness characterization, sensitivity modeling, and contentiousness composition
    ▪ It takes a data-driven approach to designing these components:
      ▪ Modeling sensitivity is a model fitting process
      ▪ Choosing contentiousness metrics is a feature selection process
      ▪ Composition is a simple regression modeling problem
  23. Proposed Method - SLOMO in Depth: Candidate Contentiousness Metrics

    ▪ Candidate contentiousness metrics are chosen from the Intel PCM framework
    ▪ Therefore, a natural limitation of SLOMO is:
      ▪ It is limited by the pool of metrics exposed by PCM
      ▪ PCM does not provide visibility into the internals of a NIC
      ▪ Any congestion inside the NIC (e.g., queue occupancy) is not taken into consideration
  24. Proposed Method - SLOMO in Depth: Synthetic Competition (only used during offline profiling)

    ▪ Exercise the effects of contention on each NF with a synthetic workload of tunable intensity
    ▪ Sample the space of possible contentiousness values an NF could generate
    ▪ The result is a contentiousness vector dataset
    ▪ Experiment:
      ▪ A Click-based NF applies incremental pressure to:
        1. The I/O datapath: through the number of allocated packet buffers
        2. The packet-processing datapath: by performing a configurable number of memory operations
  25. Proposed Method - SLOMO in Depth: Synthetic Competition (Cont.) (only used during offline profiling)

    ▪ Exercise these configurations for various traffic patterns
      ▪ Rates, packet sizes, and flow counts
    ▪ Exercise these configurations for various numbers of co-running instances
    ▪ SLOMO profiles each NF with more than 1000 different configurations to get:
      ▪ PCM values when the synthetic workload and the NF under test each run solo
      ▪ PCM values when the synthetic workload co-runs with the NF under test
      ▪ Performance of the target NF when running with the synthetic competitor
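
A profiling sweep of this kind is essentially a Cartesian product over the tunable knobs. The sketch below shows one way to enumerate it; every knob name and value is a hypothetical placeholder rather than a setting taken from the paper.

```python
from itertools import product

# All values below are hypothetical placeholders, not the paper's settings.
SWEEP = {
    "packet_buffers": [512, 1024, 2048, 4096],        # I/O datapath pressure
    "memory_ops_per_packet": [0, 16, 64, 256, 1024],  # packet-processing pressure
    "rate_mpps": [1, 5, 10],
    "packet_size_bytes": [64, 512, 1500],
    "flow_count": [1_000, 100_000],
    "co_runners": [1, 2, 3],
}

def enumerate_configs(sweep):
    """Return one dict per combination of knob values."""
    names = list(sweep)
    return [dict(zip(names, values))
            for values in product(*(sweep[name] for name in names))]

configs = enumerate_configs(SWEEP)
print(len(configs))  # 4 * 5 * 3 * 3 * 2 * 3 = 1080
print(configs[0])
```

Even a handful of values per knob already yields more than 1000 configurations, which gives a sense of why offline profiling is slow.
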
  26. Proposed Method - SLOMO in Depth: Contentiousness Metrics Selection

    ▪ Goal: prevent unrelated PCM metrics from hurting model accuracy
    ▪ Use the Pearson correlation coefficient to analyze the statistical dependence
      ▪ PCM metrics vs. the observed performance of each NF
      ▪ $r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
    ▪ Use a model-free (reinforcement learning) technique to train the sensitivity model
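
Correlation-based metric selection can be sketched as follows. The metric names, sample values, and the 0.5 cut-off are assumptions chosen for illustration; the slides do not state which threshold the authors used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / sqrt(var_x * var_y)

def select_metrics(samples, performance, threshold=0.5):
    """Keep metrics whose |r| against performance reaches the (assumed) threshold."""
    return [name for name, values in samples.items()
            if abs(pearson(values, performance)) >= threshold]

# Hypothetical measurements: one value per profiling run.
throughput = [10, 8, 6, 4]
samples = {
    "llc_misses": [1, 2, 3, 4],       # strongly (negatively) correlated
    "power_watts": [50, 51, 50, 51],  # weakly correlated
}
selected = select_metrics(samples, throughput)
print(selected)  # ['llc_misses']
```
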
  27. Proposed Method - SLOMO in Depth: Contentiousness Metrics Selection (Cont.)

    ▪ PCM metrics at CPU-socket-level and system-level granularities adequately capture aggregate contentiousness
    ▪ These are used rather than core-level metrics (since each NF is isolated on a dedicated core)
  28. Proposed Method - SLOMO in Depth: Contentiousness Metrics Selection (Cont.)

    ▪ Different sources of contention are best captured by different metrics
      ▪ NFs can depend on multiple contention sources
    ▪ DDIO contention: best captured through memory bandwidth utilization metrics
      ▪ This is because packet buffer evictions in the DMA engine are not captured by LLC metrics
    ▪ LLC contention: best captured through LLC-related metrics
    ▪ Memory bandwidth: best captured through memory bandwidth utilization metrics
    ▪ About 15 important metrics can be used
  29. Proposed Method - SLOMO in Depth: Modeling Sensitivity

    ▪ Can be viewed as a regression problem because:
      ▪ Its input (contentiousness of the competition) and
      ▪ Its output (target NF performance) are both continuous variables
    ▪ Each NF needs its own model since:
      ▪ Different NFs respond differently to the various sources of contention
    ▪ Training: use the synthetic, NF-specific contentiousness data described earlier
    ▪ Run time: replace the synthetic inputs with the aggregate contentiousness of the real competitors
    ▪ Testing: for each NF and architecture, generate a dataset of real experiments in which the target NF is co-run with various combinations of NFs
  30. Proposed Method - SLOMO in Depth: Sensitivity Can Be a Complex Function That Cannot Be Captured by Simple Regression Models

    ▪ Sensitivity is a non-linear and non-continuous function (with multivariate input)
    ▪ It cannot be accurately modeled with:
      ▪ Regression (linear, polynomial)
      ▪ Decision trees
      ▪ Simple neural networks
    ▪ Nonetheless, a common pattern across sensitivity functions is phase transitions
  31. Proposed Method - SLOMO in Depth: Sensitivity Can Be a Complex Function That Cannot Be Captured by Simple Regression Models (Cont.)

    ▪ Phase transitions
      ▪ When LLC occupancy > LLC size: the cache miss rate rises sharply, so performance drops sharply
  32. Proposed Method - SLOMO in Depth: Sensitivity Can Be Modeled as a Piecewise Function of Its Input

    ▪ Model the different sub-spaces of sensitivity separately
    ▪ Then combine the resulting models into a larger, comprehensive one
    ▪ In machine learning, this technique is called ensemble modeling
    ▪ This paper uses Gradient Boosting Regression
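
To show why boosting suits a function with a phase transition, here is a small gradient boosting regressor for squared error, written from scratch with one-split regression stumps on a single input. It is a sketch of the general technique, not the authors' implementation: the data points (competitor occupancy in MB versus target throughput) are invented, and a real deployment would more likely use a library such as scikit-learn's `GradientBoostingRegressor` over many PCM metrics.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosted(xs, ys, rounds=50, learning_rate=0.5):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [p + learning_rate * (left_value if x <= threshold else right_value)
                       for x, p in zip(xs, predictions)]

    def model(x):
        return base + learning_rate * sum(lv if x <= t else rv for t, lv, rv in stumps)
    return model

# Invented data: throughput is flat until occupancy passes ~20 MB, then falls.
occupancy_mb = [2, 6, 10, 14, 18, 22, 26, 30]
throughput = [10, 10, 10, 10, 10, 6, 5, 4]
model = fit_boosted(occupancy_mb, throughput)
print(round(model(4), 2), round(model(28), 2))
```

Each stump handles one sub-space of the input, and their sum forms the piecewise function described on the slide.
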
  33. Proposed Method - SLOMO in Depth: Measuring Contentiousness

    ▪ Slide annotation: "Inaccurate!"
    ▪ Measuring $NF_i$’s contentiousness:
      ▪ Measure the PCM values for $NF_i$ while it runs alone on the server
      ▪ Measure contentiousness for $NF_i$ while it runs against the various synthetic competitors $NF_x$
      ▪ Each time, $NF_i$ is subjected to a unique $V^x$, so we get a set of $V_i$
      ▪ Group the $V_i^x$ by the number of co-runners (utilized cores)
      ▪ Then take the average of each group, which is $V_B^C$ in the 3-NF case
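
The grouping and averaging step might look like the sketch below. The tuple layout `(number of co-runners, vector)` and the sample numbers are assumptions made for the example.

```python
from collections import defaultdict

def average_by_corunners(measurements):
    """Average contentiousness vectors per co-runner count.

    `measurements` is a list of (num_corunners, vector) pairs, where each
    vector holds the PCM-derived metrics observed for NF_i in one run.
    """
    groups = defaultdict(list)
    for num_corunners, vector in measurements:
        groups[num_corunners].append(vector)
    return {
        num_corunners: [sum(column) / len(column) for column in zip(*vectors)]
        for num_corunners, vectors in groups.items()
    }

# Hypothetical runs of NF_i against different synthetic competitors.
runs = [
    (1, [2.0, 4.0]),
    (1, [4.0, 6.0]),
    (2, [1.0, 3.0]),
]
profile = average_by_corunners(runs)
print(profile)  # {1: [3.0, 5.0], 2: [1.0, 3.0]}
```
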
  34. Proposed Method - SLOMO in Depth: Measuring Contentiousness (Cont.)

    ▪ Composition:
      ▪ Composition is straightforward, since the aggregate contentiousness metrics we wish to estimate are by definition the sum or average of the constituent per-core metrics
      ▪ E.g., the CAR of a CPU socket is the sum of the CARs of the cores in that socket
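
As a worked instance of the sum case, with per-core cache access rates that are purely hypothetical (three cores, values in millions of accesses per second):

```latex
\mathrm{CAR}_{\text{socket}} \;=\; \sum_{c \,\in\, \text{cores}} \mathrm{CAR}_{c}
\;=\; 30 + 45 + 25 \;=\; 100
```

For a metric defined as an average, the same sum would simply be divided by the number of cores.
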
  35. Proposed Method - Extrapolating Sensitivity: Measuring Contentiousness

    ▪ $NF_i$ might experience modifications during its lifecycle (e.g., migration across servers, changes in configuration or traffic)
    ▪ This turns the NF into $NF'_i$
    ▪ SLOMO can extrapolate a quick-yet-accurate performance prediction for $NF'_i$
      ▪ Without triggering a slow offline profiling operation
      ▪ By leveraging the existing profiles of $NF_i$
  36. Proposed Method - Extrapolating Sensitivity: Changes in an NF’s Traffic Configuration Change Its Reliance on Shared Memory (Sensitivity)

    ▪ Example: an NF whose sensitivity (to memory contention) depends heavily on the number of unique traffic flows it receives
    ▪ When the number of flows decreases, its sensitivity also decreases
  37. Proposed Method - Extrapolating Sensitivity: Scope of Extrapolation

    ▪ The extrapolation heuristic is based on the assumption that the change to $NF_i$ is small
      ▪ Thus, there is overlap between the sensitivity profiles of $NF_i$ and $NF'_i$
    ▪ If the configuration or traffic profiles differ significantly (e.g., a firewall with 1 vs. 10k rules)
      ▪ There is little to no overlap between the respective sensitivity profiles
  38. Evaluation: Comparison

    ▪ SLOMO is accurate, with a mean prediction error of 5.4%
      ▪ Reducing Dobrescu’s 12.72% error by 58%
      ▪ Reducing BubbleUp’s 15.2% average error by 64%
    ▪ SLOMO’s predictions are robust across operating conditions
    ▪ The design decisions behind each of SLOMO’s components contribute to improved accuracy
    ▪ SLOMO is efficient and enables smart scheduling decisions in an NFV cluster
    ▪ SLOMO is extensible, allowing accurate extrapolation of the sensitivity function of new NF instances to account for changes in an NF’s traffic profile or configuration
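
The quoted reductions follow directly from the three error figures, as this short check shows (the helper name `relative_reduction` is ours):

```python
def relative_reduction(baseline_error, new_error):
    """Percentage by which `new_error` improves on `baseline_error`."""
    return (baseline_error - new_error) / baseline_error * 100

slomo_error = 5.4  # mean prediction error in percent

vs_dobrescu = relative_reduction(12.72, slomo_error)
vs_bubbleup = relative_reduction(15.2, slomo_error)
print(round(vs_dobrescu), round(vs_bubbleup))  # 58 64
```
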
  39. Evaluation: Accuracy

    ▪ SLOMO’s average prediction error and error variance are the best
    ▪ In some cases the results are very close (VPN, stateless firewall, Maglev)
      ▪ This is because these NFs are not I/O bound
  40. Evaluation: Robustness

    ▪ The absolute error follows an increasing trend as a function of the number of competing NFs
      ▪ This is due to the additive, composition-related error factor
  41. Evaluation: Factor Analysis

    ▪ Top 3 metrics for a collection of NF instances
    ▪ 2 architectures: Broadwell, Skylake
    ▪ Skylake shows more "MEM WRITE"
      ▪ Smaller LLC (11 MB vs. Broadwell’s 20 MB)
      ▪ No write-back cache policy
  42. Use case: Scheduling for Cluster

    ▪ The operator’s goal is to maximize resource utilization while maintaining SLAs
    ▪ If there is no feasible schedule:
      ▪ The operator provisions an additional server
    ▪ The authors exhaustively run all possible combinations
    ▪ Resource overhead: the number of additional machines needed relative to the optimal schedule
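
A contention-aware placement loop could be sketched as below. This is a hedged illustration: the greedy first-fit strategy, the toy `predict` function standing in for a SLOMO-style predictor, and all NF names and numbers are assumptions, not the scheduler evaluated in the paper.

```python
def meets_sla(colocated, predict, sla):
    """True if every NF on the server is predicted to meet its SLA."""
    return all(
        predict(nf, [other for other in colocated if other != nf]) >= sla[nf]
        for nf in colocated
    )

def schedule(nfs, predict, sla, cores_per_server):
    """Greedy first-fit: reuse a server only if no co-located NF violates its SLA."""
    servers = []
    for nf in nfs:
        for server in servers:
            candidate = server + [nf]
            if len(candidate) <= cores_per_server and meets_sla(candidate, predict, sla):
                server.append(nf)
                break
        else:
            servers.append([nf])  # no feasible placement: provision another server
    return servers

# Toy predictor (assumption): solo throughput minus each competitor's pressure.
solo = {"fw": 10, "nat": 10, "ids": 10, "vpn": 10}
pressure = {"fw": 1, "nat": 2, "ids": 4, "vpn": 1}

def predict(target, competitors):
    return solo[target] - sum(pressure[c] for c in competitors)

sla = {name: 7 for name in solo}
placement = schedule(["fw", "nat", "ids", "vpn"], predict, sla, cores_per_server=4)
print(placement)  # [['fw', 'nat', 'vpn'], ['ids']]
```

Comparing the number of servers used here with the number an exhaustive search would need gives the resource overhead mentioned on the slide.
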
  43. Conclusion

    ▪ Goal
      ▪ Predict the contention between co-located NFs
      ▪ Help provisioning and placement decisions in NFV orchestration frameworks
    ▪ Method
      ▪ Take a data-driven approach to designing SLOMO
      ▪ Take multiple variables into consideration
      ▪ Use machine learning to build the model
    ▪ Result
      ▪ The prediction error is much lower than in previous works
  44. Pros & Cons

    ▪ Pros
      ▪ Comprehensive analysis of contention in the memory subsystem
      ▪ Part of the implementation code is open source
    ▪ Cons
      ▪ No details on the training and prediction process
      ▪ Some statements lack detail
      ▪ No consideration of the NIC