
Applying The Universal Scalability Law to Distributed Systems

After demonstrating that both Amdahl's law and the Universal Scalability Law (USL) are fundamental, in that they can be derived from queueing theory, the USL is applied to quantify the scalability of a wide variety of distributed systems, including Memcache, Hadoop, Zookeeper, the AWS cloud, and the Avalanche DLT. Contact is also made with some well-known concepts in modern distributed systems, viz., Hasse diagrams, the CAP theorem, eventual consistency, and Paxos coordination.

Dr. Neil Gunther

February 16, 2019

Transcript

  1. Applying The Universal Scalability Law to Distributed Systems Dr. Neil

    J. Gunther Performance Dynamics Distributed Systems Conference Pune, INDIA 16 February 2019 SM c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 1 / 81
  2. Introduction Outline 1 Introduction 2 Universal Computational Scalability (8) 3

    Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 2 / 81
  3. Introduction It’s all about NONLINEARITIES “Using a term like nonlinear

    science is like referring to the bulk of zoology as the study of non-elephant animals.” –Stan Ulam Models are a must All performance analysis (including scalability analysis) is about nonlinearity. But nonlinear is a class name, like cancer: there are at least as many forms of cancer as there are cell types. You need to correctly characterize the specific cell type to determine the best treatment. Similarly, you need to correctly characterize the performance nonlinearity in order to improve it. The only sane way to do that is with the appropriate performance model. Data alone is not sufficient. 1 Algorithms are only half the scalability story 2 Measurements are the other half 3 And models are the other other half c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 3 / 81
  4. Introduction What is distributed computing? “A collection of independent computers

    that appears to its users as a single coherent system,” –Andy Tanenbaum (2007). Anything outside a single execution unit (von Neumann bottleneck). PROBLEM: how to connect these? [Diagram: Processor, Cache, RAM, Disk connected by “?”.] Tightly coupled vs. loosely coupled. Many issues are not really new; a question of scale. Confusing overloaded terminology: function, consistency, linearity, monotonicity, concurrency, .... c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 4 / 81
  5. Introduction Easy to be naive about scalability COST (Configuration that

    Outperforms a Single Thread), 2015¹: single-threaded scaling can be better than multi-threaded scaling. (And your point is? —njg) [Embedded first page of “Scalability! But at what COST?” by Frank McSherry, Michael Isard, and Derek G. Murray. Abstract: “We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads. We survey measurements of data-parallel systems recently reported in SOSP and OSDI, and find that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.” Epigraph: “You can have a second computer once you’ve shown you know how to use the first one.” –Paul Barham. Figure 1: scaling and performance measurements for a data-parallel algorithm, before (system A) and after (system B) a simple performance optimization; the unoptimized implementation “scales” far better, despite (or rather, because of) its poor performance.] Gene Amdahl, 1967²: demonstration is made of the continued validity of the single processor approach and of the weaknesses of the multiple processor approach. (Wrong then but REALLY wrong now! —njg) All single CPUs/cores are now tapped out at ≤ 5 GHz since 2005. SPEC.org benchmarks made the same mistake 30 yrs ago: SPECrate — multiple single-threaded instances (dumb fix); SPEC SDM — Unix multi-user processes (very revealing). 1 McSherry et al., “Scalability! But at what COST?”, Usenix conference, 2015. 2 “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” AFIPS conference, 1967. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 5 / 81
  6. Introduction Can haz linear scalability Shared nothing architecture 3 —

    not easy in general⁴. Tandem Himalaya with NonStop SQL: local processing and memory + no global lock; linear OLTP scaling on TPC-C up to N = 128 nodes, c.1996. [Diagram: processors, each with processor (P), cache (C), and memory (M), on a shared-nothing interconnect network.] 3 D. J. DeWitt and J. Gray, “Parallel Database Systems: The Future of High Performance Database Processing,” Comm. ACM, Vol. 36, No. 6, 85–98 (1992). 4 NJG, “Scaling and Shared Nothingness,” 4th HPTS Workshop, Asilomar, California (1993). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 6 / 81
  7. Universal Computational Scalability (8) Outline 1 Introduction 2 Universal Computational

    Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 7 / 81
  8. Universal Computational Scalability (8) Motivation for USL Pyramid Technology, c.1992

    Vendor    Model            Unix DBMS   TPS-B
    nCUBE     nCUBE2           × OPS       1017.0
    Pyramid   MISserver        UNIFY       468.5
    DEC       VaxCluster4      × OPS       425.7
    DEC       VaxCluster3      × OPS       329.8
    Sequent   Symmetry 2000    ORA         319.6
    c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 8 / 81
  9. Universal Computational Scalability (8) Amdahl’s law How not to write

    a formula (Hennessy & Patterson, p.30) How not to plot performance data (Wikipedia) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 9 / 81
  13. Universal Computational Scalability (8) Universal scaling properties

    [Four capacity-vs-processes sketches.] 1. Equal bang for the buck (α = 0, β = 0): everybody wants this. 2. Diminishing returns (α > 0, β = 0): everybody usually gets this. 3. Bottleneck limit 1/α (α >> 0, β = 0): everybody hates this. 4. Retrograde throughput (α > 0, β > 0): everybody thinks this never happens. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 10 / 81
  22. Universal Computational Scalability (8) The universal scalability law (USL) N:

    processes provide system stimulus or load. XN : response function or relative capacity. Question: what kind of function? Answer: a rational function⁵ (nonlinear): XN (α, β, γ) = γN / (1 + α(N − 1) + βN(N − 1)). Three Cs: 1. Concurrency (0 < γ < ∞), 2. Contention (0 < α < 1), 3. Coherency (0 < β < 1). 5 NJG, “A Simple Capacity Model of Massively Parallel Transaction Systems,” CMG Conference (1993). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 11 / 81
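    Not part of the deck: a minimal R sketch of the USL response function exactly as defined above, handy for reproducing the scaling curves on the following slides (the parameter values below are illustrative, not fitted).

        # Universal Scalability Law: throughput (or relative capacity) at load N
        # gamma = concurrency (slope at N = 1), alpha = contention, beta = coherency
        usl <- function(N, alpha, beta, gamma = 1) {
          gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1))
        }

        N <- 1:100
        linear     <- usl(N, alpha = 0,    beta = 0)        # equal bang for the buck
        amdahl     <- usl(N, alpha = 0.05, beta = 0)        # diminishing returns
        retrograde <- usl(N, alpha = 0.05, beta = 0.001)    # retrograde throughput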
  26. Universal Computational Scalability (8) Measurement meets model

    X(N)/X(1) thruput data −→ CN capacity metric ←− N / (1 + α(N − 1) + βN(N − 1)) USL model. [Plot: relative capacity C(N) vs. processes N (1–100), showing linear scaling, Amdahl-like scaling, and retrograde scaling curves.] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 12 / 81
  27. Universal Computational Scalability (8) Why is coherency quadratic? USL coherency

    term βN(N − 1) is O(N²). The fully connected graph KN with N nodes (e.g., K3, K6, K8, K16) has (N choose 2) ≡ N! / (2!(N − 2)!) = ½N(N − 1) edges (quadratic in N) that represent: communication between each pair of nodes⁶; exchange of data or objects between processing nodes; the actual performance impact is captured by the magnitude of the β parameter. 6 This term can also be related to Brooks’ law (too many cooks increase delivery time). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 13 / 81
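    A quick numerical check (mine, not the deck’s) that the pairwise-exchange count grows quadratically with node count:

        # Edges of the fully connected graph K_N: choose(N, 2) = N(N-1)/2
        N <- c(3, 6, 8, 16)
        data.frame(nodes = N, exchange_pairs = choose(N, 2))   # 3, 15, 28, 120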
  31. Universal Computational Scalability (8) Coherency or consistency?

    [Slide from the PODC Keynote, July 19, 2000 — “The CAP Theorem”: Consistency, Availability, Tolerance to network Partitions. Theorem: you can have at most two of these properties for any shared-data system.] What the man said. [Venn diagram with regions C, A, P and pairwise overlaps CA, CP, AP.] Correct Venn diagram. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 14 / 81
  32. Universal Computational Scalability (8) Eventual consistency and monotonicity Consistency: partial

    order of replicas via a binary relation (posets). Hasse diagrams: replica events converge (like time) to the largest (cumulative) value. Monotonicity: any upward Hasse path terminates with the same result. The developer view is not a performance guarantee. [Hasse diagrams: a lattice of integers ordered by the binary max relation, converging upward to 9; a lattice of sets ordered by the binary join relation, converging upward to {a, b, c, d}.] Logical monotonicity = monotonically increasing function. A monotonic application can exhibit monotonically decreasing scalability. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 15 / 81
  33. Universal Computational Scalability (8) Determining the USL parameters Throughput measurements

    XN at various process loads N sourced from: 1. a load-test harness, e.g., LoadRunner, JMeter; 2. production monitoring, e.g., Linux /proc, JMX interface. Want to determine the α, β, γ that best model the XN data: XN (α, β, γ) = γN / (1 + α(N − 1) + βN(N − 1)). XN is a rational function (tricky). Brute force: Gene Amdahl 1967 (only α). Clever ways: 1. Optimization Solver in Excel; 2. nonlinear statistical regression in R. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 16 / 81
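    Since the deck does this step with nonlinear regression in R, here is a minimal sketch of what that fit looks like; the data frame and starting values are made-up placeholders, not the measurements analyzed later in the talk.

        # df holds the measurements: N = process load, X = measured throughput
        df <- data.frame(N = c(1, 2, 4, 8, 16, 32),
                         X = c(85, 160, 290, 480, 680, 750))   # illustrative numbers

        fit <- nls(X ~ gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1)),
                   data      = df,
                   start     = list(alpha = 0.05, beta = 0.001, gamma = df$X[1]),
                   lower     = c(alpha = 0, beta = 0, gamma = 0),
                   algorithm = "port")   # bounded fit keeps the parameters physical
        coef(fit)   # estimated alpha, beta, gamma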
  34. Queueing Theory View of the USL (15) Outline 1 Introduction

    2 Universal Computational Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 17 / 81
  35. Queueing Theory View of the USL (15) Queueing theory for

    those who can’t wait. Amdahl’s law for parallel speedup SA ≡ T1 /TN with N processors⁷: SA (N) = N / (1 + α(N − 1))  (1). 1. Parameter 0 ≤ α ≤ 1 is the serial fraction of time spent running single-threaded. 2. Subsumed by the USL model contention term (with γ = 1 and β = 0). 3. Msg: forget linearity, you’re gonna hit a ceiling at lim N→∞ SA ∼ 1/α. Some people (all named Frank?) consider Amdahl’s law to be ad hoc: Franco P. Preparata, “Should Amdahl’s Law Be Repealed?”, Proc. 6th International Symposium on Algorithms and Computation (ISAAC’95), Cairns, Australia, December 4–6, 1995, https://www.researchgate.net/publication/221543174_Should_Amdahl%27s_Law_Be_Repealed_Abstract; Frank L. Devai, “The Refutation of Amdahl’s Law and Its Variants,” 17th International Conference on Computational Science and Applications (ICCSA), Trieste, Italy, July 2017, https://www.researchgate.net/publication/318648496_The_Refutation_of_Amdahl%27s_Law_and_Its_Variants. 7 Gene Amdahl (1967) never wrote down this equation; he measured the mainframe workloads by brute force. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 18 / 81
  36. Queueing Theory View of the USL (15)

    [Plot: relative capacity C(N) vs. processes N, showing linear, Amdahl-like, and retrograde scaling.] The top blue curve is Amdahl scaling approaching a ceiling speedup of C(N) ∼ 20, which also means α ∼ 0.05, i.e., the app is serialized for 5% of the runtime. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 19 / 81
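    The ceiling read off that curve is just Amdahl’s limit (restating the slide’s arithmetic in LaTeX):

        \lim_{N\to\infty} S_A(N) \;=\; \lim_{N\to\infty} \frac{N}{1+\alpha(N-1)} \;=\; \frac{1}{\alpha},
        \qquad \alpha \approx 0.05 \;\Longleftrightarrow\; S_A(\infty) \approx 20 .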
  42. Queueing Theory View of the USL (15) Queue characterization

    [Diagrams: a familiar queue and an abstract queue — customers arriving, waiting (W), servicing (S), departing; time: R = W + S; space: Q = L + mρ.] 1. Where is the service facility? 2. Did anything finish yet? 3. Is anything waiting? 4. Is anything arriving? c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 20 / 81
  43. Queueing Theory View of the USL (15) Amdahl queueing model

    [Figure 1: Amdahl queueing model — processing nodes serviced by an interconnect network.] Classical: queueing model of assembly-line robots (LHS bubbles). Classical: robots that fail go to service stations (RHS) for repairs. Classical: repair queues impact manufacturing “cycle time”. Amdahl: robots become a finite number of (distributed) processing nodes. Amdahl: repair stations become the interconnect network. Amdahl: network latency impacts application scalability. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 21 / 81
  44. Queueing Theory View of the USL (15) Amdahl queueing theorem

    Theorem 1 (Gunther 2002) Amdahl’s law for parallel speedup is equivalent to the synchronous queueing bound on throughput in the machine repairman model of a multi-node computer system. Proof. 1 ”A New Interpretation of Amdahl’s Law and Geometric Scalability”, arXiv, Submitted 17 Oct (2002) 2 “Celebrity Boxing (and sizing): Alan Greenspan vs. Gene Amdahl,” 28th International Computer Measurement Group Conference, December 8–13, Reno, NV (2002) 3 “Unification of Amdahl’s Law, LogP and Other Performance Models for Message-Passing Architectures”, Proc. Parallel and Distributed Computer Systems (PDCS’05) November 14–16, 569–576, Phoenix, AZ (2005) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 22 / 81
  45. Queueing Theory View of the USL (15) Queueing bounds on

    throughput. [Figure 2: average (blue) and synchronous (red) throughput X(N) vs. processing nodes N, with the linear bound Xlin (N), the saturation ceiling Xmax = 1 / Smax, the mean throughput Xmean (N) = N / (RN + Z), and the synchronous bound Xsync (N) = N / (N R1 + Z).] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 23 / 81
  46. Queueing Theory View of the USL (15) The speedup mapping

    1. Z: the mean processor execution time. 2. Sk : service time at the kth (network) queue. 3. R1 : minimum round-trip time, i.e., Σk Sk + Z for N = 1. Amdahl speedup SA (N) is given by SA (N) = N · R1 / (N · Σk Sk + Z) — queueing form (2), and SA (N) = N / (1 + α(N − 1)) — usual “ad hoc” law (3). Connecting (2) and (3) requires the identity Σk Sk / R1 ≡ α, which → 0 as Sk → 0 (processor only) and → 1 as Z → 0 (network only) (4). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 24 / 81
  47. Queueing Theory View of the USL (15) USL queueing model

    [Figure 3: USL load-dependent queueing model — processing nodes serviced by a load-dependent interconnect.] Amdahl: robots become a finite number of (distributed) processing nodes. Amdahl: repair stations become the interconnect network. Amdahl: network latency impacts application scalability. USL: interconnect service time (latency) depends on the queue length. USL: requests are first exchange-sorted, O(N²). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 25 / 81
  48. Queueing Theory View of the USL (15) USL load-dependent queueing

    Theorem 2 (Gunther 2008) The USL relative capacity is equivalent to the synchronous queueing bound on throughput for a linear load-dependent machine repairman model of a multi-node computer system. Proof. 1 Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services (Springer 2007) 2 A General Theory of Computational Scalability Based on Rational Functions, arXiv, Submitted on 11 Aug (2008) 3 Discrete-event simulations presented in “Getting in the Zone for Successful Scalability” Proc. Computer Measurement Group International Conference, December 7th–12, Las Vegas, NV (2008) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 26 / 81
  49. Queueing Theory View of the USL (15) Experimental results

    [Plot: relative capacity C(N) vs. processing nodes N (0–100): synchronized repairman simulation vs. USL model (α = 0.1, β = 0), and load-dependent repairman simulation vs. USL model (α = 0.1, β = 0.001).] Discrete-event simulations (Jim Holtman 2008, 2018). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 27 / 81
  50. Queueing Theory View of the USL (15) Key points for

    distributed USL Each USL denominator term has a physical meaning (not purely stats model): 1 Constant O(1) ⇒ maximum nodal concurrency (no queueing) 2 Linear O(N) ⇒ intra-nodal contention (synchronous queueing) 3 Quadratic O(N2) ⇒ inter-nodal exchange (load-dependent sync queueing) Useful for interpreting USL analysis to developers, engineers, architects, etc. Example 1 If β > α then look for inter-nodal messaging activity rather than the size of message-queues in any particular node Worst-case queueing corresponds to an analytic equation (rational function) USL is agnostic about the application architecture and interconnect topology So where is all that information contained? (exercise for the reader) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 28 / 81
  51. Distributed Application Scaling (22) Outline 1 Introduction 2 Universal Computational

    Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 29 / 81
  52. Distributed Application Scaling (22) Memcached “Hidden Scalability Gotchas in Memcached

    and Friends” NJG, S. Subramanyam, and S. Parvu Sun Microsystems Presented at Velocity 2010 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 30 / 81
  53. Distributed Application Scaling (22) Distributed scaling strategies Scaleup Scaleout c

    2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 31 / 81
  54. Distributed Application Scaling (22) Scaleout generations Distributed cache of key-value

    pairs Data pre-loaded from RDBMS backend Deploy memcache on previous generation CPUs (not always multicore) Single worker thread ok — until next hardware roll (multicore) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 32 / 81
  55. Distributed Application Scaling (22) Controlled load-tests

    [Plot: memcached throughput X(N) in KOPS vs. worker threads N (1–12).] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 33 / 81
  56. Distributed Application Scaling (22) Explains these memcached warnings Configuring the

    memcached server Threading is used to scale memcached across CPU’s. The model is by ‘‘worker threads’’, meaning that each thread handles concurrent connections. ... By default 4 threads are allocated. Setting it to very large values (80+) will make it run considerably slower. Linux man pages - memcached (1) -t <threads> Number of threads to use to process incoming requests. ... It is typically not useful to set this higher than the number of CPU cores on the memcached server. Setting a high number (64 or more) of worker threads is not recommended. The default is 4. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 34 / 81
  57. Distributed Application Scaling (22) Load test data in R c

    2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 35 / 81
  58. Distributed Application Scaling (22) R regression analysis c 2019 Performance

    Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 36 / 81
  59. Distributed Application Scaling (22) USL scalability model

    [Plot: throughput X(N) in KOPS vs. worker threads N, data with fitted USL curve.] Fitted parameters: α = 0.0468, β = 0.021016, γ = 84.89; Nmax = 6.73, Xmax = 274.87, Xroof = 1814.82. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 37 / 81
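    A small R sketch (mine, not from the slide) showing how the derived quantities in that legend follow from the three fitted parameters:

        alpha <- 0.0468; beta <- 0.021016; gamma <- 84.89
        usl   <- function(N) gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1))

        Nmax  <- sqrt((1 - alpha) / beta)   # load at which throughput peaks
        Xmax  <- usl(Nmax)                  # peak throughput
        Xroof <- gamma / alpha              # Amdahl ceiling if beta were zero
        c(Nmax = Nmax, Xmax = Xmax, Xroof = Xroof)   # ~6.73, ~275 KOPS, ~1814 KOPS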
  60. Distributed Application Scaling (22) Concurrency parameter

    [Same plot, with the linear-scaling (α = 0, β = 0) inset.] 1. γ = 84.89. 2. Slope of the linear bound, in KOPS/thread. 3. Estimate of throughput X(1) = 84.89 KOPS at N = 1 thread. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 38 / 81
  61. Distributed Application Scaling (22) Contention parameter

    [Same plot, with the bottleneck-limit (α >> 0, β = 0) inset.] α = 0.0468: waiting or queueing for resources about 4.6% of the time. Max possible throughput is X(1)/α = 1814.78 KOPS (Xroof ). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 39 / 81
  62. Distributed Application Scaling (22) Coherency parameter

    [Same plot, with the retrograde-throughput (α > 0, β > 0) inset.] β = 0.0210 corresponds to retrograde throughput: distributed copies of data (e.g., caches) have to be exchanged/updated about 2.1% of the time to be consistent. The peak occurs at Nmax = √((1 − α)/β) = 6.73 threads. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 40 / 81
  63. Distributed Application Scaling (22) Improving scalability performance

    [Plot: speedup S(N) vs. threads N (0–50) for mcd 1.2.8, mcd 1.3.2, and mcd 1.3.2 + patch.] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 41 / 81
  64. Distributed Application Scaling (22) Sirius and Zookeeper M. Bevilacqua-Linn, M.

    Byron, P. Cline, J. Moore and S. Muir (Comcast), “Sirius: Distributing and Coordinating Application”, Usenix ATC 2014 P. Hunt, M. Konar, F. P. Junqueira and B. Reed (Yahoo), “ZooKeeper: Wait-free coordination for Internet-scale systems”, Usenix ATC 2010 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 42 / 81
  68. Distributed Application Scaling (22) Coordination throughput data

    [Plot: writes per second vs. cluster size (1–15) for Sirius, Sirius−NoBrain, and Sirius−NoDisk, with the USL model.] It’s all downhill on the coordination ski slopes. But that’s very bad ... isn’t it? What does the USL say? c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 43 / 81
  69. Distributed Application Scaling (22) USL scalability model

    [Plot: writes per second vs. cluster size for Sirius, Sirius−NoBrain, and Sirius−NoDisk, with fitted USL curve.] Fitted parameters: α = 0.05, β = 0.165142, γ = 1024.98; Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 44 / 81
  70. Distributed Application Scaling (22) Concurrency parameter

    [Same plot, with the linear-scaling inset.] 1. γ = 1024.98. 2. A single node is meaningless (need N ≥ 3 for a majority). 3. Interpret γ as the N = 1 virtual throughput. 4. The USL estimates X(1) = 1,024.98 WPS (black square). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 45 / 81
  71. Distributed Application Scaling (22) Contention parameter

    [Same plot, with the bottleneck-limit inset.] α = 0.05: queueing for resources about 5% of the time. Max possible throughput is X(1)/α = 20,499.54 WPS (Xroof ), but Xroof is not feasible in these systems. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 46 / 81
  72. Distributed Application Scaling (22) Coherency parameter

    [Same plot, with the retrograde-throughput inset.] β = 0.1651 says retrograde throughput dominates! Distributed data is being exchanged (compared?) about 16.5% of the time (virtual). The peak is at Nmax = √((1 − α)/β) = 2.4 cluster nodes. Shocking, but that’s what Paxos-type scaling looks like. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 47 / 81
  73. Distributed Application Scaling (22) Comparison of Sirius and Zookeeper

    [Plots: Sirius writes per second vs. cluster size (α = 0.05, β = 0.165142, γ = 1024.98, Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54) and Zookeeper requests per second vs. cluster size with confidence intervals (α = 0.05, β = 0.157701, γ = 41536.96, Nmax = 2.45, Xmax = 62328.47, Xroof = 830739.1).] Both coordination services exhibit equivalent USL scaling parameters: 5% contention delay — not “wait-free” (per the Yahoo paper title); 16% coherency delay, despite Sirius being write-intensive and Zookeeper being read-intensive (i.e., 100x the throughput). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 48 / 81
  74. Distributed Application Scaling (22) Distributed Ledger Scalability A. Charapko, A.

    Ailijiang and M. Demirbas “Bridging Paxos and Blockchain Consensus”, 2018 Team Rocket (Cornell), “Snowflake to Avalanche: A Novel Metastable Consensus Protocol Family for Cryptocurrencies”, 2018 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 49 / 81
  75. Distributed Application Scaling (22) Avalanche protocol VISA can handle 60,000

    TPS; Bitcoin bottlenecks at 7 TPS⁸. That’s FOUR decades of performance inferiority! Avalanche rumor-mongering coordination: leaderless BFT; multiple samples of a small nodal neighborhood (ns ); threshold > k ns when most of the system has “heard” the rumor; greener than Bitcoin (Nakamoto) — no PoW needed. 8 K. Stinchcombe, “Ten years in, nobody has come up with a use for blockchain,” Hackernoon, Dec 22, 2017. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 50 / 81
  76. Distributed Application Scaling (22) Avalanche scalability

    [Plot: throughput (TPS) vs. network nodes (0–2000) with fitted USL curve.] Fitted parameters: α = 0.9, β = 6.5e−05, γ = 1648.27; Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 51 / 81
  77. Distributed Application Scaling (22) Comparison with Paxos

    [Plots: Avalanche throughput (TPS) vs. network nodes (α = 0.9, β = 6.5e−05, γ = 1648.27, Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41) and Sirius writes per second vs. cluster size (α = 0.05, β = 0.165142, γ = 1024.98, Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54).] Contention: α is about 90%, so X(1)/α ⇒ immediate saturation at about 2000 TPS. Coherency: β is extremely small but not zero — much slower ‘linear’ degradation than Zookeeper/Paxos. Saturation throughput is maintained over 2000 nodes (100x Sirius). c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 52 / 81
  78. Distributed Application Scaling (22) Hadoop Super Scaling NJG, P. J.

    Puglia, and K. Tomasette, “Hadoop Superlinear Scalability: The perpetual motion of parallel performance”, Comm. ACM, Vol. 58 No. 4, 46–55 (2015) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 53 / 81
  79. Distributed Application Scaling (22) Super-scalability is a thing

    [Plot: speedup vs. processors, showing superlinear, linear, and sublinear curves.] More than 100% efficient! c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 54 / 81
  80. Distributed Application Scaling (22) Seeing is believing “Speedup in Parallel

    Contexts.” http://en.wikipedia.org/wiki/Speedup#Speedup_in_Parallel_Contexts “Where does super-linear speedup come from?” http://stackoverflow.com/questions/4332967/where-does-super-linear-speedup-come-from “Sun Fire X2270 M2 Super-Linear Scaling of Hadoop TeraSort and CloudBurst Benchmarks.” https://blogs.oracle.com/BestPerf/entry/20090920_x2270m2_hadoop Haas, R. “Scalability, in Graphical Form, Analyzed.” http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html Sutter, H. 2008. “Going Superlinear.” Dr. Dobb’s Journal 33(3), March. http://www.drdobbs.com/cpp/going-superlinear/206100542 Sutter, H. 2008. “Super Linearity and the Bigger Machine.” Dr. Dobb’s Journal 33(4), April. http://www.drdobbs.com/parallel/super-linearity-and-the-bigger-machine/206903306 “SDN analytics and control using sFlow standard — Superlinear.” http://blog.sflow.com/2010/09/superlinear.html Eijkhout, V. 2014. Introduction to High Performance Scientific Computing. Lulu.com c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 55 / 81
  81. Distributed Application Scaling (22) Empirical evidence 16-node Sun Fire X2270

    M2 cluster c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 56 / 81
  82. Distributed Application Scaling (22) Smells like perpetual motion c 2019

    Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 57 / 81
  83. Distributed Application Scaling (22) Terasort cluster emulations Want a controlled

    environment to study superlinearity. The Terasort workload sorts 1 TB of data in parallel; Terasort has benchmarked Hadoop MapReduce performance. We used just 100 GB of data input (weren’t benchmarking anything). Simulate in AWS cloud EC2 instances (more flexible and much cheaper); enabled multiple runs in parallel (thank you Comcast for cycle $s).
    Table 1: Amazon EC2 Configurations
              Optimized for   Processor Arch   vCPU number   Memory (GiB)   Instance Storage (GB)   Network Performance
    BigMem    Memory          64-bit           4             34.2           1 x 850                 Moderate
    BigDisk   Compute         64-bit           8             7              4 x 420                 High
    c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 58 / 81
  84. Distributed Application Scaling (22) USL analysis of BigMem N ≤

    50 nodes. [Plot: USL analysis of BigMem Hadoop Terasort — speedup S(N) vs. EC2 m2 nodes N (0–150).] Fitted parameters: α = −0.0288, β = 0.000447; Nmax = 47.96, Smax = 73.48, Sroof = NA, Ncross = 64.5. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 59 / 81
  85. Distributed Application Scaling (22) USL analysis of BigMem N ≤

    150 nodes. [Plot: USL model of BigMem Hadoop Terasort data — speedup S(N) vs. EC2 m2 nodes N (0–150).] Fitted parameters: α = −0.0089, β = 9e−05; Nmax = 105.72, Smax = 99.53, Sroof = NA, Ncross = 99.14. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 60 / 81
  86. Distributed Application Scaling (22) Speedup on N ≤ 10 BigMem

    Nodes (1 disk). [Plot: BigMem Hadoop Terasort speedup data — S(N) vs. EC2 m2 nodes N (0–15), with superlinear and sublinear regions marked.] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 61 / 81
  87. Distributed Application Scaling (22) Speedup on N ≤ 10 BigDisk

    Nodes (4 disks). [Plot: BigDisk Hadoop Terasort speedup data — S(N) vs. EC2 c1 nodes N (0–15), with superlinear and sublinear regions marked.] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 62 / 81
  88. Distributed Application Scaling (22) Superlinear payback trap Superlinearity is real

    (i.e., measurable), but it is due to a hidden resource constraint (disk or memory subsystem sizing), which sets the wrong normalization for the speedup scale. It shows up in the USL as a negative α value (cf. the GCAP book). Theorem 3 (Gunther 2013): The apparent advantage of superlinear scaling (α < 0) is inevitably lost due to crossover into a region of severe performance degradation (β > 0) that can be worse than if superlinearity had not been present in the first place. The proof was presented at Hotsos 2013. [Plots: speedup vs. processors showing superlinear, linear, and sublinear curves, and the superlinear payback crossover.] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 63 / 81
  89. Distributed in the Cloud (30) Outline 1 Introduction 2 Universal

    Computational Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 64 / 81
  90. Distributed in the Cloud (30) AWS Cloud Application “Exposing the

    Cost of Performance Hidden in the Cloud” NJG and M. Chawla Presented at CMG cloudXchange 2018 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 65 / 81
  91. Distributed in the Cloud (30) Production data Previously measured X

    and R directly on a test rig.
    Table 2: Converting data to performance metrics
    Data    Meaning            Metric             Meaning
    T       Elapsed time       X = C/T            Throughput
    Tp      Processing time    R = (Tp /T)(T/C)   Response time
    C       Completed work     N = X × R          Concurrent threads
    Ucpu    CPU utilization    S = Ucpu /X        Service time
    Example 4 (Coalesced metrics): Linux epoch timestamp interval between rows is 300 seconds.
    Timestamp, X, N, S, R, U_cpu
    1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
    1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
    1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
    1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
    1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
    ...
    c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 66 / 81
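    A sketch of those conversions applied to one 300-second sample; the raw counter values are illustrative (chosen to roughly reproduce the first row above), and the variable names are my own, not from the production pipeline.

        T_elapsed <- 300        # seconds in the sample interval
        C         <- 150651     # completed requests in the interval
        T_proc    <- 50723      # total processing (busy) time, seconds
        U_cpu     <- 0.4581     # measured CPU utilization

        X <- C / T_elapsed                           # throughput, ~502 RPS
        R <- (T_proc / T_elapsed) * (T_elapsed / C)  # response time, ~0.337 s
        N <- X * R                                   # concurrent threads, ~169 (Little's law)
        S <- U_cpu / X                               # service time, ~0.00091 s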
  92. Distributed in the Cloud (30) Tomcat data from AWS

    [Scatter plot: throughput (RPS) vs. Tomcat threads (0–500).] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 67 / 81
  93. Distributed in the Cloud (30) USL nonlinear analysis

    [Plot: throughput (RPS) vs. Tomcat threads with fitted USL curve.] Fitted parameters: α = 0, β = 3e−06, γ = 3; Nmax = 539.2, Xmax = 809.55, Nopt = 274.8. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 68 / 81
  94. Distributed in the Cloud (30) Concurrency parameter

    [Same plot, with the linear-scaling inset.] 1. γ = 3.0. 2. The smallest number of threads during the 24 hr sample is N > 100. 3. Nonetheless the USL estimates throughput X(1) = 3 RPS. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 69 / 81
  95. Distributed in the Cloud (30) Contention parameter

    [Same plot, with the bottleneck-limit inset.] α = 0: no significant waiting or queueing; the max possible throughput Xroof is not defined. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 70 / 81
  96. Distributed in the Cloud (30) Coherency parameter

    [Same plot, with the retrograde-throughput inset.] β = 3 × 10⁻⁶ implies very weak retrograde throughput: extremely little data exchange, but it is entirely responsible for the sublinearity and for the peak throughput Xmax = 809.55 RPS, which occurs at Nmax = 539.2 threads. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 71 / 81
  97. Distributed in the Cloud (30) Revised USL analysis Parallel threads

    imply linear scaling: linear slope γ ∼ 3 (fitted γ = 2.65); there should be no contention, i.e., α = 0. Discontinuity at N ∼ 275 threads: throughput plateaus, i.e., β = 0. Saturation occurs at processor utilization UCPU ≥ 75% — the Linux OS can’t do that! Pseudo-saturation due to the AWS Auto Scaling policy (hypervisor?): many EC2 instances spun up and down during the 24 hrs. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 72 / 81
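    One way to express that revised reading as a model — linear scaling up to the autoscaling knee, then a plateau. This is my sketch of the corrected model shown on the next slide, not code from the deck.

        gamma <- 2.65     # RPS per thread (linear slope)
        Nopt  <- 274.8    # knee imposed by the AWS Auto Scaling policy

        # Linear scaling below the knee, pseudo-saturation above it
        x_revised <- function(N) ifelse(N <= Nopt, gamma * N, gamma * Nopt)

        x_revised(c(100, 274.8, 500))   # ~265, ~728, ~728 RPS (slide quotes Xmax = 727.03)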
  98. Distributed in the Cloud (30) Corrected USL linear model

    [Plot: throughput (RPS) vs. Tomcat threads, showing the parallel-threads region and the pseudo-saturation plateau.] Parameters: α = 0, β = 0, γ = 2.65; Nmax = NaN, Xmax = 727.03, Nopt = 274.8. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 73 / 81
  99. Optimal Node Sizing (38) Outline 1 Introduction 2 Universal Computational

    Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 74 / 81
  100. Optimal Node Sizing (38) 1. Canonical sublinear scalability

    [Plot: thruput X(N) vs. processes N with USL fit.] Parameters: α = 0.05, β = 6e−05, γ = 20.8563; Nmax = 125.5, Xmax = 320.47, Xroof = 417.13. Amdahl curve (dotted), linear scaling bound (LHS red), saturation asymptote (top red); Nopt = 1/α = 20 at the intersection. The USL curve (solid blue) is retrograde: Nopt shifts right to Nmax = 125, and throughput is lower than Xroof. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 75 / 81
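    The optimum quoted there is just the intersection of the linear bound with the Amdahl ceiling, and the retrograde peak follows from the USL parameters (restating the slide’s numbers in LaTeX):

        \gamma N_{\mathrm{opt}} = \frac{\gamma}{\alpha}
        \;\Rightarrow\; N_{\mathrm{opt}} = \frac{1}{\alpha} = \frac{1}{0.05} = 20,
        \qquad
        N_{\max} = \sqrt{\frac{1-\alpha}{\beta}} = \sqrt{\frac{0.95}{6\times 10^{-5}}} \approx 126
        \ (\text{125.5 on the slide, from the unrounded } \beta).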
  101. Optimal Node Sizing (38) 2. Memcache scalability

    [Plots: throughput X(N) vs. worker threads for the initial fit (α = 0.0468, β = 0.021016, γ = 84.89, Nmax = 6.73, Xmax = 274.87, Nopt = 21.38) and throughput X(N) vs. worker processes for the patched fit (α = 0, β = 0.00052, γ = 17.76, Npeak = 43.87, Xpeak = 394.06, Nopt = Inf).] Initial scalability (green curve): Nmax occurs before Nopt, with Nmax = 6.73 and Nopt = 1/0.0468 = 21.37. Sun SPARC patch (blue curve): Nmax = 43.87, which now exceeds the previous Nopt, and throughput increased by approx 50%. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 76 / 81
  102. Optimal Node Sizing (38) 3. Cloud scalability

    [Plots: throughput (RPS) vs. Tomcat threads (α = 0, β = 0, γ = 2.65, Nmax = NaN, Xmax = 727.03, Nopt = 274.8) and response time R(N) vs. Tomcat processes (γ = 2.65, Rmin = 0.38, Sbnk = 0.001365, Nopt = 274.8).] Throughput profile: the X(N) data tightly follow the bounds; Nopt = 274.8 due to the autoscaling policy; significant dispersion about the mean X. Response time profile: autoscaling throttles throughput, causing R(N) to increase linearly. Will users tolerate R(N) beyond Nopt? c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 77 / 81
  103. Optimal Node Sizing (38) 4. DLT scalability

    [Plots: Zookeeper requests per second vs. cluster size (α = 0.05, β = 0.157701, γ = 41536.96, Nmax = 2.45, Xmax = 62328.47, Xroof = 830739.1) and Avalanche throughput (TPS) vs. network nodes (α = 0.9, β = 6.5e−05, γ = 1648.27, Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41).] Coordination (Zookeeper) throughput profile: the optimum is the smallest configuration (likely bigger in real applications); coordination kills — scaling degrades (that’s how it works!); no good for DLT. Avalanche throughput profile: there is no optimum; throughput is similar to the Zookeeper peak; pick a number N — throughput is independent of scale, but the N scale is 100x Zookeeper; generally needed for DLT. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 78 / 81
  104. Summary (45) Outline 1 Introduction 2 Universal Computational Scalability (8)

    3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 79 / 81
  105. Summary (45) Summary All performance is nonlinear Need both data

    and models (e.g., USL) USL is a nonlinear (rational) function USL is architecture agnostic All scalability information contained in parameters: 1 α: contention, queueing 2 β: coherency, consistency (pairwise data exchange) 3 γ: concurrency, X(1) value Apply USL via nonlinear statistical regression CAP-like consistency associated with magnitude of quadratic β term Can apply USL to both controlled (testing) and production platforms USL analysis often different from paper designs and algorithms c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 80 / 81
  106. Summary (45) Thank you for the (virtual) invitation! www.perfdynamics.com Castro

    Valley, California Twitter twitter.com/DrQz Facebook facebook.com/PerformanceDynamics Blog perfdynamics.blogspot.com Training classes perfdynamics.com/Classes [email protected] +1-510-537-5758 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 81 / 81