Slide 1

Slide 1 text

Applying The Universal Scalability Law to Distributed Systems Dr. Neil J. Gunther Performance Dynamics Distributed Systems Conference Pune, INDIA 16 February 2019 SM c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 1 / 81

Slide 2

Slide 2 text

Introduction Outline 1 Introduction 2 Universal Computational Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 2 / 81

Slide 3

Slide 3 text

Introduction It’s all about NONLINEARITIES “Using a term like nonlinear science is like referring to the bulk of zoology as the study of non-elephant animals.” –Stan Ulam Models are a must All performance analysis (including scalability analysis) is about nonlinearity. But nonlinear is a class name, like cancer: there are at least as many forms of cancer as there are cell types. You need to correctly characterize the specific cell type to determine the best treatment. Similarly, you need to correctly characterize the performance nonlinearity in order to improve it. The only sane way to do that is with the appropriate performance model. Data alone is not sufficient. 1 Algorithms are only half the scalability story 2 Measurements are the other half 3 And models are the other other half c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 3 / 81

Slide 4

Slide 4 text

Introduction What is distributed computing? “A collection of independent computers that appears to its users as a single coherent system.” –Andy Tanenbaum (2007) Anything outside a single execution unit (von Neumann bottleneck). PROBLEM: How to connect these? [Diagram: Processor, Cache, RAM, and Disk joined by an unspecified interconnect] Tightly coupled vs. loosely coupled. Many issues are not really new; it is a question of scale. Confusing overloaded terminology: function, consistency, linearity, monotonicity, concurrency, ... c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 4 / 81

Slide 5

Slide 5 text

Introduction Easy to be naive about scalability COST (Configuration that Outperforms a Single Thread), 2015 [1]: Single-threaded scaling can be better than multi-threaded scaling. (And your point is? —njg) [Screenshot of the paper first page: F. McSherry, M. Isard, and D. G. Murray, “Scalability! But at what COST?” — the COST of a platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation; many data-parallel systems reported in SOSP and OSDI have a surprisingly large COST, often hundreds of cores, or simply underperform one thread. “You can have a second computer once you’ve shown you know how to use the first one.” –Paul Barham] Gene Amdahl, 1967 [2]: Demonstration is made of the continued validity of the single processor approach and of the weaknesses of the multiple processor approach. (Wrong then but REALLY wrong now! —njg) All single CPUs/cores now tapped out ≤ 5 GHz since 2005. SPEC.org benchmarks made the same mistake 30 yrs ago. SPECrate: multiple single-threaded instances (dumb fix). SPEC SDM: Unix multi-user processes (very revealing). 1 McSherry et al., “Scalability! But at what COST?”, Usenix conference, 2015 2 “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” AFIPS conference, 1967 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 5 / 81

Slide 6

Slide 6 text

Introduction Can haz linear scalability Shared nothing architecture 3 — not easy in general 4 Tandem Himalaya with NonStop SQL: local processing and memory + no global lock. Linear OLTP scaling on TPC-C up to N = 128 nodes c.1996. [Diagram: processor–cache–memory (P, C, M) nodes connected only by an interconnect network] 3 D. J. DeWitt and J. Gray, “Parallel Database Systems: The Future of High Performance Database Processing,” Comm. ACM, Vol. 36, No. 6, 85–98 (1992) 4 NJG, “Scaling and Shared Nothingness,” 4th HPTS Workshop, Asilomar, California (1993) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 6 / 81

Slide 7

Slide 7 text

Universal Computational Scalability (8) Outline 1 Introduction 2 Universal Computational Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 7 / 81

Slide 8

Slide 8 text

Universal Computational Scalability (8) Motivation for USL Pyramid Technology, c.1992
Vendor   | Model           | Unix DBMS | TPS-B
nCUBE    | nCUBE2 ×        | OPS       | 1017.0
Pyramid  | MISserver       | UNIFY     | 468.5
DEC      | VaxCluster 4 ×  | OPS       | 425.7
DEC      | VaxCluster 3 ×  | OPS       | 329.8
Sequent  | Symmetry 2000   | ORA       | 319.6
c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 8 / 81

Slide 9

Slide 9 text

Universal Computational Scalability (8) Amdahl’s law How not to write a formula (Hennessy & Patterson, p.30) How not to plot performance data (Wikipedia) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 9 / 81

Slide 13

Slide 13 text

Universal Computational Scalability (8) Universal scaling properties [Four capacity-vs-processes plots:] 1. Equal bang for the buck (α = 0, β = 0) — everybody wants this. 2. Diminishing returns (α > 0, β = 0) — everybody usually gets this. 3. Bottleneck limit with ceiling 1/α (α ≫ 0, β = 0) — everybody hates this. 4. Retrograde throughput, falling off like 1/N past the peak (α > 0, β > 0) — everybody thinks this never happens. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 10 / 81

Slide 22

Slide 22 text

Universal Computational Scalability (8) The universal scalability law (USL) N: processes provide system stimulus or load. XN: response function or relative capacity. Question: What kind of function? Answer: A rational function 5 (nonlinear): XN(α, β, γ) = γN / [1 + α(N − 1) + βN(N − 1)] Three Cs: 1 Concurrency (0 < γ < ∞) 2 Contention (0 < α < 1) 3 Coherency (0 < β < 1) 5 NJG. “A Simple Capacity Model of Massively Parallel Transaction Systems,” CMG Conference (1993) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 11 / 81
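As a quick aid for readers following along in R (the tool used for the regression analyses later in this deck), here is a minimal sketch of the USL response function exactly as written above; the parameter values in the example calls are illustrative only.

# Minimal R sketch of the USL response function defined above.
# alpha = contention, beta = coherency, gamma = concurrency (scale of X(1)).
usl <- function(N, alpha, beta, gamma = 1) {
  gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1))
}
usl(1:8, alpha = 0.05, beta = 0)      # Amdahl-like: diminishing returns
usl(1:8, alpha = 0.05, beta = 0.01)   # coherency term makes it retrograde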

Slide 26

Slide 26 text

Universal Computational Scalability (8) Measurement meets model Throughput data X(N)/X(1) −→ capacity metric CN ←− USL model N / [1 + α(N − 1) + βN(N − 1)] [Plot: relative capacity C(N) vs. processes N, showing linear scaling, Amdahl-like scaling, and retrograde scaling curves] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 12 / 81

Slide 27

Slide 27 text

Universal Computational Scalability (8) Why is coherency quadratic? The USL coherency term βN(N − 1) is O(N²). A fully connected graph KN with N nodes (e.g., K3, K6, K8, K16) has N-choose-2 ≡ N! / [2!(N − 2)!] = ½N(N − 1) edges (quadratic in N) that represent: communication between each pair of nodes 6; exchange of data or objects between processing nodes; the actual performance impact is captured by the magnitude of the β parameter. 6 This term can also be related to Brooks’ law (too many cooks increase delivery time) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 13 / 81
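A one-line numerical check of the edge count for the example graphs named above (K3, K6, K8, K16):

# Pairwise exchange paths in a fully connected graph K_N grow quadratically.
N <- c(3, 6, 8, 16)
choose(N, 2)       # N!/(2!(N-2)!) = 3, 15, 28, 120
N * (N - 1) / 2    # same values, written as N(N-1)/2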

Slide 31

Slide 31 text

Universal Computational Scalability (8) Coherency or consistency? [Screenshot: PODC Keynote, July 19, 2000 — The CAP Theorem: Consistency, Availability, Tolerance to network Partitions. Theorem: you can have at most two of these properties for any shared-data system] What the man said. [Venn diagram: C, A, P with pairwise overlaps CA, CP, AP] Correct Venn diagram c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 14 / 81

Slide 32

Slide 32 text

Universal Computational Scalability (8) Eventual consistency and monotonicity Consistency: partial order of replicas via a binary relation (posets). Hasse diagrams: replica events converge (like time) to the largest (cumulative) value. Monotonicity: any upward Hasse path terminates with the same result. Developer view is not a performance guarantee. [Hasse diagrams: a lattice of integers ordered by the binary max relation, and a lattice of sets ordered by the binary join relation, both converging upward over time] Logical monotonicity = monotonically increasing function. A monotonic application can exhibit monotonically decreasing scalability. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 15 / 81
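A toy sketch (my illustration, not from the slides) of the lattice idea in R: merging replica observations with an associative, commutative, idempotent join such as max converges to the same value in any delivery order.

# Toy illustration of join-based convergence (values are arbitrary examples).
events <- c(2, 4, 7, 9)          # observations arriving at different replicas
Reduce(max, events)              # merge in one order -> 9
Reduce(max, rev(events))         # merge in reverse order -> still 9
# The merge is monotone (never decreases), yet says nothing about scalability.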

Slide 33

Slide 33 text

Universal Computational Scalability (8) Determining the USL parameters Throughput measurements XN at various process loads N sourced from: 1 Load test harness, e.g., LoadRunner, JMeter 2 Production monitoring, e.g., Linux /proc, JMX interface Want to determine the α, β, γ that best model the XN data: XN(α, β, γ) = γN / [1 + α(N − 1) + βN(N − 1)] XN is a rational function (tricky). Brute force: Gene Amdahl 1967 (only α). Clever ways: 1 Optimization Solver in Excel 2 Nonlinear statistical regression in R c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 16 / 81
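A hedged sketch of option 2, nonlinear regression with base R nls(); the data frame here is hypothetical load-test output and the starting values are rough guesses.

# Hypothetical (N, X) load-test measurements; replace with your own data.
df <- data.frame(N = c(1, 2, 4, 8, 12, 16, 20, 24),
                 X = c(82, 158, 290, 480, 570, 600, 590, 560))
fit <- nls(X ~ gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1)),
           data  = df,
           start = list(alpha = 0.05, beta = 0.01, gamma = df$X[1]))
coef(fit)   # fitted alpha, beta, gamma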

Slide 34

Slide 34 text

Queueing Theory View of the USL (15) Outline 1 Introduction 2 Universal Computational Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 17 / 81

Slide 35

Slide 35 text

Queueing Theory View of the USL (15) Queueing theory for those who can’t wait Amdahl’s law for parallel speedup SA = T1/TN with N processors 7: SA(N) = N / [1 + α(N − 1)] (1) 1 Parameter 0 ≤ α ≤ 1 is the serial fraction of time spent running single-threaded 2 Subsumed by the USL model contention term (with γ = 1 and β = 0) 3 Msg: Forget linearity, you’re gonna hit a ceiling at lim N→∞ SA ∼ 1/α Some people (all named Frank?) consider Amdahl’s law to be ad hoc: Franco P. Preparata, “Should Amdahl’s Law Be Repealed?”, Proc. 6th International Symposium on Algorithms and Computation (ISAAC’95), Cairns, Australia, December 4–6, 1995. https://www.researchgate.net/publication/221543174_Should_Amdahl%27s_Law_Be_Repealed_Abstract Frank L. Devai, “The Refutation of Amdahl’s Law and Its Variants,” 17th International Conference on Computational Science and Applications (ICCSA), Trieste, Italy, July 2017. https://www.researchgate.net/publication/318648496_The_Refutation_of_Amdahl%27s_Law_and_Its_Variants 7 Gene Amdahl (1967) never wrote down this equation; he measured the mainframe workloads by brute force c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 18 / 81
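For reference, equation (1) and its ceiling in a couple of lines of R (the α value is just an example):

# Amdahl speedup, equation (1), approaches the ceiling 1/alpha as N grows.
amdahl <- function(N, alpha) N / (1 + alpha * (N - 1))
amdahl(c(1, 2, 16, 128, 1024), alpha = 0.05)
1 / 0.05   # ceiling = 20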

Slide 36

Slide 36 text

Queueing Theory View of the USL (15) [Plot: relative capacity C(N) vs. processes N, showing linear, Amdahl-like, and retrograde scaling curves] Top blue curve is Amdahl approaching a ceiling speedup C(N) ∼ 20, which also means α ∼ 0.05, i.e., the app is serialized 5% of the runtime c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 19 / 81

Slide 42

Slide 42 text

Queueing Theory View of the USL (15) Queue characterization [Diagrams: a familiar queue and an abstract queue — requests Arriving, Waiting (W), Servicing (S), Departing; Time: R = W + S; Space: Q = L + mρ] 1. Where is the service facility? 2. Did anything finish yet? 3. Is anything waiting? 4. Is anything arriving? c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 20 / 81

Slide 43

Slide 43 text

Queueing Theory View of the USL (15) Amdahl queueing model [Figure 1: Amdahl queueing model — processing nodes serviced by an interconnect network] Classical: Queueing model of assembly-line robots (LHS bubbles) Classical: Robots that fail go to service stations (RHS) for repairs Classical: Repair queues impact manufacturing “cycle time” Amdahl: Robots become a finite number of (distributed) processing nodes Amdahl: Repair stations become the interconnect network Amdahl: Network latency impacts application scalability c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 21 / 81

Slide 44

Slide 44 text

Queueing Theory View of the USL (15) Amdahl queueing theorem Theorem 1 (Gunther 2002) Amdahl’s law for parallel speedup is equivalent to the synchronous queueing bound on throughput in the machine repairman model of a multi-node computer system. Proof. 1 ”A New Interpretation of Amdahl’s Law and Geometric Scalability”, arXiv, Submitted 17 Oct (2002) 2 “Celebrity Boxing (and sizing): Alan Greenspan vs. Gene Amdahl,” 28th International Computer Measurement Group Conference, December 8–13, Reno, NV (2002) 3 “Unification of Amdahl’s Law, LogP and Other Performance Models for Message-Passing Architectures”, Proc. Parallel and Distributed Computer Systems (PDCS’05) November 14–16, 569–576, Phoenix, AZ (2005) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 22 / 81

Slide 45

Slide 45 text

Queueing Theory View of the USL (15) Queueing bounds on throughput [Figure 2: Average (blue) and synchronous (red) throughput X(N) vs. processing nodes N, bounded by the linear rise with slope 1/(R1 + Z), the ceiling Xmax = 1/Smax, the mean throughput Xmean(N) = N/(RN + Z), and the synchronous bound Xsync(N) = N/(N R1 + Z), which approaches 1/R1] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 23 / 81

Slide 46

Slide 46 text

Queueing Theory View of the USL (15) The speedup mapping 1 Z: mean processor execution time 2 Sk: service time at the kth (network) queue 3 R1: minimum round-trip time, i.e., Σk Sk + Z for N = 1 Amdahl speedup SA(N) is given by SA(N) = N · R1 / (N · Σk Sk + Z) — queueing form (2) SA(N) = N / [1 + α(N − 1)] — usual “ad hoc” law (3) Connecting (2) and (3) requires the identity α ≡ Σk Sk / R1, which → 0 as Sk → 0 (processor only) and → 1 as Z → 0 (network only) (4) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 24 / 81
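A quick numerical check (with made-up service and execution times) that the queueing form (2) and the ad hoc form (3) coincide once α is identified as in (4):

# Hypothetical values: two network queues S_k and processor time Z.
S  <- c(0.2, 0.3)
Z  <- 9.5
R1 <- sum(S) + Z              # round-trip time at N = 1
alpha <- sum(S) / R1          # identity (4)
N <- 1:10
S_queue <- (N * R1) / (N * sum(S) + Z)   # equation (2)
S_adhoc <- N / (1 + alpha * (N - 1))     # equation (3)
all.equal(S_queue, S_adhoc)              # TRUE: the two forms agree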

Slide 47

Slide 47 text

Queueing Theory View of the USL (15) USL queueing model [Figure 3: USL load-dependent queueing model — processing nodes serviced by a load-dependent interconnect] Amdahl: Robots become a finite number of (distributed) processing nodes Amdahl: Repair stations become the interconnect network Amdahl: Network latency impacts application scalability USL: Interconnect service time (latency) depends on queue length USL: Requests are first exchange sorted, O(N²) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 25 / 81

Slide 48

Slide 48 text

Queueing Theory View of the USL (15) USL load-dependent queueing Theorem 2 (Gunther 2008) The USL relative capacity is equivalent to the synchronous queueing bound on throughput for a linear load-dependent machine repairman model of a multi-node computer system. Proof. 1 Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services (Springer 2007) 2 A General Theory of Computational Scalability Based on Rational Functions, arXiv, Submitted on 11 Aug (2008) 3 Discrete-event simulations presented in “Getting in the Zone for Successful Scalability” Proc. Computer Measurement Group International Conference, December 7th–12, Las Vegas, NV (2008) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 26 / 81

Slide 49

Slide 49 text

Queueing Theory View of the USL (15) Experimental results [Plot: relative capacity C(N) vs. processing nodes N — the synchronized repairman simulation matches the USL model with α = 0.1, β = 0; the load-dependent repairman simulation matches the USL model with α = 0.1, β = 0.001] Discrete-event simulations (Jim Holtman 2008, 2018) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 27 / 81

Slide 50

Slide 50 text

Queueing Theory View of the USL (15) Key points for distributed USL Each USL denominator term has a physical meaning (not a purely statistical model): 1 Constant O(1) ⇒ maximum nodal concurrency (no queueing) 2 Linear O(N) ⇒ intra-nodal contention (synchronous queueing) 3 Quadratic O(N²) ⇒ inter-nodal exchange (load-dependent sync queueing) Useful for interpreting USL analysis for developers, engineers, architects, etc. Example 1 If β > α then look for inter-nodal messaging activity rather than the size of message queues in any particular node Worst-case queueing corresponds to an analytic equation (rational function) USL is agnostic about the application architecture and interconnect topology So where is all that information contained? (exercise for the reader) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 28 / 81

Slide 51

Slide 51 text

Distributed Application Scaling (22) Outline 1 Introduction 2 Universal Computational Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 29 / 81

Slide 52

Slide 52 text

Distributed Application Scaling (22) Memcached “Hidden Scalability Gotchas in Memcached and Friends” NJG, S. Subramanyam, and S. Parvu Sun Microsystems Presented at Velocity 2010 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 30 / 81

Slide 53

Slide 53 text

Distributed Application Scaling (22) Distributed scaling strategies Scaleup Scaleout c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 31 / 81

Slide 54

Slide 54 text

Distributed Application Scaling (22) Scaleout generations Distributed cache of key-value pairs Data pre-loaded from RDBMS backend Deploy memcache on previous generation CPUs (not always multicore) Single worker thread ok — until next hardware roll (multicore) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 32 / 81

Slide 55

Slide 55 text

Distributed Application Scaling (22) Controlled load-tests [Plot: throughput X(N) in KOPS vs. worker threads N] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 33 / 81

Slide 56

Slide 56 text

Distributed Application Scaling (22) Explains these memcached warnings Configuring the memcached server Threading is used to scale memcached across CPU’s. The model is by ‘‘worker threads’’, meaning that each thread handles concurrent connections. ... By default 4 threads are allocated. Setting it to very large values (80+) will make it run considerably slower. Linux man pages - memcached (1) -t Number of threads to use to process incoming requests. ... It is typically not useful to set this higher than the number of CPU cores on the memcached server. Setting a high number (64 or more) of worker threads is not recommended. The default is 4. c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 34 / 81

Slide 57

Slide 57 text

Distributed Application Scaling (22) Load test data in R c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 35 / 81

Slide 58

Slide 58 text

Distributed Application Scaling (22) R regression analysis c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 36 / 81

Slide 59

Slide 59 text

Distributed Application Scaling (22) USL scalability model [Plot: throughput X(N) in KOPS vs. worker threads N with fitted USL curve] α = 0.0468, β = 0.021016, γ = 84.89, Nmax = 6.73, Xmax = 274.87, Xroof = 1814.82 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 37 / 81

Slide 60

Slide 60 text

Distributed Application Scaling (22) Concurrency parameter [Plot: fitted USL curve with α = 0.0468, β = 0.021016, γ = 84.89, Nmax = 6.73, Xmax = 274.87, Xroof = 1814.82; inset: ideal linear scaling, α = 0, β = 0] 1 γ = 84.89 2 Slope of the linear bound in Kops/thread 3 Estimate of throughput X(1) = 84.89 Kops at N = 1 thread c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 38 / 81

Slide 61

Slide 61 text

Distributed Application Scaling (22) Contention parameter [Plot: fitted USL curve as before; inset: bottleneck limit 1/α with α ≫ 0, β = 0] α = 0.0468: waiting or queueing for resources about 4.6% of the time. Max possible throughput is X(1)/α = 1814.78 Kops (Xroof) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 39 / 81

Slide 62

Slide 62 text

Distributed Application Scaling (22) Coherency parameter [Plot: fitted USL curve as before; inset: retrograde throughput with α > 0, β > 0] β = 0.0210 corresponds to retrograde throughput. Distributed copies of data (e.g., caches) have to be exchanged/updated about 2.1% of the time to be consistent. Peak occurs at Nmax = √((1 − α)/β) = 6.73 threads c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 40 / 81
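The peak and ceiling quoted above follow directly from the fitted parameters; a short R check:

# Derive the peak and ceiling from this slide's fitted memcached parameters.
alpha <- 0.0468; beta <- 0.021016; gamma <- 84.89
usl   <- function(N) gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1))
Nmax  <- sqrt((1 - alpha) / beta)   # ~6.73 worker threads
Xmax  <- usl(Nmax)                  # ~275 Kops at the peak
Xroof <- gamma / alpha              # ~1814 Kops contention ceiling
c(Nmax = Nmax, Xmax = Xmax, Xroof = Xroof)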

Slide 63

Slide 63 text

Distributed Application Scaling (22) Improving scalability performance [Plot: speedup S(N) vs. threads N for memcached 1.2.8, 1.3.2, and 1.3.2 + patch] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 41 / 81

Slide 64

Slide 64 text

Distributed Application Scaling (22) Sirius and Zookeeper M. Bevilacqua-Linn, M. Byron, P. Cline, J. Moore and S. Muir (Comcast), “Sirius: Distributing and Coordinating Application Reference Data”, Usenix ATC 2014 P. Hunt, M. Konar, F. P. Junqueira and B. Reed (Yahoo), “ZooKeeper: Wait-free coordination for Internet-scale systems”, Usenix ATC 2010 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 42 / 81

Slide 68

Slide 68 text

Distributed Application Scaling (22) Coordination throughput data [Plot: writes per second vs. cluster size for Sirius, Sirius−NoBrain, Sirius−NoDisk, with USL model curve] It’s all downhill on the coordination ski slopes. But that’s very bad ... isn’t it? What does the USL say? c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 43 / 81

Slide 69

Slide 69 text

Distributed Application Scaling (22) USL scalability model [Plot: writes per second vs. cluster size for Sirius, Sirius−NoBrain, Sirius−NoDisk, with fitted USL curve] α = 0.05, β = 0.165142, γ = 1024.98, Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 44 / 81

Slide 70

Slide 70 text

Distributed Application Scaling (22) Concurrency parameter [Plot: fitted USL curve with α = 0.05, β = 0.165142, γ = 1024.98, Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54; inset: ideal linear scaling, α = 0, β = 0] 1 γ = 1024.98 2 Single node is meaningless (need N ≥ 3 for majority) 3 Interpret γ as N = 1 virtual throughput 4 USL estimates X(1) = 1,024.98 WPS (black square) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 45 / 81

Slide 71

Slide 71 text

Distributed Application Scaling (22) Contention parameter [Plot: fitted USL curve as before; inset: bottleneck limit 1/α with α ≫ 0, β = 0] α = 0.05: queueing for resources about 5% of the time. Max possible throughput is X(1)/α = 20,499.54 WPS (Xroof). But Xroof is not feasible in these systems c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 46 / 81

Slide 72

Slide 72 text

Distributed Application Scaling (22) Coherency parameter [Plot: fitted USL curve as before; inset: retrograde throughput with α > 0, β > 0] β = 0.1651 says retrograde throughput dominates! Distributed data being exchanged (compared?) about 16.5% of the time (virtual). Peak at Nmax = √((1 − α)/β) = 2.4 cluster nodes. Shocking, but that’s what Paxos-type scaling looks like c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 47 / 81

Slide 73

Slide 73 text

Distributed Application Scaling (22) Comparison of Sirius and Zookeeper [Left plot — Sirius, writes per second vs. cluster size: α = 0.05, β = 0.165142, γ = 1024.98, Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54] [Right plot — Zookeeper, requests per second vs. cluster size: α = 0.05, β = 0.157701, γ = 41536.96, Nmax = 2.45, Xmax = 62328.47, Xroof = 830739.1] Both coordination services exhibit equivalent USL scaling parameters: 5% contention delay — not “wait-free” (per the Yahoo paper title); 16% coherency delay despite Sirius being write-intensive and Zookeeper being read-intensive (i.e., 100x throughput) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 48 / 81
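The equivalence is easy to see from the fitted parameters alone; both systems peak at essentially the same tiny cluster size despite the ~100x difference in absolute throughput:

# Peak cluster size from each fit (parameters taken from the two plots above).
nmax <- function(alpha, beta) sqrt((1 - alpha) / beta)
nmax(0.05, 0.165142)   # Sirius:    ~2.40 nodes
nmax(0.05, 0.157701)   # Zookeeper: ~2.45 nodes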

Slide 74

Slide 74 text

Distributed Application Scaling (22) Distributed Ledger Scalability A. Charapko, A. Ailijiang and M. Demirbas “Bridging Paxos and Blockchain Consensus”, 2018 Team Rocket (Cornell), “Snowflake to Avalanche: A Novel Metastable Consensus Protocol Family for Cryptocurrencies”, 2018 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 49 / 81

Slide 75

Slide 75 text

Distributed Application Scaling (22) Avalanche protocol VISA can handle 60,000 TPS; Bitcoin bottlenecks at 7 TPS 8. That’s FOUR decades (powers of ten) of performance inferiority! Avalanche rumor-mongering coordination: Leaderless BFT; multiple samples of a small nodal neighborhood (ns); threshold > k ns when most of the system has “heard” the rumor; greener than Bitcoin (Nakamoto) — no PoW needed 8 K. Stinchcombe, “Ten years in, nobody has come up with a use for blockchain,” Hackernoon, Dec 22, 2017 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 50 / 81

Slide 76

Slide 76 text

Distributed Application Scaling (22) Avalanche scalability [Plot: throughput (TPS) vs. network nodes with fitted USL curve] α = 0.9, β = 6.5e−05, γ = 1648.27, Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 51 / 81

Slide 77

Slide 77 text

Distributed Application Scaling (22) Comparison with Paxos 0 500 1000 1500 2000 0 500 1000 1500 2000 Network nodes Throughput (TPS) α = 0.9 β = 6.5e−05 γ = 1648.27 Npeak = 39.15 Xpeak = 1821.21 Xroof = 1831.41 0 500 1000 1500 Cluster size Writes per second 1 3 5 7 9 11 13 15 Sirius Sirius−NoBrain Sirius−NoDisk USL model α = 0.05 β = 0.165142 γ = 1024.98 Nmax = 2.4 Xmax = 1513.93 Xroof = 20499.54 Contention: α about 90% X(1)/α ⇒ immediate saturation at about 2000 TPS Coherency: β extremely small but not zero Much slower ‘linear’ degradation than Zookeeper/Paxos Saturation throughput maintained over 2000 nodes (100x Sirius) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 52 / 81

Slide 78

Slide 78 text

Distributed Application Scaling (22) Hadoop Super Scaling NJG, P. J. Puglia, and K. Tomasette, “Hadoop Superlinear Scalability: The perpetual motion of parallel performance”, Comm. ACM, Vol. 58 No. 4, 46–55 (2015) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 53 / 81

Slide 79

Slide 79 text

Distributed Application Scaling (22) Super-scalability is a thing Superlinear Linear Sublinear Processors Speedup More than 100% efficient! c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 54 / 81

Slide 80

Slide 80 text

Distributed Application Scaling (22) Seeing is believing “Speedup in Parallel Contexts.” http://en.wikipedia.org/wiki/Speedup#Speedup_in_Parallel_Contexts “Where does super-linear speedup come from?” http://stackoverflow.com/questions/4332967/where-does-super-linear-speedup-come-from “Sun Fire X2270 M2 Super-Linear Scaling of Hadoop TeraSort and CloudBurst Benchmarks.” https://blogs.oracle.com/BestPerf/entry/20090920_x2270m2_hadoop Haas, R. “Scalability, in Graphical Form, Analyzed.” http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html Sutter, H. 2008. “Going Superlinear.” Dr. Dobb’s Journal 33(3), March. http://www.drdobbs.com/cpp/going-superlinear/206100542 Sutter, H. 2008. “Super Linearity and the Bigger Machine.” Dr. Dobb’s Journal 33(4), April. http://www.drdobbs.com/parallel/super-linearity-and-the-bigger-machine/206903306 “SDN analytics and control using sFlow standard — Superlinear.” http://blog.sflow.com/2010/09/superlinear.html Eijkhout, V. 2014. Introduction to High Performance Scientific Computing. Lulu.com c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 55 / 81

Slide 81

Slide 81 text

Distributed Application Scaling (22) Empirical evidence 16-node Sun Fire X2270 M2 cluster c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 56 / 81

Slide 82

Slide 82 text

Distributed Application Scaling (22) Smells like perpetual motion c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 57 / 81

Slide 83

Slide 83 text

Distributed Application Scaling (22) Terasort cluster emulations Want a controlled environment to study superlinearity. The Terasort workload sorts 1 TB of data in parallel and has benchmarked Hadoop MapReduce performance. We used just 100 GB data input (weren’t benchmarking anything). Simulate in AWS cloud EC2 instances (more flexible and much cheaper). Enabled multiple runs in parallel (thank you Comcast for cycle $s).
Table 1: Amazon EC2 Configurations
Instance | Optimized for | Processor Arch | vCPU number | Memory (GiB) | Instance Storage (GB) | Network Performance
BigMem   | Memory        | 64-bit         | 4           | 34.2         | 1 x 850               | Moderate
BigDisk  | Compute       | 64-bit         | 8           | 7            | 4 x 420               | High
c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 58 / 81

Slide 84

Slide 84 text

Distributed Application Scaling (22) USL analysis of BigMem N ≤ 50 nodes [Plot: USL analysis of BigMem Hadoop Terasort, speedup S(N) vs. EC2 m2 nodes N] α = −0.0288, β = 0.000447, Nmax = 47.96, Smax = 73.48, Sroof = NA, Ncross = 64.5 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 59 / 81

Slide 85

Slide 85 text

Distributed Application Scaling (22) USL analysis of BigMem N ≤ 150 nodes 0 50 100 150 0 50 100 150 USL Model of BigMem Hadoop TS Data EC2 m2 nodes (N) Speedup S(N) α = −0.0089 β = 9e−05 Nmax = 105.72 Smax = 99.53 Sroof = N A Ncross = 99.14 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 60 / 81

Slide 86

Slide 86 text

Distributed Application Scaling (22) Speedup on N ≤ 10 BigMem Nodes (1 disk) 0 5 10 15 0 5 10 15 BigMem Hadoop Terasort Speedup Data EC2 m2 nodes (N) Speedup S(N) Superlinear region Sublinear region c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 61 / 81

Slide 87

Slide 87 text

Distributed Application Scaling (22) Speedup on N ≤ 10 BigDisk Nodes (4 disks) 0 5 10 15 0 5 10 15 BigDisk Hadoop Terasort Speedup Data EC2 c1 nodes (N) Speedup S(N) Superlinear region Sublinear region c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 62 / 81

Slide 88

Slide 88 text

Distributed Application Scaling (22) Superlinear payback trap Superlinearity is real (i.e., measurable) But due to a hidden resource constraint (disk or memory subsystem sizing) Sets the wrong normalization for the speedup scale Shows up in the USL as a negative α value (cf. GCAP book) Theorem 3 (Gunther 2013) The apparent advantage of superlinear scaling (α < 0) is inevitably lost due to crossover into a region of severe performance degradation (β > 0) that can be worse than if superlinearity had not been present in the first place. Proof was presented at Hotsos 2013. [Plots: speedup vs. processors — superlinear, linear, and sublinear curves, and the superlinear payback crossover] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 63 / 81
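A small R illustration of the payback effect using the BigMem fit shown earlier (α = −0.0288, β = 0.000447): the speedup exceeds N in the superlinear region, then crosses below linear near Ncross ≈ 64.5 and keeps degrading.

# Negative alpha produces superlinear speedup at small N, but the beta term
# eventually drags the curve below linear (the payback).
usl <- function(N, alpha, beta) N / (1 + alpha * (N - 1) + beta * N * (N - 1))
N <- c(2, 10, 48, 65, 100, 150)
S <- usl(N, alpha = -0.0288, beta = 0.000447)
round(S, 1)
S > N     # TRUE, TRUE, TRUE, FALSE, FALSE, FALSE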

Slide 89

Slide 89 text

Distributed in the Cloud (30) Outline 1 Introduction 2 Universal Computational Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 64 / 81

Slide 90

Slide 90 text

Distributed in the Cloud (30) AWS Cloud Application “Exposing the Cost of Performance Hidden in the Cloud” NJG and M. Chawla Presented at CMG cloudXchange 2018 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 65 / 81

Slide 91

Slide 91 text

Distributed in the Cloud (30) Production data Previously measured X and R directly on a test rig.
Table 2: Converting data to performance metrics
Data | Meaning         | Metric          | Meaning
T    | Elapsed time    | X = C/T         | Throughput
Tp   | Processing time | R = (Tp/T)(T/C) | Response time
C    | Completed work  | N = X × R       | Concurrent threads
Ucpu | CPU utilization | S = Ucpu/X      | Service time
Example 4 (Coalesced metrics) Linux epoch timestamp; interval between rows is 300 seconds
Timestamp, X, N, S, R, U_cpu
1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
...
c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 66 / 81
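A hedged sketch of the Table 2 conversions in R, applied to numbers approximating the first sample row above (C and Tp are back-calculated here, so treat them as illustrative):

# One hypothetical 300-second sample interval.
T_el <- 300         # elapsed time (s)
C    <- 150651      # completed requests in the interval (illustrative)
Tp   <- 50730       # accumulated processing time (s, illustrative)
Ucpu <- 0.458       # measured CPU utilization
X <- C / T_el                  # throughput  ~502 RPS
R <- (Tp / T_el) * (T_el / C)  # response time = Tp/C  ~0.337 s
N <- X * R                     # concurrent threads (Little's law)  ~169
S <- Ucpu / X                  # service time  ~0.0009 s
c(X = X, R = R, N = N, S = S)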

Slide 92

Slide 92 text

Distributed in the Cloud (30) Tomcat data from AWS [Plot: throughput (RPS) vs. Tomcat threads] c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 67 / 81

Slide 93

Slide 93 text

Distributed in the Cloud (30) USL nonlinear analysis [Plot: throughput (RPS) vs. Tomcat threads with fitted USL curve] α = 0, β = 3e−06, γ = 3, Nmax = 539.2, Xmax = 809.55, Nopt = 274.8 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 68 / 81

Slide 94

Slide 94 text

Distributed in the Cloud (30) Concurrency parameter 0 100 200 300 400 500 0 200 400 600 800 Tomcat threads Throughput (RPS) α = 0 β = 3e−06 γ = 3 Nmax = 539.2 Xmax = 809.55 Nopt = 274.8 α = 0 β = 0 Processes Capacity 1 γ = 3.0 2 Smallest number of threads during 24 hr sample is N > 100 3 Nonetheless USL estimates throughput X(1) = 3 RPS c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 69 / 81

Slide 95

Slide 95 text

Distributed in the Cloud (30) Contention parameter 0 100 200 300 400 500 0 200 400 600 800 Tomcat threads Throughput (RPS) α = 0 β = 3e−06 γ = 3 Nmax = 539.2 Xmax = 809.55 Nopt = 274.8 1/α α >> 0 β = 0 Processes Capacity α = 0 No significant waiting or queueing Max possible throughput Xroof not defined c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 70 / 81

Slide 96

Slide 96 text

Distributed in the Cloud (30) Coherency parameter 0 100 200 300 400 500 0 200 400 600 800 Tomcat threads Throughput (RPS) α = 0 β = 3e−06 γ = 3 Nmax = 539.2 Xmax = 809.55 Nopt = 274.8 1/α 1/N α > 0 β > 0 Processes Capacity β = 3 × 10−6 implies very weak retrograde throughput Extremely little data exchange But entirely responsible for sublinearity And peak throughput Xmax = 809.55 RPS Peak occurs at Nmax = 539.2 threads c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 71 / 81

Slide 97

Slide 97 text

Distributed in the Cloud (30) Revised USL analysis Parallel threads implies linear scaling Linear slope γ ∼ 3: γ = 2.65 Should be no contention, i.e., α = 0 Discontinuity at N ∼ 275 threads Throughput plateaus, i.e., β = 0 Saturation occurs at processor utilization UCPU ≥ 75% Linux OS can’t do that! Pseudo-saturation due to AWS Auto Scaling policy (hypervisor?) Many EC2 instances spun up and down during 24 hrs c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 72 / 81

Slide 98

Slide 98 text

Distributed in the Cloud (30) Corrected USL linear model 0 100 200 300 400 500 0 200 400 600 800 Tomcat threads Throughput (RPS) α = 0 β = 0 γ = 2.65 Nmax = NaN Xmax = 727.03 Nopt = 274.8 Parallel threads Pseudo−saturation c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 73 / 81

Slide 99

Slide 99 text

Optimal Node Sizing (38) Outline 1 Introduction 2 Universal Computational Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 74 / 81

Slide 100

Slide 100 text

Optimal Node Sizing (38) 1. Canonical sublinear scalability [Plot: throughput X(N) vs. processes N] α = 0.05, β = 6e−05, γ = 20.8563, Nmax = 125.5, Xmax = 320.47, Xroof = 417.13 Amdahl curve (dotted) Linear scaling bound (LHS red) Saturation asymptote (top red) Nopt = 1/α = 20 at their intersection USL curve (solid blue) is retrograde Nopt shifts right to Nmax = 125 Throughput is lower than Xroof c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 75 / 81
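The two sizing points on this slide come straight from the parameters; as a quick check in R:

# Nopt from the Amdahl bounds, Nmax from the full USL fit on this slide.
alpha <- 0.05; beta <- 6e-05; gamma <- 20.8563
Nopt  <- 1 / alpha                  # 20: linear bound meets the saturation asymptote
Nmax  <- sqrt((1 - alpha) / beta)   # ~125 (cf. Nmax = 125.5 above): retrograde peak
Xroof <- gamma / alpha              # ~417: contention ceiling
c(Nopt = Nopt, Nmax = Nmax, Xroof = Xroof)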

Slide 101

Slide 101 text

Optimal Node Sizing (38) 2. Memcache scalability 0 5 10 15 20 25 0 500 1000 1500 2000 Worker threads (N) Throughput X(N) α = 0.0468 β = 0.021016 γ = 84.89 Nmax = 6.73 Xmax = 274.87 Nopt = 21.38 0 10 20 30 40 50 0 100 200 300 400 500 Worker processes (N) Throughput X(N) α = 0 β = 0.00052 γ = 17.76 Npeak = 43.87 Xpeak = 394.06 Nopt = Inf Initial scalability (green curve) Nmax occurs before Nopt Nmax = 6.73 Nopt = 1/0.0468 = 21.36752 Sun SPARC patch (blue curve) Nmax = 43.87 Now exceeds previous Nopt And throughput increased by approx 50% c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 76 / 81

Slide 102

Slide 102 text

Optimal Node Sizing (38) 3. Cloud scalability 0 100 200 300 400 500 0 200 400 600 800 Tomcat threads Throughput (RPS) α = 0 β = 0 γ = 2.65 Nmax = NaN Xmax = 727.03 Nopt = 274.8 Parallel threads Pseudo−saturation 0 100 200 300 400 500 0.0 0.2 0.4 0.6 0.8 1.0 Tomcat processes (N) Response time R(N) α = 0 β = 0 γ = 2.65 Rmin = 0.38 Sbnk = 0.001365 Nopt = 274.8 Throughput profile X(N) data tightly follow bounds Nopt = 274.8 due to autoscaling policy Significant dispersion about mean X Response time profile Autoscaling throttles throughput Causes R(N) to increase linearly Will users tolerate R(N) > Nopt ? c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 77 / 81

Slide 103

Slide 103 text

Optimal Node Sizing (38) 4. DLT scalability 0 20000 40000 60000 80000 Cluster size Reqs per second 1 3 5 7 9 11 13 15 Mean CI_lo CI_hi USL model α = 0.05 β = 0.157701 γ = 41536.96 Nmax = 2.45 Xmax = 62328.47 Xroof = 830739.1 0 500 1000 1500 2000 0 500 1000 1500 2000 Network nodes Throughput (TPS) α = 0.9 β = 6.5e−05 γ = 1648.27 Npeak = 39.15 Xpeak = 1821.21 Xroof = 1831.41 Throughput profile Optimum is the smallest configuration Likely bigger in real applications Coordination kills Scaling degrades (that’s how it works!) No good for DLT Throughput profile There is no optimum Throughput similar to Zookeeper peak Pick a number N Throughput is independent of scale But N scale is 100x Zookeeper Generally needed for DLT c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 78 / 81

Slide 104

Slide 104 text

Summary (45) Outline 1 Introduction 2 Universal Computational Scalability (8) 3 Queueing Theory View of the USL (15) 4 Distributed Application Scaling (22) 5 Distributed in the Cloud (30) 6 Optimal Node Sizing (38) 7 Summary (45) c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 79 / 81

Slide 105

Slide 105 text

Summary (45) Summary All performance is nonlinear Need both data and models (e.g., USL) USL is a nonlinear (rational) function USL is architecture agnostic All scalability information contained in parameters: 1 α: contention, queueing 2 β: coherency, consistency (pairwise data exchange) 3 γ: concurrency, X(1) value Apply USL via nonlinear statistical regression CAP-like consistency associated with magnitude of quadratic β term Can apply USL to both controlled (testing) and production platforms USL analysis often different from paper designs and algorithms c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 80 / 81

Slide 106

Slide 106 text

Summary (45) Thank you for the (virtual) invitation! www.perfdynamics.com Castro Valley, California Twitter twitter.com/DrQz Facebook facebook.com/PerformanceDynamics Blog perfdynamics.blogspot.com Training classes perfdynamics.com/Classes [email protected] +1-510-537-5758 c 2019 Performance Dynamics Applying The Universal Scalability Law to Distributed Systems March 13, 2019 81 / 81